2023-06-17 16:35:02,085 INFO [train.py:1064] (2/4) Training started 2023-06-17 16:35:02,086 INFO [train.py:1074] (2/4) Device: cuda:2 2023-06-17 16:35:04,198 INFO [lexicon.py:168] (2/4) Loading pre-compiled data/lang_char/Linv.pt 2023-06-17 16:35:04,466 INFO [train.py:1085] (2/4) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.1', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'c51a0b9684442a88ee37f3ce0af686a04b66855b', 'k2-git-date': 'Mon May 1 21:38:03 2023', 'lhotse-version': '1.14.0.dev+git.0f812851.dirty', 'torch-version': '1.10.0+cu102', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.8', 'icefall-git-branch': 'zipformer_wenetspeech', 'icefall-git-sha1': '802bf98-dirty', 'icefall-git-date': 'Fri Jun 16 18:26:55 2023', 'icefall-path': '/star-kw/kangwei/code/icefall_wenetspeech', 'k2-path': '/ceph-hw/kangwei/code/k2_release/k2/k2/python/k2/__init__.py', 'lhotse-path': '/ceph-hw/kangwei/dev_tools/anaconda3/envs/rnnt2/lib/python3.8/site-packages/lhotse-1.14.0.dev0+git.0f812851.dirty-py3.8.egg/lhotse/__init__.py', 'hostname': 'de-74279-k2-train-3-0423201227-84b4557756-8lx4n', 'IP address': '10.177.6.147'}, 'world_size': 4, 'master_port': 12537, 'tensorboard': True, 'num_epochs': 12, 'start_epoch': 1, 'start_batch': 0, 'exp_dir': PosixPath('zipformer/exp_L_small_causal'), 'lang_dir': PosixPath('data/lang_char'), 'base_lr': 0.045, 'lr_batches': 7500, 'lr_epochs': 1.5, 'ref_duration': 600, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 4000, 'keep_last_k': 30, 'average_period': 200, 'use_fp16': True, 'num_encoder_layers': '2,2,2,2,2,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,768,768,768,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,256,256,256,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,192,192,192,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16,32,64,-1', 'left_context_frames': '64,128,256,-1', 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 900, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537} 2023-06-17 16:35:04,467 INFO [train.py:1087] (2/4) About to create model 2023-06-17 16:35:05,038 INFO [train.py:1091] (2/4) Number of model parameters: 32669302 2023-06-17 16:35:10,018 INFO [train.py:1106] (2/4) Using DDP 2023-06-17 16:35:10,597 INFO [asr_datamodule.py:390] (2/4) About to get train cuts 2023-06-17 16:35:10,600 INFO [asr_datamodule.py:398] (2/4) About to get dev cuts 2023-06-17 16:35:10,601 INFO [asr_datamodule.py:211] (2/4) About to get Musan cuts 2023-06-17 16:35:13,405 INFO [asr_datamodule.py:216] (2/4) Enable MUSAN 2023-06-17 16:35:13,405 INFO [asr_datamodule.py:239] (2/4) Enable SpecAugment 2023-06-17 16:35:13,405 INFO [asr_datamodule.py:240] (2/4) Time warp factor: 80 2023-06-17 16:35:13,406 INFO 
[asr_datamodule.py:250] (2/4) Num frame mask: 10 2023-06-17 16:35:13,406 INFO [asr_datamodule.py:263] (2/4) About to create train dataset 2023-06-17 16:35:13,406 INFO [asr_datamodule.py:289] (2/4) Using DynamicBucketingSampler. 2023-06-17 16:35:18,362 INFO [asr_datamodule.py:305] (2/4) About to create train dataloader 2023-06-17 16:35:18,363 INFO [asr_datamodule.py:336] (2/4) About to create dev dataset 2023-06-17 16:35:19,207 INFO [asr_datamodule.py:354] (2/4) About to create dev dataloader 2023-06-17 16:37:08,030 INFO [train.py:996] (2/4) Epoch 1, batch 0, loss[loss=10.5, simple_loss=9.53, pruned_loss=9.7, over 21828.00 frames. ], tot_loss[loss=10.5, simple_loss=9.53, pruned_loss=9.7, over 21828.00 frames. ], batch size: 118, lr: 2.25e-02, grad_scale: 1.0 2023-06-17 16:37:08,030 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-17 16:37:25,629 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=10.9, simple_loss=9.897, pruned_loss=10.04, over 1796401.00 frames. 2023-06-17 16:37:25,630 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 22880MB 2023-06-17 16:37:37,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=0.0, ans=4.0 2023-06-17 16:37:44,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.07 vs. limit=5.03 2023-06-17 16:38:31,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=134.89 vs. limit=7.5675 2023-06-17 16:38:46,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=180.0, ans=0.19325 2023-06-17 16:38:52,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=255.49 vs. limit=7.68 2023-06-17 16:39:11,378 INFO [train.py:996] (2/4) Epoch 1, batch 50, loss[loss=1.304, simple_loss=1.157, pruned_loss=1.309, over 21290.00 frames. ], tot_loss[loss=4.097, simple_loss=3.784, pruned_loss=3.076, over 950839.14 frames. ], batch size: 159, lr: 2.48e-02, grad_scale: 0.5 2023-06-17 16:39:23,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=300.0, ans=0.18875 2023-06-17 16:39:32,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=244.58 vs. limit=7.635 2023-06-17 16:39:33,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=86.79 vs. limit=4.144 2023-06-17 16:39:37,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=97.71 vs. limit=5.18 2023-06-17 16:39:54,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=420.0, ans=0.18425000000000002 2023-06-17 16:40:30,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=190.95 vs. limit=5.24 2023-06-17 16:40:34,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=118.28 vs. 
limit=5.0 2023-06-17 16:40:52,304 INFO [train.py:996] (2/4) Epoch 1, batch 100, loss[loss=1.279, simple_loss=1.116, pruned_loss=1.319, over 21566.00 frames. ], tot_loss[loss=2.564, simple_loss=2.336, pruned_loss=2.122, over 1685255.44 frames. ], batch size: 414, lr: 2.70e-02, grad_scale: 1.0 2023-06-17 16:40:56,010 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.080e+02 2.341e+02 3.851e+02 6.975e+03 2.847e+04, threshold=7.702e+02, percent-clipped=0.0 2023-06-17 16:40:58,430 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 16:41:08,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=4.24 2023-06-17 16:41:22,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=36.71 vs. limit=5.165 2023-06-17 16:41:30,855 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=48.28 vs. limit=7.77 2023-06-17 16:41:33,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=720.0, ans=0.04775 2023-06-17 16:41:35,614 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 16:41:36,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=34.44 vs. limit=7.77 2023-06-17 16:41:44,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=780.0, ans=0.0475625 2023-06-17 16:41:52,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=70.38 vs. limit=8.085 2023-06-17 16:41:58,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=780.0, ans=0.2922 2023-06-17 16:42:24,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=840.0, ans=0.460625 2023-06-17 16:42:32,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=43.84 vs. limit=7.815 2023-06-17 16:42:37,172 INFO [train.py:996] (2/4) Epoch 1, batch 150, loss[loss=0.8717, simple_loss=0.7433, pruned_loss=0.9315, over 21824.00 frames. ], tot_loss[loss=1.996, simple_loss=1.793, pruned_loss=1.768, over 2264191.98 frames. ], batch size: 98, lr: 2.93e-02, grad_scale: 1.0 2023-06-17 16:42:48,692 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=141.15 vs. limit=5.45 2023-06-17 16:42:52,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=44.51 vs. limit=7.8375 2023-06-17 16:43:01,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.18 vs. 
limit=8.22 2023-06-17 16:43:07,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=4.384 2023-06-17 16:43:14,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.14 vs. limit=5.51 2023-06-17 16:43:18,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=53.45 vs. limit=7.8825 2023-06-17 16:43:24,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1020.0, ans=0.16175 2023-06-17 16:43:47,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1080.0, ans=0.44937499999999997 2023-06-17 16:43:49,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=36.25 vs. limit=7.905 2023-06-17 16:43:51,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.04 vs. limit=8.31 2023-06-17 16:44:12,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=33.75 vs. limit=7.9275 2023-06-17 16:44:19,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1140.0, ans=0.092875 2023-06-17 16:44:25,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=94.29 vs. limit=7.95 2023-06-17 16:44:26,076 INFO [train.py:996] (2/4) Epoch 1, batch 200, loss[loss=1.048, simple_loss=0.8991, pruned_loss=1.02, over 21789.00 frames. ], tot_loss[loss=1.679, simple_loss=1.494, pruned_loss=1.527, over 2692148.99 frames. ], batch size: 316, lr: 3.15e-02, grad_scale: 2.0 2023-06-17 16:44:27,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=61.79 vs. limit=7.95 2023-06-17 16:44:29,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 7.013e+01 1.220e+02 1.520e+02 2.087e+02 3.052e+02, threshold=3.040e+02, percent-clipped=0.0 2023-06-17 16:44:58,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=13.77 vs. limit=5.315 2023-06-17 16:45:02,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=7.995 2023-06-17 16:45:27,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.11 vs. limit=8.49 2023-06-17 16:45:27,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=16.05 vs. 
limit=7.995 2023-06-17 16:45:35,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1380.0, ans=0.4353125 2023-06-17 16:45:38,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.54 vs. limit=8.535 2023-06-17 16:45:56,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=31.42 vs. limit=8.0175 2023-06-17 16:46:15,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=29.86 vs. limit=8.0625 2023-06-17 16:46:16,429 INFO [train.py:996] (2/4) Epoch 1, batch 250, loss[loss=0.9911, simple_loss=0.8512, pruned_loss=0.9079, over 21918.00 frames. ], tot_loss[loss=1.474, simple_loss=1.301, pruned_loss=1.351, over 3036000.99 frames. ], batch size: 316, lr: 3.38e-02, grad_scale: 2.0 2023-06-17 16:46:19,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1500.0, ans=0.14375 2023-06-17 16:46:25,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.79 vs. limit=8.625 2023-06-17 16:46:30,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1500.0, ans=3.225 2023-06-17 16:46:35,643 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=22.45 vs. limit=8.085 2023-06-17 16:46:38,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1560.0, ans=0.426875 2023-06-17 16:46:42,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=28.65 vs. limit=8.085 2023-06-17 16:46:55,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=16.44 vs. limit=8.1075 2023-06-17 16:47:22,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.18 vs. limit=8.76 2023-06-17 16:47:46,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1740.0, ans=5.87 2023-06-17 16:47:52,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1740.0, ans=0.4184375 2023-06-17 16:47:54,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1740.0, ans=0.13474999999999998 2023-06-17 16:48:03,338 INFO [train.py:996] (2/4) Epoch 1, batch 300, loss[loss=1.06, simple_loss=0.9009, pruned_loss=0.9654, over 21736.00 frames. ], tot_loss[loss=1.323, simple_loss=1.16, pruned_loss=1.212, over 3305845.15 frames. 
], batch size: 298, lr: 3.60e-02, grad_scale: 4.0 2023-06-17 16:48:05,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1800.0, ans=0.415625 2023-06-17 16:48:07,108 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.171e+01 1.173e+02 1.354e+02 1.820e+02 4.361e+02, threshold=2.708e+02, percent-clipped=2.0 2023-06-17 16:48:13,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=24.12 vs. limit=8.175 2023-06-17 16:48:18,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1800.0, ans=0.282 2023-06-17 16:48:43,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1920.0, ans=0.8328 2023-06-17 16:48:45,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=8.94 2023-06-17 16:49:09,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1980.0, ans=6.2375 2023-06-17 16:49:22,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1980.0, ans=0.2525 2023-06-17 16:49:37,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.76 vs. limit=5.51 2023-06-17 16:49:42,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=9.03 2023-06-17 16:49:43,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.01 vs. limit=5.51 2023-06-17 16:49:44,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=2040.0, ans=9.03 2023-06-17 16:49:48,620 INFO [train.py:996] (2/4) Epoch 1, batch 350, loss[loss=0.8652, simple_loss=0.7285, pruned_loss=0.7777, over 21471.00 frames. ], tot_loss[loss=1.207, simple_loss=1.051, pruned_loss=1.101, over 3522900.31 frames. ], batch size: 144, lr: 3.83e-02, grad_scale: 4.0 2023-06-17 16:50:03,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=36.95 vs. limit=8.2875 2023-06-17 16:50:25,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2220.0, ans=0.3959375 2023-06-17 16:50:59,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=9.21 2023-06-17 16:51:10,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.45 vs. limit=9.21 2023-06-17 16:51:17,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. 
limit=8.355 2023-06-17 16:51:25,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2340.0, ans=0.2766 2023-06-17 16:51:26,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2340.0, ans=0.11225 2023-06-17 16:51:34,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=45.94 vs. limit=8.3775 2023-06-17 16:51:36,717 INFO [train.py:996] (2/4) Epoch 1, batch 400, loss[loss=0.7929, simple_loss=0.6648, pruned_loss=0.6945, over 21549.00 frames. ], tot_loss[loss=1.119, simple_loss=0.9683, pruned_loss=1.014, over 3683232.16 frames. ], batch size: 263, lr: 4.05e-02, grad_scale: 8.0 2023-06-17 16:51:37,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2400.0, ans=0.3875 2023-06-17 16:51:37,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2400.0, ans=0.8160000000000001 2023-06-17 16:51:37,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=4.96 2023-06-17 16:51:40,358 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 8.615e+01 1.452e+02 1.814e+02 2.451e+02 4.544e+02, threshold=3.628e+02, percent-clipped=11.0 2023-06-17 16:51:40,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=2400.0, ans=0.23600000000000002 2023-06-17 16:51:47,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2400.0, ans=0.11 2023-06-17 16:52:00,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=8.4225 2023-06-17 16:52:02,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=8.4225 2023-06-17 16:52:12,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=5.008 2023-06-17 16:52:16,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.24 vs. limit=5.63 2023-06-17 16:53:00,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.26 vs. limit=8.4675 2023-06-17 16:53:10,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2640.0, ans=0.2736 2023-06-17 16:53:12,937 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=5.056 2023-06-17 16:53:26,164 INFO [train.py:996] (2/4) Epoch 1, batch 450, loss[loss=0.7841, simple_loss=0.66, pruned_loss=0.6553, over 21161.00 frames. ], tot_loss[loss=1.063, simple_loss=0.9137, pruned_loss=0.9537, over 3811267.58 frames. 
], batch size: 548, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 16:53:51,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2760.0, ans=0.037899999999999996 2023-06-17 16:53:52,002 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=8.535 2023-06-17 16:53:52,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=11.87 vs. limit=8.535 2023-06-17 16:54:00,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.27 vs. limit=8.557500000000001 2023-06-17 16:54:19,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2820.0, ans=0.3678125 2023-06-17 16:54:26,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2820.0, ans=0.3678125 2023-06-17 16:54:40,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2880.0, ans=0.365 2023-06-17 16:54:47,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.73 vs. limit=9.66 2023-06-17 16:54:50,612 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.82 vs. limit=9.66 2023-06-17 16:55:03,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2940.0, ans=0.1325 2023-06-17 16:55:06,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=5.176 2023-06-17 16:55:15,298 INFO [train.py:996] (2/4) Epoch 1, batch 500, loss[loss=0.7726, simple_loss=0.6462, pruned_loss=0.6371, over 21643.00 frames. ], tot_loss[loss=1.038, simple_loss=0.8872, pruned_loss=0.9173, over 3905401.21 frames. ], batch size: 264, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 16:55:19,006 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 9.969e+01 1.768e+02 2.484e+02 3.323e+02 7.392e+02, threshold=4.968e+02, percent-clipped=16.0 2023-06-17 16:55:40,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=3060.0, ans=0.1175 2023-06-17 16:56:12,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=10.82 vs. limit=8.67 2023-06-17 16:56:40,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=18.73 vs. limit=8.692499999999999 2023-06-17 16:57:02,943 INFO [train.py:996] (2/4) Epoch 1, batch 550, loss[loss=0.9078, simple_loss=0.7723, pruned_loss=0.6951, over 21377.00 frames. ], tot_loss[loss=1.012, simple_loss=0.862, pruned_loss=0.8762, over 3989449.83 frames. 
], batch size: 194, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 16:57:10,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=3300.0, ans=8.7375 2023-06-17 16:57:36,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=5.84 2023-06-17 16:58:41,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=5.416 2023-06-17 16:58:44,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=3540.0, ans=0.33406250000000004 2023-06-17 16:58:50,987 INFO [train.py:996] (2/4) Epoch 1, batch 600, loss[loss=0.7814, simple_loss=0.6744, pruned_loss=0.5625, over 21629.00 frames. ], tot_loss[loss=0.9853, simple_loss=0.8381, pruned_loss=0.8312, over 4054637.90 frames. ], batch size: 263, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 16:58:54,335 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.505e+02 2.961e+02 3.893e+02 6.488e+02 1.570e+03, threshold=7.787e+02, percent-clipped=36.0 2023-06-17 16:59:08,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=3660.0, ans=0.2549 2023-06-17 16:59:21,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=3660.0, ans=0.06274999999999997 2023-06-17 16:59:42,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=3720.0, ans=0.01630000000000001 2023-06-17 17:00:11,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=3780.0, ans=0.014949999999999991 2023-06-17 17:00:12,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=8.9175 2023-06-17 17:00:21,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=8.94 2023-06-17 17:00:36,759 INFO [train.py:996] (2/4) Epoch 1, batch 650, loss[loss=0.9328, simple_loss=0.8018, pruned_loss=0.6654, over 21682.00 frames. ], tot_loss[loss=0.9515, simple_loss=0.8098, pruned_loss=0.7815, over 4107411.35 frames. ], batch size: 414, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:00:47,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=3900.0, ans=0.261 2023-06-17 17:01:08,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=8.985 2023-06-17 17:01:11,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=3960.0, ans=0.31437499999999996 2023-06-17 17:01:20,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=4020.0, ans=0.7593000000000001 2023-06-17 17:01:20,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. 
limit=6.005 2023-06-17 17:01:33,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=4020.0, ans=0.31156249999999996 2023-06-17 17:01:48,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=4080.0, ans=0.009982608695652173 2023-06-17 17:01:50,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=9.03 2023-06-17 17:01:51,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=4080.0, ans=0.2592 2023-06-17 17:01:59,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=4080.0, ans=0.30874999999999997 2023-06-17 17:02:08,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=4140.0, ans=0.3059375 2023-06-17 17:02:21,251 INFO [train.py:996] (2/4) Epoch 1, batch 700, loss[loss=0.7582, simple_loss=0.6505, pruned_loss=0.5335, over 21738.00 frames. ], tot_loss[loss=0.9127, simple_loss=0.7782, pruned_loss=0.7294, over 4152764.16 frames. ], batch size: 112, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:02:24,617 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 4.078e+02 5.855e+02 9.456e+02 2.667e+03, threshold=1.171e+03, percent-clipped=39.0 2023-06-17 17:02:31,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=4200.0, ans=0.753 2023-06-17 17:02:42,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.38 vs. limit=7.13 2023-06-17 17:03:32,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=4380.0, ans=0.009917391304347826 2023-06-17 17:03:42,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=4380.0, ans=0.2946875 2023-06-17 17:04:05,988 INFO [train.py:996] (2/4) Epoch 1, batch 750, loss[loss=0.8406, simple_loss=0.7226, pruned_loss=0.5792, over 21792.00 frames. ], tot_loss[loss=0.8733, simple_loss=0.7459, pruned_loss=0.6801, over 4177964.81 frames. 
], batch size: 351, lr: 4.49e-02, grad_scale: 8.0 2023-06-17 17:04:19,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=4500.0, ans=0.04791666666666667 2023-06-17 17:05:03,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=4620.0, ans=10.965 2023-06-17 17:05:13,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=4620.0, ans=0.2834375 2023-06-17 17:05:16,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=4680.0, ans=0.280625 2023-06-17 17:05:32,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=4680.0, ans=0.280625 2023-06-17 17:05:38,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=4740.0, ans=0.2778125 2023-06-17 17:05:42,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=4740.0, ans=0.2778125 2023-06-17 17:05:42,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=9.2775 2023-06-17 17:05:51,453 INFO [train.py:996] (2/4) Epoch 1, batch 800, loss[loss=0.7101, simple_loss=0.6208, pruned_loss=0.4639, over 21004.00 frames. ], tot_loss[loss=0.8326, simple_loss=0.7132, pruned_loss=0.6314, over 4206043.49 frames. ], batch size: 608, lr: 4.49e-02, grad_scale: 16.0 2023-06-17 17:05:54,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 4.402e+02 7.390e+02 1.255e+03 3.583e+03, threshold=1.478e+03, percent-clipped=27.0 2023-06-17 17:05:55,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=4800.0, ans=0.275 2023-06-17 17:06:04,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.05 vs. limit=7.4 2023-06-17 17:06:15,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=4860.0, ans=0.7299 2023-06-17 17:06:24,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.47 vs. 
limit=6.215 2023-06-17 17:06:36,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=4920.0, ans=0.26937500000000003 2023-06-17 17:06:46,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=4920.0, ans=0.26937500000000003 2023-06-17 17:06:46,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=4920.0, ans=0.7278 2023-06-17 17:07:09,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=4980.0, ans=0.04591666666666667 2023-06-17 17:07:17,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=5040.0, ans=0.26375000000000004 2023-06-17 17:07:19,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=5040.0, ans=0.26375000000000004 2023-06-17 17:07:35,959 INFO [train.py:996] (2/4) Epoch 1, batch 850, loss[loss=0.6269, simple_loss=0.5526, pruned_loss=0.3973, over 21887.00 frames. ], tot_loss[loss=0.7979, simple_loss=0.6854, pruned_loss=0.5896, over 4224896.92 frames. ], batch size: 107, lr: 4.49e-02, grad_scale: 16.0 2023-06-17 17:07:43,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=11.325 2023-06-17 17:07:48,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=5100.0, ans=0.2609375 2023-06-17 17:08:54,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=5280.0, ans=0.2525 2023-06-17 17:09:19,834 INFO [train.py:996] (2/4) Epoch 1, batch 900, loss[loss=0.5491, simple_loss=0.4927, pruned_loss=0.3312, over 21291.00 frames. ], tot_loss[loss=0.7621, simple_loss=0.6566, pruned_loss=0.5497, over 4242225.15 frames. ], batch size: 159, lr: 4.48e-02, grad_scale: 16.0 2023-06-17 17:09:23,171 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 4.491e+02 8.246e+02 1.178e+03 2.944e+03, threshold=1.649e+03, percent-clipped=18.0 2023-06-17 17:09:28,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=11.55 2023-06-17 17:10:27,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.97 vs. limit=6.38 2023-06-17 17:10:37,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=9.5925 2023-06-17 17:10:38,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=5580.0, ans=0.23843750000000002 2023-06-17 17:11:05,168 INFO [train.py:996] (2/4) Epoch 1, batch 950, loss[loss=0.6748, simple_loss=0.5945, pruned_loss=0.4198, over 21500.00 frames. ], tot_loss[loss=0.7333, simple_loss=0.6344, pruned_loss=0.5155, over 4255981.37 frames. 
], batch size: 131, lr: 4.48e-02, grad_scale: 16.0 2023-06-17 17:11:26,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=5760.0, ans=0.22999999999999998 2023-06-17 17:12:00,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=5820.0, ans=0.2271875 2023-06-17 17:12:30,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=5940.0, ans=0.24059999999999998 2023-06-17 17:12:44,954 INFO [train.py:996] (2/4) Epoch 1, batch 1000, loss[loss=0.742, simple_loss=0.6391, pruned_loss=0.4776, over 21439.00 frames. ], tot_loss[loss=0.7133, simple_loss=0.6188, pruned_loss=0.4901, over 4268252.06 frames. ], batch size: 508, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:12:50,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 4.573e+02 9.444e+02 1.523e+03 4.461e+03, threshold=1.889e+03, percent-clipped=19.0 2023-06-17 17:13:55,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=6180.0, ans=0.2103125 2023-06-17 17:14:00,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.95 vs. limit=12.135 2023-06-17 17:14:14,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.43 vs. limit=3.936 2023-06-17 17:14:25,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=6240.0, ans=0.20750000000000002 2023-06-17 17:14:25,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=6240.0, ans=0.23759999999999998 2023-06-17 17:14:30,033 INFO [train.py:996] (2/4) Epoch 1, batch 1050, loss[loss=0.7408, simple_loss=0.6465, pruned_loss=0.4611, over 21529.00 frames. ], tot_loss[loss=0.693, simple_loss=0.6035, pruned_loss=0.4656, over 4272707.89 frames. ], batch size: 471, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:15:05,655 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=2.831e-01 2023-06-17 17:15:15,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=6360.0, ans=0.04016666666666667 2023-06-17 17:15:32,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=9.9075 2023-06-17 17:15:52,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.58 vs. limit=6.62 2023-06-17 17:16:19,613 INFO [train.py:996] (2/4) Epoch 1, batch 1100, loss[loss=0.6016, simple_loss=0.5417, pruned_loss=0.351, over 21261.00 frames. ], tot_loss[loss=0.6702, simple_loss=0.587, pruned_loss=0.4393, over 4275452.76 frames. 
], batch size: 548, lr: 4.48e-02, grad_scale: 8.0 2023-06-17 17:16:24,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.653e+02 4.618e+02 6.760e+02 9.652e+02 3.048e+03, threshold=1.352e+03, percent-clipped=4.0 2023-06-17 17:16:39,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6600.0, ans=0.23399999999999999 2023-06-17 17:16:49,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=6660.0, ans=0.1878125 2023-06-17 17:17:12,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=6720.0, ans=0.185 2023-06-17 17:17:14,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=6720.0, ans=0.2328 2023-06-17 17:18:16,870 INFO [train.py:996] (2/4) Epoch 1, batch 1150, loss[loss=0.626, simple_loss=0.5646, pruned_loss=0.3622, over 21789.00 frames. ], tot_loss[loss=0.6507, simple_loss=0.5726, pruned_loss=0.4176, over 4274432.88 frames. ], batch size: 316, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:18:52,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=6960.0, ans=0.6564 2023-06-17 17:18:55,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=7020.0, ans=0.17093750000000002 2023-06-17 17:19:00,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=7020.0, ans=9.3875 2023-06-17 17:19:03,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=7020.0, ans=0.009343478260869566 2023-06-17 17:20:04,107 INFO [train.py:996] (2/4) Epoch 1, batch 1200, loss[loss=0.6403, simple_loss=0.5765, pruned_loss=0.3699, over 21809.00 frames. ], tot_loss[loss=0.6376, simple_loss=0.564, pruned_loss=0.4007, over 4282262.24 frames. ], batch size: 124, lr: 4.47e-02, grad_scale: 16.0 2023-06-17 17:20:09,130 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 4.949e+02 7.827e+02 1.470e+03 3.073e+03, threshold=1.565e+03, percent-clipped=26.0 2023-06-17 17:20:43,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=12.99 2023-06-17 17:21:44,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=7440.0, ans=0.15125 2023-06-17 17:21:48,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=7500.0, ans=0.1484375 2023-06-17 17:21:50,153 INFO [train.py:996] (2/4) Epoch 1, batch 1250, loss[loss=0.5465, simple_loss=0.496, pruned_loss=0.3102, over 21390.00 frames. ], tot_loss[loss=0.6325, simple_loss=0.5614, pruned_loss=0.3906, over 4285096.91 frames. 
], batch size: 159, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:22:41,480 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:22:51,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=7680.0, ans=0.03466666666666667 2023-06-17 17:23:02,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. limit=8.84 2023-06-17 17:23:12,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=7740.0, ans=0.13718750000000002 2023-06-17 17:23:24,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=13.305 2023-06-17 17:23:34,688 INFO [train.py:996] (2/4) Epoch 1, batch 1300, loss[loss=0.4988, simple_loss=0.4369, pruned_loss=0.2976, over 20850.00 frames. ], tot_loss[loss=0.6234, simple_loss=0.5557, pruned_loss=0.3784, over 4288328.72 frames. ], batch size: 608, lr: 4.47e-02, grad_scale: 8.0 2023-06-17 17:23:46,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.014e+02 6.355e+02 9.383e+02 1.437e+03 4.251e+03, threshold=1.877e+03, percent-clipped=19.0 2023-06-17 17:24:01,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=7860.0, ans=0.22139999999999999 2023-06-17 17:24:03,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=7860.0, ans=0.13156250000000003 2023-06-17 17:24:11,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=7920.0, ans=0.03366666666666667 2023-06-17 17:24:22,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=7920.0, ans=0.12874999999999998 2023-06-17 17:25:18,188 INFO [train.py:996] (2/4) Epoch 1, batch 1350, loss[loss=0.5236, simple_loss=0.4661, pruned_loss=0.3036, over 21743.00 frames. ], tot_loss[loss=0.6152, simple_loss=0.5508, pruned_loss=0.3677, over 4295896.94 frames. ], batch size: 247, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:26:03,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=4.2330000000000005 2023-06-17 17:26:06,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=10.5825 2023-06-17 17:26:21,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.52 vs. limit=13.71 2023-06-17 17:27:04,254 INFO [train.py:996] (2/4) Epoch 1, batch 1400, loss[loss=0.4519, simple_loss=0.4209, pruned_loss=0.2442, over 14706.00 frames. ], tot_loss[loss=0.6022, simple_loss=0.5414, pruned_loss=0.355, over 4289894.64 frames. 
], batch size: 61, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:27:16,084 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.966e+02 8.167e+02 1.163e+03 2.690e+03, threshold=1.633e+03, percent-clipped=5.0 2023-06-17 17:28:04,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=8580.0, ans=0.125 2023-06-17 17:28:38,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=8640.0, ans=0.125 2023-06-17 17:28:47,820 INFO [train.py:996] (2/4) Epoch 1, batch 1450, loss[loss=0.6441, simple_loss=0.5731, pruned_loss=0.3704, over 21815.00 frames. ], tot_loss[loss=0.5957, simple_loss=0.5368, pruned_loss=0.3474, over 4291870.40 frames. ], batch size: 332, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:29:05,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.13 vs. limit=7.175 2023-06-17 17:30:15,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=8940.0, ans=0.00892608695652174 2023-06-17 17:30:31,063 INFO [train.py:996] (2/4) Epoch 1, batch 1500, loss[loss=0.5162, simple_loss=0.4745, pruned_loss=0.2834, over 21853.00 frames. ], tot_loss[loss=0.5872, simple_loss=0.5307, pruned_loss=0.3387, over 4289611.35 frames. ], batch size: 282, lr: 4.46e-02, grad_scale: 8.0 2023-06-17 17:30:42,882 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.853e+02 4.683e+02 9.412e+02 1.321e+03 2.952e+03, threshold=1.882e+03, percent-clipped=11.0 2023-06-17 17:31:24,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=9120.0, ans=0.00888695652173913 2023-06-17 17:31:45,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=9180.0, ans=0.008873913043478262 2023-06-17 17:32:09,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=9240.0, ans=0.125 2023-06-17 17:32:22,538 INFO [train.py:996] (2/4) Epoch 1, batch 1550, loss[loss=0.5622, simple_loss=0.5189, pruned_loss=0.3064, over 21508.00 frames. ], tot_loss[loss=0.5724, simple_loss=0.5209, pruned_loss=0.3256, over 4285835.22 frames. ], batch size: 473, lr: 4.45e-02, grad_scale: 8.0 2023-06-17 17:32:36,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=9300.0, ans=0.02791666666666667 2023-06-17 17:32:40,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=9360.0, ans=0.5724 2023-06-17 17:32:43,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=9360.0, ans=0.125 2023-06-17 17:32:44,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=8.08 vs. 
limit=9.68 2023-06-17 17:32:47,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=9360.0, ans=0.125 2023-06-17 17:33:11,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=9420.0, ans=0.02741666666666667 2023-06-17 17:33:11,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=9420.0, ans=0.02741666666666667 2023-06-17 17:33:59,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=9540.0, ans=0.02691666666666667 2023-06-17 17:34:06,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=9540.0, ans=0.125 2023-06-17 17:34:09,494 INFO [train.py:996] (2/4) Epoch 1, batch 1600, loss[loss=0.5853, simple_loss=0.5302, pruned_loss=0.3258, over 21369.00 frames. ], tot_loss[loss=0.5677, simple_loss=0.5183, pruned_loss=0.3199, over 4281516.07 frames. ], batch size: 548, lr: 4.45e-02, grad_scale: 16.0 2023-06-17 17:34:15,874 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.713e+02 5.768e+02 7.778e+02 1.283e+03 4.290e+03, threshold=1.556e+03, percent-clipped=12.0 2023-06-17 17:35:09,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=9720.0, ans=0.0 2023-06-17 17:35:14,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=9780.0, ans=0.008743478260869565 2023-06-17 17:35:36,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=11.19 2023-06-17 17:35:46,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=7.936 2023-06-17 17:35:49,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.05 vs. limit=14.879999999999999 2023-06-17 17:35:53,564 INFO [train.py:996] (2/4) Epoch 1, batch 1650, loss[loss=0.5803, simple_loss=0.5229, pruned_loss=0.3242, over 21845.00 frames. ], tot_loss[loss=0.561, simple_loss=0.515, pruned_loss=0.3126, over 4276170.06 frames. ], batch size: 441, lr: 4.45e-02, grad_scale: 8.0 2023-06-17 17:36:10,818 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 17:36:23,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.41 vs. limit=7.49 2023-06-17 17:36:52,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.88 vs. limit=4.503 2023-06-17 17:37:33,275 INFO [train.py:996] (2/4) Epoch 1, batch 1700, loss[loss=0.4977, simple_loss=0.4624, pruned_loss=0.268, over 21175.00 frames. ], tot_loss[loss=0.5634, simple_loss=0.5186, pruned_loss=0.3115, over 4282042.37 frames. 
], batch size: 607, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:37:41,962 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.683e+02 4.849e+02 8.667e+02 1.230e+03 2.717e+03, threshold=1.733e+03, percent-clipped=16.0 2023-06-17 17:39:09,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=10440.0, ans=0.125 2023-06-17 17:39:19,726 INFO [train.py:996] (2/4) Epoch 1, batch 1750, loss[loss=0.3901, simple_loss=0.4042, pruned_loss=0.1834, over 21617.00 frames. ], tot_loss[loss=0.5457, simple_loss=0.5081, pruned_loss=0.297, over 4283177.67 frames. ], batch size: 247, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:39:45,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=10500.0, ans=0.125 2023-06-17 17:40:16,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=10620.0, ans=0.022416666666666668 2023-06-17 17:40:26,428 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=8.852e-02 2023-06-17 17:40:43,390 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=8.438e-02 2023-06-17 17:40:46,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=10740.0, ans=0.1926 2023-06-17 17:41:18,157 INFO [train.py:996] (2/4) Epoch 1, batch 1800, loss[loss=0.5208, simple_loss=0.5266, pruned_loss=0.2541, over 21769.00 frames. ], tot_loss[loss=0.5319, simple_loss=0.4988, pruned_loss=0.2865, over 4276300.63 frames. ], batch size: 282, lr: 4.44e-02, grad_scale: 8.0 2023-06-17 17:41:26,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 4.676e+02 7.112e+02 1.184e+03 2.740e+03, threshold=1.422e+03, percent-clipped=6.0 2023-06-17 17:41:54,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=10860.0, ans=0.5199 2023-06-17 17:42:01,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=11.594999999999999 2023-06-17 17:42:14,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=10980.0, ans=0.02091666666666667 2023-06-17 17:42:30,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=11040.0, ans=0.1896 2023-06-17 17:42:59,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=11040.0, ans=0.1896 2023-06-17 17:43:02,052 INFO [train.py:996] (2/4) Epoch 1, batch 1850, loss[loss=0.4966, simple_loss=0.4747, pruned_loss=0.2589, over 21411.00 frames. ], tot_loss[loss=0.5241, simple_loss=0.4963, pruned_loss=0.2787, over 4271108.02 frames. 
], batch size: 194, lr: 4.43e-02, grad_scale: 8.0 2023-06-17 17:43:08,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=11100.0, ans=0.5115000000000001 2023-06-17 17:44:20,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=11340.0, ans=0.125 2023-06-17 17:44:39,753 INFO [train.py:996] (2/4) Epoch 1, batch 1900, loss[loss=0.3965, simple_loss=0.3972, pruned_loss=0.1968, over 21210.00 frames. ], tot_loss[loss=0.5204, simple_loss=0.4937, pruned_loss=0.2756, over 4278487.69 frames. ], batch size: 143, lr: 4.43e-02, grad_scale: 8.0 2023-06-17 17:44:47,640 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.687e+02 6.940e+02 1.118e+03 3.518e+03, threshold=1.388e+03, percent-clipped=15.0 2023-06-17 17:44:48,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=11400.0, ans=0.186 2023-06-17 17:45:19,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=11.82 2023-06-17 17:45:23,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=11520.0, ans=0.01866666666666667 2023-06-17 17:45:32,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=11580.0, ans=0.2 2023-06-17 17:46:22,605 INFO [train.py:996] (2/4) Epoch 1, batch 1950, loss[loss=0.4944, simple_loss=0.4704, pruned_loss=0.2591, over 21786.00 frames. ], tot_loss[loss=0.5115, simple_loss=0.4849, pruned_loss=0.2707, over 4262271.29 frames. ], batch size: 351, lr: 4.43e-02, grad_scale: 4.0 2023-06-17 17:46:58,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=8.704 2023-06-17 17:47:09,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=11820.0, ans=0.01741666666666667 2023-06-17 17:47:17,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=11880.0, ans=0.0 2023-06-17 17:47:42,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=11.955 2023-06-17 17:48:00,886 INFO [train.py:996] (2/4) Epoch 1, batch 2000, loss[loss=0.3433, simple_loss=0.3538, pruned_loss=0.1663, over 21764.00 frames. ], tot_loss[loss=0.4971, simple_loss=0.4747, pruned_loss=0.261, over 4263614.38 frames. ], batch size: 124, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:48:15,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.181e+02 5.337e+02 7.170e+02 1.145e+03 2.393e+03, threshold=1.434e+03, percent-clipped=15.0 2023-06-17 17:48:21,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0225 2023-06-17 17:48:44,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=12.045 2023-06-17 17:49:00,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. 
limit=12.067499999999999 2023-06-17 17:49:30,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=12240.0, ans=0.125 2023-06-17 17:49:37,872 INFO [train.py:996] (2/4) Epoch 1, batch 2050, loss[loss=0.5911, simple_loss=0.5706, pruned_loss=0.3058, over 21604.00 frames. ], tot_loss[loss=0.4973, simple_loss=0.476, pruned_loss=0.2602, over 4267975.85 frames. ], batch size: 442, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:49:48,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=12300.0, ans=0.177 2023-06-17 17:51:14,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=12600.0, ans=0.008130434782608695 2023-06-17 17:51:15,795 INFO [train.py:996] (2/4) Epoch 1, batch 2100, loss[loss=0.4884, simple_loss=0.4727, pruned_loss=0.252, over 21729.00 frames. ], tot_loss[loss=0.4998, simple_loss=0.4794, pruned_loss=0.2609, over 4276958.55 frames. ], batch size: 282, lr: 4.42e-02, grad_scale: 8.0 2023-06-17 17:51:31,181 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.912e+02 5.111e+02 7.540e+02 1.226e+03 2.396e+03, threshold=1.508e+03, percent-clipped=15.0 2023-06-17 17:51:56,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=12720.0, ans=0.008104347826086957 2023-06-17 17:52:28,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=12780.0, ans=0.125 2023-06-17 17:52:50,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.97 vs. limit=11.42 2023-06-17 17:53:00,926 INFO [train.py:996] (2/4) Epoch 1, batch 2150, loss[loss=0.4438, simple_loss=0.4063, pruned_loss=0.2406, over 20056.00 frames. ], tot_loss[loss=0.4986, simple_loss=0.4789, pruned_loss=0.2598, over 4274140.15 frames. ], batch size: 703, lr: 4.41e-02, grad_scale: 8.0 2023-06-17 17:53:33,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=12960.0, ans=0.008052173913043479 2023-06-17 17:53:34,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=12960.0, ans=0.125 2023-06-17 17:53:47,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=13020.0, ans=0.125 2023-06-17 17:54:11,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=13080.0, ans=0.125 2023-06-17 17:54:23,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=13080.0, ans=0.125 2023-06-17 17:54:44,038 INFO [train.py:996] (2/4) Epoch 1, batch 2200, loss[loss=0.4377, simple_loss=0.4521, pruned_loss=0.2116, over 21805.00 frames. ], tot_loss[loss=0.4972, simple_loss=0.4814, pruned_loss=0.257, over 4268522.01 frames. 
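Most scaling.py:182 lines above print a ScheduledFloat, a hyperparameter whose value depends on batch_count: skip rates and dropout probabilities shrink as training progresses, while whitening limits relax. Below is a minimal piecewise-linear sketch of that idea; the breakpoints are invented for illustration and are not the schedules used by these modules.

```python
class PiecewiseLinearFloat:
    """Value interpolated linearly between (batch_count, value) breakpoints."""
    def __init__(self, *points):
        self.points = sorted(points)          # e.g. (0, 0.5), (4000, 0.05), (16000, 0.0)

    def __call__(self, batch_count: float) -> float:
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# Invented breakpoints: a skip rate that decays toward 0.0, as in the entries above.
conv_skip_rate = PiecewiseLinearFloat((0, 0.5), (4000, 0.05), (16000, 0.0))
print(f"batch_count=10620.0, ans={conv_skip_rate(10620.0):.4f}")
```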
], batch size: 371, lr: 4.41e-02, grad_scale: 8.0 2023-06-17 17:54:59,362 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 5.225e+02 6.882e+02 1.154e+03 2.681e+03, threshold=1.376e+03, percent-clipped=19.0 2023-06-17 17:55:35,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=13320.0, ans=0.011166666666666672 2023-06-17 17:55:40,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=13320.0, ans=0.09899494936611666 2023-06-17 17:56:34,502 INFO [train.py:996] (2/4) Epoch 1, batch 2250, loss[loss=0.4624, simple_loss=0.4368, pruned_loss=0.244, over 21787.00 frames. ], tot_loss[loss=0.4898, simple_loss=0.4773, pruned_loss=0.2515, over 4272942.91 frames. ], batch size: 371, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 17:56:48,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=13500.0, ans=0.007934782608695653 2023-06-17 17:58:07,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=13740.0, ans=0.125 2023-06-17 17:58:19,053 INFO [train.py:996] (2/4) Epoch 1, batch 2300, loss[loss=0.4092, simple_loss=0.4043, pruned_loss=0.2071, over 21274.00 frames. ], tot_loss[loss=0.4817, simple_loss=0.4697, pruned_loss=0.2471, over 4276054.35 frames. ], batch size: 176, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 17:58:29,083 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.055e+02 5.278e+02 8.077e+02 1.161e+03 3.244e+03, threshold=1.615e+03, percent-clipped=15.0 2023-06-17 17:58:48,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=13860.0, ans=0.125 2023-06-17 17:59:07,252 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.05 vs. limit=12.719999999999999 2023-06-17 17:59:12,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=13980.0, ans=0.125 2023-06-17 17:59:22,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=13980.0, ans=0.125 2023-06-17 18:00:03,819 INFO [train.py:996] (2/4) Epoch 1, batch 2350, loss[loss=0.4393, simple_loss=0.4395, pruned_loss=0.2196, over 21432.00 frames. ], tot_loss[loss=0.479, simple_loss=0.4678, pruned_loss=0.2453, over 4269768.28 frames. ], batch size: 131, lr: 4.40e-02, grad_scale: 8.0 2023-06-17 18:00:31,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=14160.0, ans=0.125 2023-06-17 18:00:35,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=12.809999999999999 2023-06-17 18:01:49,087 INFO [train.py:996] (2/4) Epoch 1, batch 2400, loss[loss=0.5159, simple_loss=0.5046, pruned_loss=0.2636, over 21353.00 frames. ], tot_loss[loss=0.483, simple_loss=0.4726, pruned_loss=0.2469, over 4269967.17 frames. 
], batch size: 143, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:01:59,367 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 4.626e+02 8.072e+02 1.275e+03 2.674e+03, threshold=1.614e+03, percent-clipped=13.0 2023-06-17 18:02:49,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=14520.0, ans=0.15480000000000002 2023-06-17 18:03:04,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=14580.0, ans=0.125 2023-06-17 18:03:34,296 INFO [train.py:996] (2/4) Epoch 1, batch 2450, loss[loss=0.4361, simple_loss=0.4353, pruned_loss=0.2184, over 21502.00 frames. ], tot_loss[loss=0.4859, simple_loss=0.4752, pruned_loss=0.2484, over 4268915.69 frames. ], batch size: 230, lr: 4.39e-02, grad_scale: 16.0 2023-06-17 18:03:44,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=14700.0, ans=0.38550000000000006 2023-06-17 18:03:47,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=14700.0, ans=0.125 2023-06-17 18:03:58,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=14760.0, ans=0.005166666666666667 2023-06-17 18:05:16,580 INFO [train.py:996] (2/4) Epoch 1, batch 2500, loss[loss=0.637, simple_loss=0.5748, pruned_loss=0.3496, over 21765.00 frames. ], tot_loss[loss=0.4775, simple_loss=0.468, pruned_loss=0.2436, over 4268530.78 frames. ], batch size: 441, lr: 4.38e-02, grad_scale: 8.0 2023-06-17 18:05:22,523 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:05:28,406 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.577e+02 4.881e+02 6.609e+02 9.679e+02 1.963e+03, threshold=1.322e+03, percent-clipped=4.0 2023-06-17 18:05:37,158 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:05:42,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.29 vs. limit=13.1475 2023-06-17 18:05:46,024 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.08 vs. limit=8.765 2023-06-17 18:05:51,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=15120.0, ans=0.0036666666666666722 2023-06-17 18:06:18,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=15180.0, ans=0.0 2023-06-17 18:06:18,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=15180.0, ans=0.125 2023-06-17 18:06:44,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=10.096 2023-06-17 18:07:00,748 INFO [train.py:996] (2/4) Epoch 1, batch 2550, loss[loss=0.5079, simple_loss=0.4888, pruned_loss=0.2634, over 21464.00 frames. ], tot_loss[loss=0.4733, simple_loss=0.4668, pruned_loss=0.24, over 4261330.01 frames. 
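The scaling.py:962 lines compare a whitening metric for a module's activations against a scheduled limit, and any corrective action only matters when the metric exceeds that limit. The metric below (largest eigenvalue of the channel covariance divided by the mean eigenvalue, which equals 1.0 for perfectly white features) is a stand-in chosen for illustration; it is not the formula used in scaling.py.

```python
import torch

def whiteness_metric(x: torch.Tensor) -> float:
    """x: (num_frames, num_channels) activations for one group."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / max(x.shape[0] - 1, 1)          # channel covariance
    eigs = torch.linalg.eigvalsh(cov)                 # covariance is symmetric
    return (eigs.max() / eigs.mean().clamp(min=1e-20)).item()

x = torch.randn(2000, 256)                            # roughly white activations
metric, limit = whiteness_metric(x), 7.5
print(f"Whitening: num_groups=1, num_channels=256, metric={metric:.2f} vs. limit={limit}")
# Large metric values early in training mean the activations are far from white;
# whether anything is corrected depends on metric > limit.
```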
], batch size: 211, lr: 4.38e-02, grad_scale: 8.0 2023-06-17 18:07:18,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=15360.0, ans=19.02 2023-06-17 18:08:38,605 INFO [train.py:996] (2/4) Epoch 1, batch 2600, loss[loss=0.4564, simple_loss=0.4562, pruned_loss=0.2283, over 21798.00 frames. ], tot_loss[loss=0.4756, simple_loss=0.4707, pruned_loss=0.2403, over 4258545.42 frames. ], batch size: 247, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:08:50,286 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.934e+02 4.629e+02 7.030e+02 1.078e+03 2.784e+03, threshold=1.406e+03, percent-clipped=16.0 2023-06-17 18:09:01,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.50 vs. limit=5.349 2023-06-17 18:09:03,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=15660.0, ans=0.1434 2023-06-17 18:09:44,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=15780.0, ans=0.14220000000000002 2023-06-17 18:09:56,441 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:10:21,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=15840.0, ans=0.125 2023-06-17 18:10:24,613 INFO [train.py:996] (2/4) Epoch 1, batch 2650, loss[loss=0.4346, simple_loss=0.4346, pruned_loss=0.2173, over 21524.00 frames. ], tot_loss[loss=0.4743, simple_loss=0.4702, pruned_loss=0.2392, over 4258678.70 frames. ], batch size: 194, lr: 4.37e-02, grad_scale: 8.0 2023-06-17 18:10:33,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=15900.0, ans=0.02 2023-06-17 18:11:24,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=16020.0, ans=0.0 2023-06-17 18:11:55,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=16140.0, ans=0.0 2023-06-17 18:11:58,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=16140.0, ans=0.33510000000000006 2023-06-17 18:12:01,496 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:12:07,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=16140.0, ans=0.0 2023-06-17 18:12:10,382 INFO [train.py:996] (2/4) Epoch 1, batch 2700, loss[loss=0.4009, simple_loss=0.4078, pruned_loss=0.197, over 21834.00 frames. ], tot_loss[loss=0.4639, simple_loss=0.4618, pruned_loss=0.2331, over 4257628.06 frames. 
], batch size: 118, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:12:14,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=16200.0, ans=0.0 2023-06-17 18:12:20,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=16200.0, ans=0.125 2023-06-17 18:12:21,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.712e+02 4.286e+02 6.579e+02 1.091e+03 3.152e+03, threshold=1.316e+03, percent-clipped=14.0 2023-06-17 18:12:22,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.60 vs. limit=13.1 2023-06-17 18:13:54,407 INFO [train.py:996] (2/4) Epoch 1, batch 2750, loss[loss=0.5282, simple_loss=0.4895, pruned_loss=0.2835, over 21737.00 frames. ], tot_loss[loss=0.4636, simple_loss=0.462, pruned_loss=0.2326, over 4259007.85 frames. ], batch size: 508, lr: 4.36e-02, grad_scale: 4.0 2023-06-17 18:14:10,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=16560.0, ans=0.125 2023-06-17 18:14:38,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=9.155000000000001 2023-06-17 18:15:35,690 INFO [train.py:996] (2/4) Epoch 1, batch 2800, loss[loss=0.4448, simple_loss=0.4433, pruned_loss=0.2232, over 21130.00 frames. ], tot_loss[loss=0.4715, simple_loss=0.4695, pruned_loss=0.2367, over 4260751.64 frames. ], batch size: 143, lr: 4.36e-02, grad_scale: 8.0 2023-06-17 18:15:59,819 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.894e+02 6.832e+02 1.003e+03 4.773e+03, threshold=1.366e+03, percent-clipped=15.0 2023-06-17 18:16:10,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=16860.0, ans=0.0 2023-06-17 18:16:46,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=16980.0, ans=0.0 2023-06-17 18:17:16,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.05 vs. limit=20.28 2023-06-17 18:17:20,327 INFO [train.py:996] (2/4) Epoch 1, batch 2850, loss[loss=0.5172, simple_loss=0.4989, pruned_loss=0.2677, over 21436.00 frames. ], tot_loss[loss=0.4676, simple_loss=0.4658, pruned_loss=0.2347, over 4259394.85 frames. ], batch size: 507, lr: 4.35e-02, grad_scale: 8.0 2023-06-17 18:17:58,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=17160.0, ans=0.07 2023-06-17 18:18:35,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=17280.0, ans=0.1272 2023-06-17 18:18:40,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=17280.0, ans=0.125 2023-06-17 18:19:02,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=17400.0, ans=0.007086956521739131 2023-06-17 18:19:03,968 INFO [train.py:996] (2/4) Epoch 1, batch 2900, loss[loss=0.5635, simple_loss=0.5593, pruned_loss=0.2839, over 20829.00 frames. 
], tot_loss[loss=0.4626, simple_loss=0.4615, pruned_loss=0.2319, over 4259397.35 frames. ], batch size: 607, lr: 4.35e-02, grad_scale: 8.0 2023-06-17 18:19:28,661 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 4.517e+02 6.306e+02 8.812e+02 1.788e+03, threshold=1.261e+03, percent-clipped=6.0 2023-06-17 18:19:30,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=17400.0, ans=14.025 2023-06-17 18:19:46,056 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:19:47,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=17520.0, ans=0.007060869565217391 2023-06-17 18:20:54,504 INFO [train.py:996] (2/4) Epoch 1, batch 2950, loss[loss=0.4303, simple_loss=0.4672, pruned_loss=0.1967, over 21799.00 frames. ], tot_loss[loss=0.462, simple_loss=0.4625, pruned_loss=0.2308, over 4272902.60 frames. ], batch size: 298, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:22:06,036 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:22:17,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=17940.0, ans=0.125 2023-06-17 18:22:27,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=17940.0, ans=0.125 2023-06-17 18:22:44,545 INFO [train.py:996] (2/4) Epoch 1, batch 3000, loss[loss=0.538, simple_loss=0.5258, pruned_loss=0.2751, over 21537.00 frames. ], tot_loss[loss=0.4656, simple_loss=0.4672, pruned_loss=0.232, over 4277182.50 frames. ], batch size: 441, lr: 4.34e-02, grad_scale: 8.0 2023-06-17 18:22:44,545 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-17 18:23:01,490 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3658, simple_loss=0.4363, pruned_loss=0.1476, over 1796401.00 frames. 2023-06-17 18:23:01,491 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-17 18:23:19,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=18000.0, ans=0.0 2023-06-17 18:23:20,427 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.233e+02 5.025e+02 6.573e+02 9.808e+02 2.550e+03, threshold=1.315e+03, percent-clipped=11.0 2023-06-17 18:23:40,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=18060.0, ans=0.1194 2023-06-17 18:23:44,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=18120.0, ans=0.125 2023-06-17 18:24:04,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=18180.0, ans=0.1182 2023-06-17 18:24:24,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=18240.0, ans=0.006904347826086957 2023-06-17 18:24:45,857 INFO [train.py:996] (2/4) Epoch 1, batch 3050, loss[loss=0.37, simple_loss=0.3975, pruned_loss=0.1712, over 21625.00 frames. ], tot_loss[loss=0.4594, simple_loss=0.4643, pruned_loss=0.2273, over 4280703.36 frames. 
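Periodically (here at Epoch 1, batch 3000) the loop pauses to compute a frame-weighted validation loss and then reports the CUDA memory high-water mark. A sketch of that flow is below; compute_loss and the dev loader are hypothetical stand-ins, while torch.no_grad() and torch.cuda.max_memory_allocated() are standard PyTorch calls.

```python
import torch

def validate(model, dev_loader, device, compute_loss):
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)   # hypothetical helper
            tot_loss += loss.item() * num_frames            # frame-weighted sum
            tot_frames += num_frames
    model.train()
    avg = tot_loss / max(tot_frames, 1.0)
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={avg:.4f}, over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {max_mb}MB")
    return avg
```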
], batch size: 230, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:25:05,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.75 vs. limit=9.575 2023-06-17 18:25:17,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.50 vs. limit=21.27 2023-06-17 18:25:24,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=18360.0, ans=0.0 2023-06-17 18:25:43,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=18420.0, ans=0.125 2023-06-17 18:26:35,510 INFO [train.py:996] (2/4) Epoch 1, batch 3100, loss[loss=0.4078, simple_loss=0.4189, pruned_loss=0.1984, over 21455.00 frames. ], tot_loss[loss=0.4568, simple_loss=0.463, pruned_loss=0.2253, over 4279730.11 frames. ], batch size: 211, lr: 4.33e-02, grad_scale: 8.0 2023-06-17 18:26:53,817 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.845e+02 4.821e+02 6.301e+02 1.043e+03 2.318e+03, threshold=1.260e+03, percent-clipped=14.0 2023-06-17 18:27:21,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=18720.0, ans=0.125 2023-06-17 18:27:26,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=18720.0, ans=0.025 2023-06-17 18:27:28,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=18780.0, ans=0.0 2023-06-17 18:28:12,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=18840.0, ans=0.0 2023-06-17 18:28:20,395 INFO [train.py:996] (2/4) Epoch 1, batch 3150, loss[loss=0.594, simple_loss=0.6012, pruned_loss=0.2934, over 21260.00 frames. ], tot_loss[loss=0.4592, simple_loss=0.4649, pruned_loss=0.2267, over 4282480.29 frames. ], batch size: 548, lr: 4.32e-02, grad_scale: 8.0 2023-06-17 18:28:43,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=14.61 2023-06-17 18:28:58,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=14.6325 2023-06-17 18:29:36,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=14.655000000000001 2023-06-17 18:29:55,628 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.90 vs. limit=14.6775 2023-06-17 18:30:01,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=19140.0, ans=0.0 2023-06-17 18:30:11,525 INFO [train.py:996] (2/4) Epoch 1, batch 3200, loss[loss=0.4613, simple_loss=0.4425, pruned_loss=0.2401, over 20075.00 frames. ], tot_loss[loss=0.4542, simple_loss=0.4617, pruned_loss=0.2234, over 4277774.42 frames. 
], batch size: 707, lr: 4.32e-02, grad_scale: 16.0 2023-06-17 18:30:11,928 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:30:15,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=19200.0, ans=0.10800000000000001 2023-06-17 18:30:24,584 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 4.999e+02 6.065e+02 1.040e+03 2.031e+03, threshold=1.213e+03, percent-clipped=14.0 2023-06-17 18:31:49,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=19440.0, ans=0.21960000000000002 2023-06-17 18:31:55,106 INFO [train.py:996] (2/4) Epoch 1, batch 3250, loss[loss=0.3886, simple_loss=0.3896, pruned_loss=0.1938, over 21363.00 frames. ], tot_loss[loss=0.4556, simple_loss=0.4613, pruned_loss=0.225, over 4277487.87 frames. ], batch size: 194, lr: 4.31e-02, grad_scale: 8.0 2023-06-17 18:32:47,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=19620.0, ans=0.125 2023-06-17 18:33:29,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=19740.0, ans=0.125 2023-06-17 18:33:40,012 INFO [train.py:996] (2/4) Epoch 1, batch 3300, loss[loss=0.5027, simple_loss=0.4968, pruned_loss=0.2543, over 21771.00 frames. ], tot_loss[loss=0.4518, simple_loss=0.4574, pruned_loss=0.2231, over 4272532.15 frames. ], batch size: 441, lr: 4.31e-02, grad_scale: 8.0 2023-06-17 18:34:06,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 4.541e+02 6.764e+02 1.015e+03 2.529e+03, threshold=1.353e+03, percent-clipped=14.0 2023-06-17 18:34:06,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=19860.0, ans=0.125 2023-06-17 18:34:34,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=19920.0, ans=0.125 2023-06-17 18:34:36,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=19920.0, ans=0.125 2023-06-17 18:34:43,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=19920.0, ans=0.125 2023-06-17 18:35:24,218 INFO [train.py:996] (2/4) Epoch 1, batch 3350, loss[loss=0.4095, simple_loss=0.419, pruned_loss=0.2, over 21052.00 frames. ], tot_loss[loss=0.4513, simple_loss=0.4595, pruned_loss=0.2216, over 4276697.61 frames. ], batch size: 607, lr: 4.30e-02, grad_scale: 8.0 2023-06-17 18:36:26,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=20220.0, ans=0.0064739130434782605 2023-06-17 18:36:47,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=20280.0, ans=0.0 2023-06-17 18:37:12,830 INFO [train.py:996] (2/4) Epoch 1, batch 3400, loss[loss=0.4331, simple_loss=0.4419, pruned_loss=0.2122, over 21774.00 frames. ], tot_loss[loss=0.4514, simple_loss=0.4587, pruned_loss=0.2221, over 4278276.90 frames. 
], batch size: 351, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 18:37:30,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=20400.0, ans=0.125 2023-06-17 18:37:34,560 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.423e+02 4.338e+02 6.007e+02 8.675e+02 3.027e+03, threshold=1.201e+03, percent-clipped=6.0 2023-06-17 18:37:55,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=20520.0, ans=0.006408695652173913 2023-06-17 18:38:25,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=20580.0, ans=0.125 2023-06-17 18:38:32,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=20580.0, ans=0.125 2023-06-17 18:39:03,065 INFO [train.py:996] (2/4) Epoch 1, batch 3450, loss[loss=0.4447, simple_loss=0.4361, pruned_loss=0.2267, over 21423.00 frames. ], tot_loss[loss=0.4472, simple_loss=0.4538, pruned_loss=0.2203, over 4280145.15 frames. ], batch size: 389, lr: 4.29e-02, grad_scale: 8.0 2023-06-17 18:39:10,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=20700.0, ans=0.006369565217391304 2023-06-17 18:39:39,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=20760.0, ans=0.025 2023-06-17 18:39:44,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=20820.0, ans=0.125 2023-06-17 18:40:04,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=20880.0, ans=0.2 2023-06-17 18:40:47,111 INFO [train.py:996] (2/4) Epoch 1, batch 3500, loss[loss=0.4881, simple_loss=0.4865, pruned_loss=0.2448, over 21481.00 frames. ], tot_loss[loss=0.457, simple_loss=0.4636, pruned_loss=0.2251, over 4275020.74 frames. ], batch size: 211, lr: 4.28e-02, grad_scale: 8.0 2023-06-17 18:41:09,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 4.958e+02 6.770e+02 9.160e+02 2.307e+03, threshold=1.354e+03, percent-clipped=16.0 2023-06-17 18:41:14,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=21060.0, ans=0.125 2023-06-17 18:42:19,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=21240.0, ans=0.2 2023-06-17 18:42:21,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=21240.0, ans=0.0 2023-06-17 18:42:32,173 INFO [train.py:996] (2/4) Epoch 1, batch 3550, loss[loss=0.4342, simple_loss=0.431, pruned_loss=0.2187, over 21543.00 frames. ], tot_loss[loss=0.458, simple_loss=0.465, pruned_loss=0.2255, over 4281565.26 frames. ], batch size: 441, lr: 4.28e-02, grad_scale: 4.0 2023-06-17 18:43:49,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=21480.0, ans=0.125 2023-06-17 18:44:21,641 INFO [train.py:996] (2/4) Epoch 1, batch 3600, loss[loss=0.4923, simple_loss=0.4703, pruned_loss=0.2571, over 21599.00 frames. ], tot_loss[loss=0.453, simple_loss=0.4586, pruned_loss=0.2237, over 4276500.75 frames. 
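The grad_scale value attached to each training entry (4.0, 8.0, 16.0 in the entries above) is the dynamic loss scale of mixed-precision training. A minimal sketch with the standard PyTorch AMP API follows; compute_loss is a hypothetical helper, and the real loop in train.py does considerably more (DDP, schedulers, checkpointing).

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()                              # dynamic loss scale, adjusted over time

def train_step(model, optimizer, batch, compute_loss):
    optimizer.zero_grad()
    with autocast():                               # forward pass in mixed precision
        loss, _ = compute_loss(model, batch)       # hypothetical helper
    scaler.scale(loss).backward()                  # backward on the scaled loss
    scaler.step(optimizer)                         # unscales grads, then optimizer.step()
    scaler.update()                                # shrinks the scale on overflow, grows it otherwise
    return loss.detach(), scaler.get_scale()       # get_scale() is the grad_scale seen in the log
```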
], batch size: 415, lr: 4.27e-02, grad_scale: 8.0 2023-06-17 18:44:39,442 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 4.436e+02 5.716e+02 8.040e+02 1.927e+03, threshold=1.143e+03, percent-clipped=4.0 2023-06-17 18:44:39,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=21660.0, ans=0.2 2023-06-17 18:44:40,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.56 vs. limit=12.0 2023-06-17 18:44:41,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=21660.0, ans=0.07 2023-06-17 18:46:04,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-17 18:46:05,198 INFO [train.py:996] (2/4) Epoch 1, batch 3650, loss[loss=0.4432, simple_loss=0.46, pruned_loss=0.2132, over 21835.00 frames. ], tot_loss[loss=0.4536, simple_loss=0.4604, pruned_loss=0.2234, over 4264266.94 frames. ], batch size: 371, lr: 4.27e-02, grad_scale: 8.0 2023-06-17 18:47:48,621 INFO [train.py:996] (2/4) Epoch 1, batch 3700, loss[loss=0.5175, simple_loss=0.5049, pruned_loss=0.2651, over 21861.00 frames. ], tot_loss[loss=0.4492, simple_loss=0.4573, pruned_loss=0.2206, over 4272965.51 frames. ], batch size: 414, lr: 4.26e-02, grad_scale: 8.0 2023-06-17 18:47:50,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=22200.0, ans=0.125 2023-06-17 18:48:06,735 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.875e+02 4.996e+02 7.328e+02 1.013e+03 2.628e+03, threshold=1.466e+03, percent-clipped=16.0 2023-06-17 18:48:39,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=22320.0, ans=0.125 2023-06-17 18:49:09,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=15.0 2023-06-17 18:49:12,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=22440.0, ans=0.2 2023-06-17 18:49:32,164 INFO [train.py:996] (2/4) Epoch 1, batch 3750, loss[loss=0.4625, simple_loss=0.4637, pruned_loss=0.2307, over 21738.00 frames. ], tot_loss[loss=0.4431, simple_loss=0.452, pruned_loss=0.2171, over 4280855.26 frames. ], batch size: 441, lr: 4.26e-02, grad_scale: 8.0 2023-06-17 18:49:33,221 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=18.97 vs. 
limit=22.5 2023-06-17 18:50:28,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=22620.0, ans=0.125 2023-06-17 18:50:40,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=22680.0, ans=0.1 2023-06-17 18:50:40,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=22680.0, ans=0.005939130434782608 2023-06-17 18:50:46,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=22680.0, ans=0.0 2023-06-17 18:51:16,307 INFO [train.py:996] (2/4) Epoch 1, batch 3800, loss[loss=0.4843, simple_loss=0.4924, pruned_loss=0.2381, over 21836.00 frames. ], tot_loss[loss=0.4376, simple_loss=0.4487, pruned_loss=0.2132, over 4277618.38 frames. ], batch size: 118, lr: 4.25e-02, grad_scale: 8.0 2023-06-17 18:51:39,171 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.895e+02 5.418e+02 7.571e+02 2.562e+03, threshold=1.084e+03, percent-clipped=5.0 2023-06-17 18:51:41,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=22860.0, ans=0.125 2023-06-17 18:52:37,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-17 18:52:59,002 INFO [train.py:996] (2/4) Epoch 1, batch 3850, loss[loss=0.4133, simple_loss=0.4325, pruned_loss=0.197, over 20997.00 frames. ], tot_loss[loss=0.4385, simple_loss=0.4478, pruned_loss=0.2146, over 4259620.97 frames. ], batch size: 608, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 18:53:12,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=23100.0, ans=0.1 2023-06-17 18:53:50,149 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:54:23,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=23340.0, ans=0.005795652173913044 2023-06-17 18:54:28,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=23340.0, ans=0.2 2023-06-17 18:54:36,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-17 18:54:40,132 INFO [train.py:996] (2/4) Epoch 1, batch 3900, loss[loss=0.4286, simple_loss=0.4328, pruned_loss=0.2122, over 21590.00 frames. ], tot_loss[loss=0.4352, simple_loss=0.4441, pruned_loss=0.2131, over 4268244.77 frames. 
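Each train.py:996 entry prints a per-batch loss[... over N frames] next to a running tot_loss[... over M frames], where M is several million frames, which suggests a frame-weighted aggregate over many recent batches. The small tracker below captures that idea only; the exact aggregation in train.py may differ.

```python
class RunningFrameLoss:
    """Frame-weighted running average of the training loss."""
    def __init__(self, decay: float = 0.99):
        self.decay = decay          # older batches are gradually down-weighted
        self.loss_sum = 0.0
        self.frame_sum = 0.0

    def update(self, batch_loss: float, num_frames: float):
        self.loss_sum = self.loss_sum * self.decay + batch_loss * num_frames
        self.frame_sum = self.frame_sum * self.decay + num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frame_sum, 1.0)

tracker = RunningFrameLoss()
tracker.update(0.4286, 21590.0)     # numbers from the batch 3900 entry above
print(f"tot_loss[loss={tracker.value:.4f}, over {tracker.frame_sum:.2f} frames.]")
```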
], batch size: 263, lr: 4.24e-02, grad_scale: 8.0 2023-06-17 18:54:54,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=23400.0, ans=0.125 2023-06-17 18:54:59,273 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.423e+02 4.722e+02 6.490e+02 9.055e+02 2.329e+03, threshold=1.298e+03, percent-clipped=15.0 2023-06-17 18:55:24,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=23520.0, ans=0.125 2023-06-17 18:55:43,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=23580.0, ans=0.125 2023-06-17 18:56:02,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=23580.0, ans=0.125 2023-06-17 18:56:03,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.72 vs. limit=10.0 2023-06-17 18:56:25,085 INFO [train.py:996] (2/4) Epoch 1, batch 3950, loss[loss=0.3648, simple_loss=0.385, pruned_loss=0.1723, over 21815.00 frames. ], tot_loss[loss=0.4307, simple_loss=0.4421, pruned_loss=0.2097, over 4271602.06 frames. ], batch size: 124, lr: 4.23e-02, grad_scale: 8.0 2023-06-17 18:56:46,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=23760.0, ans=0.005704347826086957 2023-06-17 18:56:53,671 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 18:56:53,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=23760.0, ans=0.125 2023-06-17 18:56:55,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=23760.0, ans=0.125 2023-06-17 18:57:14,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=23820.0, ans=0.125 2023-06-17 18:58:09,778 INFO [train.py:996] (2/4) Epoch 1, batch 4000, loss[loss=0.3432, simple_loss=0.3617, pruned_loss=0.1623, over 21839.00 frames. ], tot_loss[loss=0.4229, simple_loss=0.4353, pruned_loss=0.2052, over 4266067.71 frames. ], batch size: 98, lr: 4.23e-02, grad_scale: 16.0 2023-06-17 18:58:33,383 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.808e+02 4.109e+02 5.052e+02 7.332e+02 1.857e+03, threshold=1.010e+03, percent-clipped=6.0 2023-06-17 18:58:38,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=24060.0, ans=0.5 2023-06-17 18:58:45,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=24060.0, ans=0.05 2023-06-17 18:58:59,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=24120.0, ans=0.125 2023-06-17 18:59:20,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.08 vs. 
limit=15.0 2023-06-17 18:59:32,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=24180.0, ans=0.0 2023-06-17 18:59:52,466 INFO [train.py:996] (2/4) Epoch 1, batch 4050, loss[loss=0.398, simple_loss=0.4091, pruned_loss=0.1935, over 21378.00 frames. ], tot_loss[loss=0.4195, simple_loss=0.4341, pruned_loss=0.2025, over 4258305.71 frames. ], batch size: 131, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 19:01:06,613 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:01:10,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=24480.0, ans=0.0 2023-06-17 19:01:10,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-17 19:01:21,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=24540.0, ans=0.2 2023-06-17 19:01:35,958 INFO [train.py:996] (2/4) Epoch 1, batch 4100, loss[loss=0.4247, simple_loss=0.4506, pruned_loss=0.1994, over 21686.00 frames. ], tot_loss[loss=0.4206, simple_loss=0.4351, pruned_loss=0.203, over 4266525.44 frames. ], batch size: 389, lr: 4.22e-02, grad_scale: 8.0 2023-06-17 19:01:41,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=24600.0, ans=0.125 2023-06-17 19:02:01,003 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.803e+02 4.142e+02 6.350e+02 1.020e+03 2.376e+03, threshold=1.270e+03, percent-clipped=25.0 2023-06-17 19:03:05,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=24840.0, ans=0.125 2023-06-17 19:03:10,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=24840.0, ans=0.1 2023-06-17 19:03:19,149 INFO [train.py:996] (2/4) Epoch 1, batch 4150, loss[loss=0.2943, simple_loss=0.3614, pruned_loss=0.1136, over 21209.00 frames. ], tot_loss[loss=0.4105, simple_loss=0.4317, pruned_loss=0.1947, over 4270827.19 frames. ], batch size: 176, lr: 4.21e-02, grad_scale: 8.0 2023-06-17 19:03:20,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=3.68 vs. 
limit=15.0 2023-06-17 19:03:42,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=24960.0, ans=0.05 2023-06-17 19:03:59,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=24960.0, ans=0.125 2023-06-17 19:04:05,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=24960.0, ans=0.0 2023-06-17 19:04:07,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=25020.0, ans=0.125 2023-06-17 19:04:32,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=25080.0, ans=0.05 2023-06-17 19:04:42,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=25080.0, ans=0.125 2023-06-17 19:04:52,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=25140.0, ans=0.1 2023-06-17 19:05:09,995 INFO [train.py:996] (2/4) Epoch 1, batch 4200, loss[loss=0.4934, simple_loss=0.529, pruned_loss=0.2289, over 21597.00 frames. ], tot_loss[loss=0.4099, simple_loss=0.4312, pruned_loss=0.1943, over 4265608.67 frames. ], batch size: 389, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:05:17,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=25200.0, ans=0.0 2023-06-17 19:05:36,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=25260.0, ans=0.005378260869565218 2023-06-17 19:05:46,229 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 4.243e+02 5.422e+02 7.726e+02 1.559e+03, threshold=1.084e+03, percent-clipped=3.0 2023-06-17 19:06:37,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=25440.0, ans=0.125 2023-06-17 19:07:07,396 INFO [train.py:996] (2/4) Epoch 1, batch 4250, loss[loss=0.6338, simple_loss=0.59, pruned_loss=0.3388, over 21331.00 frames. ], tot_loss[loss=0.4178, simple_loss=0.4388, pruned_loss=0.1984, over 4261999.07 frames. 
], batch size: 507, lr: 4.20e-02, grad_scale: 8.0 2023-06-17 19:07:07,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=25500.0, ans=0.1 2023-06-17 19:07:12,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=25500.0, ans=0.125 2023-06-17 19:07:26,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=25500.0, ans=0.005326086956521739 2023-06-17 19:07:31,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=25560.0, ans=0.0 2023-06-17 19:07:54,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=25620.0, ans=0.125 2023-06-17 19:07:58,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=25620.0, ans=0.125 2023-06-17 19:08:05,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=25680.0, ans=0.125 2023-06-17 19:08:58,887 INFO [train.py:996] (2/4) Epoch 1, batch 4300, loss[loss=0.4336, simple_loss=0.4364, pruned_loss=0.2154, over 21544.00 frames. ], tot_loss[loss=0.4269, simple_loss=0.4474, pruned_loss=0.2032, over 4266832.90 frames. ], batch size: 548, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:09:08,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-17 19:09:18,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.691e+02 4.313e+02 6.396e+02 8.892e+02 2.391e+03, threshold=1.279e+03, percent-clipped=16.0 2023-06-17 19:09:22,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=25860.0, ans=0.0 2023-06-17 19:09:59,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=25980.0, ans=0.125 2023-06-17 19:10:22,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-17 19:10:42,371 INFO [train.py:996] (2/4) Epoch 1, batch 4350, loss[loss=0.3797, simple_loss=0.3905, pruned_loss=0.1844, over 21628.00 frames. ], tot_loss[loss=0.4226, simple_loss=0.4434, pruned_loss=0.2008, over 4272130.52 frames. ], batch size: 282, lr: 4.19e-02, grad_scale: 8.0 2023-06-17 19:11:06,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.15 vs. 
limit=15.0 2023-06-17 19:11:19,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=26220.0, ans=0.125 2023-06-17 19:11:59,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=26280.0, ans=0.125 2023-06-17 19:12:14,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=26340.0, ans=0.0 2023-06-17 19:12:26,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=26400.0, ans=0.125 2023-06-17 19:12:27,501 INFO [train.py:996] (2/4) Epoch 1, batch 4400, loss[loss=0.3663, simple_loss=0.3984, pruned_loss=0.1671, over 21695.00 frames. ], tot_loss[loss=0.4204, simple_loss=0.4402, pruned_loss=0.2003, over 4265558.76 frames. ], batch size: 247, lr: 4.18e-02, grad_scale: 16.0 2023-06-17 19:12:35,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=26400.0, ans=0.125 2023-06-17 19:12:48,108 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.456e+02 3.803e+02 5.319e+02 7.173e+02 2.856e+03, threshold=1.064e+03, percent-clipped=8.0 2023-06-17 19:13:09,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=26520.0, ans=0.2 2023-06-17 19:13:22,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=26520.0, ans=0.0 2023-06-17 19:13:46,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.47 vs. limit=6.0 2023-06-17 19:14:13,430 INFO [train.py:996] (2/4) Epoch 1, batch 4450, loss[loss=0.5426, simple_loss=0.5544, pruned_loss=0.2654, over 21530.00 frames. ], tot_loss[loss=0.4261, simple_loss=0.4483, pruned_loss=0.202, over 4266312.50 frames. ], batch size: 471, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 19:15:13,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=26880.0, ans=0.125 2023-06-17 19:15:28,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=26880.0, ans=0.0 2023-06-17 19:15:51,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=27000.0, ans=0.125 2023-06-17 19:15:52,015 INFO [train.py:996] (2/4) Epoch 1, batch 4500, loss[loss=0.3864, simple_loss=0.4229, pruned_loss=0.1749, over 21211.00 frames. ], tot_loss[loss=0.4313, simple_loss=0.451, pruned_loss=0.2058, over 4275388.88 frames. ], batch size: 159, lr: 4.17e-02, grad_scale: 8.0 2023-06-17 19:16:07,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.40 vs. 
limit=12.0 2023-06-17 19:16:19,148 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.691e+02 6.117e+02 8.779e+02 1.856e+03, threshold=1.223e+03, percent-clipped=14.0 2023-06-17 19:16:24,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=27060.0, ans=0.2 2023-06-17 19:16:46,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-17 19:17:22,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=27240.0, ans=0.07 2023-06-17 19:17:33,407 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:17:36,037 INFO [train.py:996] (2/4) Epoch 1, batch 4550, loss[loss=0.5595, simple_loss=0.537, pruned_loss=0.291, over 21826.00 frames. ], tot_loss[loss=0.4316, simple_loss=0.4535, pruned_loss=0.2049, over 4278467.39 frames. ], batch size: 441, lr: 4.16e-02, grad_scale: 8.0 2023-06-17 19:17:59,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=27360.0, ans=0.125 2023-06-17 19:18:14,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=27360.0, ans=0.125 2023-06-17 19:18:17,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=27360.0, ans=0.004921739130434783 2023-06-17 19:18:46,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=27480.0, ans=0.2 2023-06-17 19:18:54,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=27480.0, ans=0.2 2023-06-17 19:19:00,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=25.05 vs. limit=22.5 2023-06-17 19:19:19,250 INFO [train.py:996] (2/4) Epoch 1, batch 4600, loss[loss=0.3583, simple_loss=0.4007, pruned_loss=0.1579, over 21622.00 frames. ], tot_loss[loss=0.4339, simple_loss=0.4555, pruned_loss=0.2061, over 4279695.14 frames. ], batch size: 263, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 19:19:34,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=27660.0, ans=0.2 2023-06-17 19:19:46,059 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.894e+02 4.493e+02 6.587e+02 9.549e+02 1.987e+03, threshold=1.317e+03, percent-clipped=15.0 2023-06-17 19:19:46,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=27660.0, ans=0.125 2023-06-17 19:20:58,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=27840.0, ans=0.125 2023-06-17 19:21:02,844 INFO [train.py:996] (2/4) Epoch 1, batch 4650, loss[loss=0.3776, simple_loss=0.3817, pruned_loss=0.1868, over 20313.00 frames. ], tot_loss[loss=0.4213, simple_loss=0.4435, pruned_loss=0.1995, over 4276061.82 frames. 
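The learning rate in these entries decays very slowly (4.45e-02 near the top of this excerpt, 4.15e-02 here) rather than stepping abruptly. The function below sketches an Eden-style inverse power decay in both batch count and fractional epoch; it is one plausible schedule of that family, shown only to illustrate the shape of the decay, and is not a claim about the exact formula or constants that produced these logged values.

```python
def eden_like_lr(base_lr: float, batch: float, epoch: float,
                 lr_batches: float = 7500.0, lr_epochs: float = 1.5) -> float:
    """Illustrative schedule; lr_batches and lr_epochs are placeholder constants."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# The decay is gentle: over tens of thousands of batches the lr only drops by a few percent.
for b in (10000, 20000, 30000):
    print(b, f"{eden_like_lr(0.045, b, epoch=0.3):.4e}")
```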
], batch size: 703, lr: 4.15e-02, grad_scale: 8.0 2023-06-17 19:22:03,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=28080.0, ans=0.004765217391304348 2023-06-17 19:22:26,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=28140.0, ans=0.2 2023-06-17 19:22:40,793 INFO [train.py:996] (2/4) Epoch 1, batch 4700, loss[loss=0.3969, simple_loss=0.3983, pruned_loss=0.1977, over 21465.00 frames. ], tot_loss[loss=0.4119, simple_loss=0.4333, pruned_loss=0.1953, over 4270987.33 frames. ], batch size: 195, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 19:23:12,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.969e+02 4.560e+02 5.738e+02 6.731e+02 1.328e+03, threshold=1.148e+03, percent-clipped=1.0 2023-06-17 19:23:15,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=28260.0, ans=0.0 2023-06-17 19:23:16,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=28260.0, ans=0.125 2023-06-17 19:24:09,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=28440.0, ans=0.125 2023-06-17 19:24:15,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=28440.0, ans=0.0 2023-06-17 19:24:22,967 INFO [train.py:996] (2/4) Epoch 1, batch 4750, loss[loss=0.4748, simple_loss=0.4595, pruned_loss=0.245, over 21865.00 frames. ], tot_loss[loss=0.4092, simple_loss=0.4276, pruned_loss=0.1955, over 4264906.68 frames. ], batch size: 371, lr: 4.14e-02, grad_scale: 8.0 2023-06-17 19:24:42,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28500.0, ans=0.1 2023-06-17 19:24:42,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=28500.0, ans=0.125 2023-06-17 19:24:45,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=28560.0, ans=0.2 2023-06-17 19:25:13,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=28620.0, ans=0.5 2023-06-17 19:25:18,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=28620.0, ans=0.004647826086956522 2023-06-17 19:25:25,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-17 19:25:31,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=28680.0, ans=0.125 2023-06-17 19:25:48,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=28740.0, ans=0.0046217391304347825 2023-06-17 19:25:56,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=28740.0, ans=0.1 2023-06-17 19:26:08,429 INFO [train.py:996] (2/4) Epoch 1, batch 4800, loss[loss=0.3683, simple_loss=0.3999, pruned_loss=0.1684, over 21635.00 frames. ], tot_loss[loss=0.4117, simple_loss=0.4299, pruned_loss=0.1968, over 4265324.30 frames. 
], batch size: 247, lr: 4.13e-02, grad_scale: 16.0 2023-06-17 19:26:08,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=28800.0, ans=0.125 2023-06-17 19:26:29,225 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:26:40,349 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.777e+02 4.396e+02 5.630e+02 9.544e+02 1.768e+03, threshold=1.126e+03, percent-clipped=14.0 2023-06-17 19:27:06,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=28980.0, ans=0.125 2023-06-17 19:27:09,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=28980.0, ans=0.125 2023-06-17 19:27:14,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0 2023-06-17 19:27:44,668 INFO [train.py:996] (2/4) Epoch 1, batch 4850, loss[loss=0.3835, simple_loss=0.4062, pruned_loss=0.1803, over 21434.00 frames. ], tot_loss[loss=0.4113, simple_loss=0.4284, pruned_loss=0.1971, over 4266950.23 frames. ], batch size: 211, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 19:28:11,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=29160.0, ans=0.04949747468305833 2023-06-17 19:28:33,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=29220.0, ans=0.125 2023-06-17 19:28:48,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=29220.0, ans=0.1 2023-06-17 19:29:01,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=29280.0, ans=0.125 2023-06-17 19:29:10,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=29280.0, ans=0.004504347826086956 2023-06-17 19:29:14,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=29340.0, ans=0.125 2023-06-17 19:29:29,300 INFO [train.py:996] (2/4) Epoch 1, batch 4900, loss[loss=0.363, simple_loss=0.4137, pruned_loss=0.1561, over 21332.00 frames. ], tot_loss[loss=0.4158, simple_loss=0.4328, pruned_loss=0.1994, over 4272428.24 frames. 
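
The optim.py:471 lines summarise the gradient-norm statistics behind clipping: five quantile values over a window of recent norms, a threshold, and a percent-clipped figure. In every entry the threshold equals 2.0 × the median (for example the 19:26:40 report: median 5.630e+02, threshold 1.126e+03), matching Clipping_scale=2.0. The sketch below only mirrors that bookkeeping; the window size and the way percent-clipped is accumulated are assumptions, not the actual ScaledAdam code.

import numpy as np

def clipping_report(recent_grad_norms, clipping_scale=2.0):
    # Quantiles (min/25%/50%/75%/max) of a window of recent gradient norms.
    # The clipping threshold is clipping_scale times the median; norms above
    # it would be scaled down before the optimizer step.
    q = np.percentile(recent_grad_norms, [0, 25, 50, 75, 100])
    threshold = clipping_scale * q[2]
    percent_clipped = 100.0 * np.mean(np.asarray(recent_grad_norms) > threshold)
    return q, threshold, percent_clipped
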
], batch size: 176, lr: 4.12e-02, grad_scale: 16.0 2023-06-17 19:29:58,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29460.0, ans=0.1 2023-06-17 19:30:02,448 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 4.351e+02 5.424e+02 7.801e+02 1.566e+03, threshold=1.085e+03, percent-clipped=9.0 2023-06-17 19:30:19,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=29520.0, ans=0.125 2023-06-17 19:30:24,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=29520.0, ans=0.004452173913043478 2023-06-17 19:30:38,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=29580.0, ans=0.2 2023-06-17 19:31:12,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=29640.0, ans=0.1 2023-06-17 19:31:24,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=29700.0, ans=0.05 2023-06-17 19:31:25,693 INFO [train.py:996] (2/4) Epoch 1, batch 4950, loss[loss=0.3254, simple_loss=0.3862, pruned_loss=0.1323, over 21284.00 frames. ], tot_loss[loss=0.414, simple_loss=0.4357, pruned_loss=0.1961, over 4274641.80 frames. ], batch size: 176, lr: 4.11e-02, grad_scale: 16.0 2023-06-17 19:32:12,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=29820.0, ans=0.125 2023-06-17 19:33:07,744 INFO [train.py:996] (2/4) Epoch 1, batch 5000, loss[loss=0.3102, simple_loss=0.3758, pruned_loss=0.1223, over 21476.00 frames. ], tot_loss[loss=0.408, simple_loss=0.4336, pruned_loss=0.1912, over 4273389.33 frames. ], batch size: 194, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 19:33:17,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=30000.0, ans=0.125 2023-06-17 19:33:34,080 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 4.453e+02 5.189e+02 7.873e+02 1.529e+03, threshold=1.038e+03, percent-clipped=6.0 2023-06-17 19:33:56,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=30120.0, ans=0.1 2023-06-17 19:34:05,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=30180.0, ans=0.0 2023-06-17 19:34:13,912 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-17 19:34:15,510 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-17 19:34:31,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=30240.0, ans=0.0 2023-06-17 19:34:45,169 INFO [train.py:996] (2/4) Epoch 1, batch 5050, loss[loss=0.4381, simple_loss=0.4441, pruned_loss=0.2161, over 21346.00 frames. ], tot_loss[loss=0.4102, simple_loss=0.4342, pruned_loss=0.1931, over 4269650.33 frames. 
], batch size: 159, lr: 4.10e-02, grad_scale: 16.0 2023-06-17 19:34:46,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=30300.0, ans=0.125 2023-06-17 19:34:49,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=30300.0, ans=0.2 2023-06-17 19:35:45,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=30480.0, ans=0.125 2023-06-17 19:36:08,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=30540.0, ans=0.1 2023-06-17 19:36:27,891 INFO [train.py:996] (2/4) Epoch 1, batch 5100, loss[loss=0.4627, simple_loss=0.5491, pruned_loss=0.1881, over 19669.00 frames. ], tot_loss[loss=0.4133, simple_loss=0.437, pruned_loss=0.1948, over 4272306.31 frames. ], batch size: 702, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 19:36:49,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-17 19:36:59,588 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.656e+02 4.521e+02 5.607e+02 7.657e+02 1.284e+03, threshold=1.121e+03, percent-clipped=8.0 2023-06-17 19:38:11,328 INFO [train.py:996] (2/4) Epoch 1, batch 5150, loss[loss=0.4978, simple_loss=0.5623, pruned_loss=0.2167, over 19788.00 frames. ], tot_loss[loss=0.4134, simple_loss=0.4365, pruned_loss=0.1952, over 4278742.27 frames. ], batch size: 702, lr: 4.09e-02, grad_scale: 16.0 2023-06-17 19:38:32,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=30960.0, ans=0.125 2023-06-17 19:39:25,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=31080.0, ans=0.125 2023-06-17 19:40:01,204 INFO [train.py:996] (2/4) Epoch 1, batch 5200, loss[loss=0.3484, simple_loss=0.3803, pruned_loss=0.1582, over 21287.00 frames. ], tot_loss[loss=0.4102, simple_loss=0.4332, pruned_loss=0.1936, over 4267712.91 frames. ], batch size: 176, lr: 4.08e-02, grad_scale: 32.0 2023-06-17 19:40:08,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=31200.0, ans=0.1 2023-06-17 19:40:19,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=31200.0, ans=0.2 2023-06-17 19:40:26,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. 
limit=15.0 2023-06-17 19:40:27,287 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.519e+02 4.450e+02 5.949e+02 9.427e+02 1.654e+03, threshold=1.190e+03, percent-clipped=14.0 2023-06-17 19:40:55,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=31320.0, ans=0.125 2023-06-17 19:41:02,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=31380.0, ans=0.125 2023-06-17 19:41:03,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=31380.0, ans=0.2 2023-06-17 19:41:44,216 INFO [train.py:996] (2/4) Epoch 1, batch 5250, loss[loss=0.4053, simple_loss=0.4642, pruned_loss=0.1732, over 21726.00 frames. ], tot_loss[loss=0.4093, simple_loss=0.4367, pruned_loss=0.191, over 4273513.94 frames. ], batch size: 351, lr: 4.07e-02, grad_scale: 16.0 2023-06-17 19:41:59,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31500.0, ans=0.1 2023-06-17 19:42:30,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=31620.0, ans=0.1 2023-06-17 19:42:58,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-06-17 19:43:25,585 INFO [train.py:996] (2/4) Epoch 1, batch 5300, loss[loss=0.399, simple_loss=0.4163, pruned_loss=0.1908, over 21895.00 frames. ], tot_loss[loss=0.4106, simple_loss=0.4364, pruned_loss=0.1924, over 4279908.35 frames. ], batch size: 298, lr: 4.07e-02, grad_scale: 16.0 2023-06-17 19:43:53,704 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 4.200e+02 5.076e+02 7.002e+02 1.420e+03, threshold=1.015e+03, percent-clipped=3.0 2023-06-17 19:43:58,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.04 vs. limit=15.0 2023-06-17 19:44:12,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=31920.0, ans=0.125 2023-06-17 19:44:17,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-17 19:45:07,641 INFO [train.py:996] (2/4) Epoch 1, batch 5350, loss[loss=0.4681, simple_loss=0.5407, pruned_loss=0.1978, over 19588.00 frames. ], tot_loss[loss=0.4124, simple_loss=0.4359, pruned_loss=0.1945, over 4282332.53 frames. 
], batch size: 702, lr: 4.06e-02, grad_scale: 16.0 2023-06-17 19:45:23,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=32100.0, ans=0.0 2023-06-17 19:45:52,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=32220.0, ans=0.125 2023-06-17 19:45:58,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=32220.0, ans=0.025 2023-06-17 19:46:36,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=32340.0, ans=0.125 2023-06-17 19:46:42,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=32340.0, ans=0.1 2023-06-17 19:46:53,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=32400.0, ans=0.125 2023-06-17 19:46:55,001 INFO [train.py:996] (2/4) Epoch 1, batch 5400, loss[loss=0.4322, simple_loss=0.4405, pruned_loss=0.2119, over 21049.00 frames. ], tot_loss[loss=0.4135, simple_loss=0.4349, pruned_loss=0.196, over 4279504.00 frames. ], batch size: 607, lr: 4.05e-02, grad_scale: 16.0 2023-06-17 19:47:23,226 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 4.680e+02 5.760e+02 7.952e+02 1.690e+03, threshold=1.152e+03, percent-clipped=11.0 2023-06-17 19:47:26,674 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 19:47:33,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=32520.0, ans=0.0 2023-06-17 19:48:03,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32580.0, ans=0.1 2023-06-17 19:48:15,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=32640.0, ans=0.125 2023-06-17 19:48:18,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=32640.0, ans=0.0037739130434782603 2023-06-17 19:48:19,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.99 vs. limit=15.0 2023-06-17 19:48:22,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=32640.0, ans=0.1 2023-06-17 19:48:38,281 INFO [train.py:996] (2/4) Epoch 1, batch 5450, loss[loss=0.3555, simple_loss=0.3924, pruned_loss=0.1593, over 21453.00 frames. ], tot_loss[loss=0.4107, simple_loss=0.4338, pruned_loss=0.1938, over 4282522.94 frames. ], batch size: 212, lr: 4.05e-02, grad_scale: 16.0 2023-06-17 19:48:55,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=32700.0, ans=0.0 2023-06-17 19:49:24,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.10 vs. 
limit=22.5 2023-06-17 19:49:35,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=32820.0, ans=0.125 2023-06-17 19:49:39,205 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-17 19:50:11,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=32940.0, ans=0.125 2023-06-17 19:50:26,867 INFO [train.py:996] (2/4) Epoch 1, batch 5500, loss[loss=0.4221, simple_loss=0.4946, pruned_loss=0.1748, over 21769.00 frames. ], tot_loss[loss=0.4078, simple_loss=0.4381, pruned_loss=0.1888, over 4286821.07 frames. ], batch size: 351, lr: 4.04e-02, grad_scale: 16.0 2023-06-17 19:50:35,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=33000.0, ans=0.5 2023-06-17 19:50:49,564 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.509e+02 4.085e+02 5.638e+02 7.299e+02 1.416e+03, threshold=1.128e+03, percent-clipped=6.0 2023-06-17 19:50:52,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.52 vs. limit=15.0 2023-06-17 19:51:25,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=33120.0, ans=0.003669565217391305 2023-06-17 19:52:13,379 INFO [train.py:996] (2/4) Epoch 1, batch 5550, loss[loss=0.251, simple_loss=0.3082, pruned_loss=0.09693, over 21735.00 frames. ], tot_loss[loss=0.3972, simple_loss=0.4323, pruned_loss=0.181, over 4290253.28 frames. ], batch size: 124, lr: 4.03e-02, grad_scale: 16.0 2023-06-17 19:53:01,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=33420.0, ans=0.125 2023-06-17 19:53:56,962 INFO [train.py:996] (2/4) Epoch 1, batch 5600, loss[loss=0.3244, simple_loss=0.3802, pruned_loss=0.1343, over 21281.00 frames. ], tot_loss[loss=0.3916, simple_loss=0.4296, pruned_loss=0.1768, over 4286837.42 frames. ], batch size: 176, lr: 4.03e-02, grad_scale: 32.0 2023-06-17 19:54:07,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=33600.0, ans=0.125 2023-06-17 19:54:07,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=33600.0, ans=0.0 2023-06-17 19:54:29,863 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 4.088e+02 5.346e+02 7.510e+02 1.919e+03, threshold=1.069e+03, percent-clipped=8.0 2023-06-17 19:54:35,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-17 19:54:35,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-17 19:55:03,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=33780.0, ans=0.125 2023-06-17 19:55:38,032 INFO [train.py:996] (2/4) Epoch 1, batch 5650, loss[loss=0.3987, simple_loss=0.4187, pruned_loss=0.1894, over 21895.00 frames. ], tot_loss[loss=0.3948, simple_loss=0.4307, pruned_loss=0.1795, over 4282477.91 frames. 
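
The lr value in each batch summary decays slowly with the batch index (4.16e-02 near batch 4550, 4.03e-02 near batch 5600). That decay is consistent with an Eden-style schedule; the sketch below assumes the usual form and plugs in base_lr=0.045, lr_batches=7500 and lr_epochs=1.5 as illustrative settings, with the fractional epoch only guessed.

def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
    # Eden-style schedule (assumed form): two inverse-fourth-root decay factors,
    # one driven by the batch count and one by the (fractional) epoch count.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# With base_lr=0.045 and a rough fractional epoch:
#   eden_lr(0.045, 4550, 0.2)  -> 0.0414   (logged near batch 4550: 4.16e-02)
#   eden_lr(0.045, 5600, 0.25) -> 0.0400   (logged near batch 5600: 4.03e-02)
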
], batch size: 316, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 19:57:21,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=34200.0, ans=0.1 2023-06-17 19:57:27,472 INFO [train.py:996] (2/4) Epoch 1, batch 5700, loss[loss=0.4208, simple_loss=0.4587, pruned_loss=0.1914, over 21573.00 frames. ], tot_loss[loss=0.3969, simple_loss=0.4298, pruned_loss=0.182, over 4278834.52 frames. ], batch size: 441, lr: 4.02e-02, grad_scale: 32.0 2023-06-17 19:58:00,890 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.144e+02 5.223e+02 7.602e+02 1.708e+03, threshold=1.045e+03, percent-clipped=9.0 2023-06-17 19:58:10,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=34320.0, ans=0.125 2023-06-17 19:58:27,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=34320.0, ans=0.1 2023-06-17 19:59:11,784 INFO [train.py:996] (2/4) Epoch 1, batch 5750, loss[loss=0.3571, simple_loss=0.4141, pruned_loss=0.1501, over 21727.00 frames. ], tot_loss[loss=0.3914, simple_loss=0.426, pruned_loss=0.1784, over 4279494.99 frames. ], batch size: 351, lr: 4.01e-02, grad_scale: 32.0 2023-06-17 19:59:51,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=34560.0, ans=0.09899494936611666 2023-06-17 20:00:45,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=34740.0, ans=0.2 2023-06-17 20:00:46,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=34740.0, ans=0.125 2023-06-17 20:00:58,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=34800.0, ans=0.125 2023-06-17 20:00:59,964 INFO [train.py:996] (2/4) Epoch 1, batch 5800, loss[loss=0.459, simple_loss=0.4568, pruned_loss=0.2306, over 21208.00 frames. ], tot_loss[loss=0.3865, simple_loss=0.424, pruned_loss=0.1745, over 4286212.36 frames. ], batch size: 607, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 20:01:28,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.873e+02 4.586e+02 6.036e+02 1.114e+03, threshold=9.172e+02, percent-clipped=1.0 2023-06-17 20:02:10,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=34980.0, ans=0.0 2023-06-17 20:02:14,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.80 vs. limit=6.0 2023-06-17 20:02:14,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=15.0 2023-06-17 20:02:37,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=35040.0, ans=0.0 2023-06-17 20:02:43,518 INFO [train.py:996] (2/4) Epoch 1, batch 5850, loss[loss=0.2149, simple_loss=0.2894, pruned_loss=0.07018, over 21746.00 frames. ], tot_loss[loss=0.3748, simple_loss=0.4183, pruned_loss=0.1657, over 4284701.06 frames. 
], batch size: 124, lr: 4.00e-02, grad_scale: 32.0 2023-06-17 20:04:20,753 INFO [train.py:996] (2/4) Epoch 1, batch 5900, loss[loss=0.4117, simple_loss=0.429, pruned_loss=0.1972, over 21749.00 frames. ], tot_loss[loss=0.3571, simple_loss=0.4051, pruned_loss=0.1546, over 4286899.13 frames. ], batch size: 441, lr: 3.99e-02, grad_scale: 32.0 2023-06-17 20:04:23,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.32 vs. limit=15.0 2023-06-17 20:04:35,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=35400.0, ans=0.125 2023-06-17 20:04:48,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.809e+02 3.363e+02 4.037e+02 5.226e+02 1.298e+03, threshold=8.074e+02, percent-clipped=7.0 2023-06-17 20:05:08,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=35520.0, ans=0.003147826086956522 2023-06-17 20:06:08,247 INFO [train.py:996] (2/4) Epoch 1, batch 5950, loss[loss=0.3878, simple_loss=0.4023, pruned_loss=0.1866, over 21753.00 frames. ], tot_loss[loss=0.3661, simple_loss=0.4073, pruned_loss=0.1625, over 4284780.50 frames. ], batch size: 112, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 20:06:11,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=35700.0, ans=0.1 2023-06-17 20:06:32,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.83 vs. limit=12.0 2023-06-17 20:07:31,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=35940.0, ans=0.0 2023-06-17 20:07:37,793 INFO [train.py:996] (2/4) Epoch 1, batch 6000, loss[loss=0.3476, simple_loss=0.3643, pruned_loss=0.1654, over 21606.00 frames. ], tot_loss[loss=0.3729, simple_loss=0.4069, pruned_loss=0.1695, over 4285204.50 frames. ], batch size: 247, lr: 3.98e-02, grad_scale: 32.0 2023-06-17 20:07:37,794 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-17 20:07:56,509 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3636, simple_loss=0.4388, pruned_loss=0.1442, over 1796401.00 frames. 2023-06-17 20:07:56,510 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-17 20:08:10,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=36000.0, ans=0.125 2023-06-17 20:08:19,464 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.037e+02 4.782e+02 6.358e+02 7.928e+02 1.970e+03, threshold=1.272e+03, percent-clipped=23.0 2023-06-17 20:08:24,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=36060.0, ans=0.2 2023-06-17 20:09:24,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=36240.0, ans=0.0 2023-06-17 20:09:34,152 INFO [train.py:996] (2/4) Epoch 1, batch 6050, loss[loss=0.3359, simple_loss=0.3562, pruned_loss=0.1578, over 21564.00 frames. ], tot_loss[loss=0.3733, simple_loss=0.4026, pruned_loss=0.1721, over 4276595.15 frames. 
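
The scaling.py:182 lines track ScheduledFloat values: hyperparameters such as conv_skip_rate, bypass scale_min and balancer probabilities that are functions of batch_count rather than constants, which is why the same name keeps reappearing with a new ans= value. A minimal stand-in for such a schedule, assuming simple piecewise-linear interpolation between (batch_count, value) breakpoints:

class PiecewiseLinearSchedule:
    # A float interpolated linearly between (batch_count, value) breakpoints and
    # clamped to the end values outside that range (assumed semantics for the
    # ScheduledFloat objects in the log).
    def __init__(self, *points):
        self.points = sorted(points)

    def __call__(self, batch_count):
        pts = self.points
        if batch_count <= pts[0][0]:
            return pts[0][1]
        if batch_count >= pts[-1][0]:
            return pts[-1][1]
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            if x0 <= batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

# Hypothetical schedule: a skip rate decaying from 0.5 to 0.0 over 30k batches.
conv_skip_rate = PiecewiseLinearSchedule((0.0, 0.5), (30000.0, 0.0))
print(conv_skip_rate(27360.0))  # ~0.044, the same flavour as the logged ans= values
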
], batch size: 213, lr: 3.97e-02, grad_scale: 32.0 2023-06-17 20:09:45,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=36300.0, ans=0.0029782608695652175 2023-06-17 20:09:46,245 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-17 20:09:54,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-17 20:11:15,946 INFO [train.py:996] (2/4) Epoch 1, batch 6100, loss[loss=0.4729, simple_loss=0.4664, pruned_loss=0.2397, over 21914.00 frames. ], tot_loss[loss=0.3725, simple_loss=0.4026, pruned_loss=0.1712, over 4270685.66 frames. ], batch size: 124, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 20:11:27,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=36600.0, ans=0.0 2023-06-17 20:11:29,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=36600.0, ans=0.125 2023-06-17 20:11:38,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 4.105e+02 5.881e+02 8.261e+02 1.678e+03, threshold=1.176e+03, percent-clipped=6.0 2023-06-17 20:12:06,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=36720.0, ans=0.5 2023-06-17 20:12:19,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=36780.0, ans=0.0028739130434782606 2023-06-17 20:12:57,173 INFO [train.py:996] (2/4) Epoch 1, batch 6150, loss[loss=0.3684, simple_loss=0.4061, pruned_loss=0.1653, over 21770.00 frames. ], tot_loss[loss=0.3782, simple_loss=0.4056, pruned_loss=0.1754, over 4270233.56 frames. ], batch size: 282, lr: 3.96e-02, grad_scale: 32.0 2023-06-17 20:13:05,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=36900.0, ans=0.0 2023-06-17 20:13:05,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=36900.0, ans=0.2 2023-06-17 20:13:40,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=37020.0, ans=0.0 2023-06-17 20:14:22,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=37140.0, ans=0.1 2023-06-17 20:14:39,011 INFO [train.py:996] (2/4) Epoch 1, batch 6200, loss[loss=0.5521, simple_loss=0.5679, pruned_loss=0.2681, over 21573.00 frames. ], tot_loss[loss=0.3791, simple_loss=0.4076, pruned_loss=0.1752, over 4265888.11 frames. ], batch size: 471, lr: 3.95e-02, grad_scale: 32.0 2023-06-17 20:14:50,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.40 vs. 
limit=15.0 2023-06-17 20:15:07,350 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.629e+02 4.899e+02 6.626e+02 1.862e+03, threshold=9.798e+02, percent-clipped=4.0 2023-06-17 20:15:45,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=37380.0, ans=0.125 2023-06-17 20:16:22,551 INFO [train.py:996] (2/4) Epoch 1, batch 6250, loss[loss=0.3717, simple_loss=0.4369, pruned_loss=0.1533, over 21765.00 frames. ], tot_loss[loss=0.3792, simple_loss=0.4105, pruned_loss=0.1739, over 4266220.93 frames. ], batch size: 332, lr: 3.94e-02, grad_scale: 32.0 2023-06-17 20:16:23,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=37500.0, ans=0.125 2023-06-17 20:18:03,414 INFO [train.py:996] (2/4) Epoch 1, batch 6300, loss[loss=0.4012, simple_loss=0.4215, pruned_loss=0.1904, over 21867.00 frames. ], tot_loss[loss=0.3808, simple_loss=0.4159, pruned_loss=0.1728, over 4266332.48 frames. ], batch size: 124, lr: 3.94e-02, grad_scale: 32.0 2023-06-17 20:18:36,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-17 20:18:41,523 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.261e+02 6.027e+02 8.452e+02 1.541e+03, threshold=1.205e+03, percent-clipped=13.0 2023-06-17 20:18:45,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=37860.0, ans=0.125 2023-06-17 20:18:49,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=15.0 2023-06-17 20:19:13,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-17 20:19:41,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=38040.0, ans=0.125 2023-06-17 20:19:46,074 INFO [train.py:996] (2/4) Epoch 1, batch 6350, loss[loss=0.4001, simple_loss=0.429, pruned_loss=0.1856, over 21334.00 frames. ], tot_loss[loss=0.3943, simple_loss=0.4249, pruned_loss=0.1819, over 4271392.19 frames. ], batch size: 176, lr: 3.93e-02, grad_scale: 32.0 2023-06-17 20:20:04,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=12.0 2023-06-17 20:20:52,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=38220.0, ans=0.0025608695652173915 2023-06-17 20:21:46,065 INFO [train.py:996] (2/4) Epoch 1, batch 6400, loss[loss=0.4104, simple_loss=0.4327, pruned_loss=0.194, over 21358.00 frames. ], tot_loss[loss=0.4067, simple_loss=0.4344, pruned_loss=0.1895, over 4270587.82 frames. 
], batch size: 159, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 20:22:03,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=38460.0, ans=0.125 2023-06-17 20:22:15,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.706e+02 4.358e+02 5.224e+02 7.258e+02 1.926e+03, threshold=1.045e+03, percent-clipped=7.0 2023-06-17 20:22:39,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-17 20:23:29,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=38700.0, ans=0.2 2023-06-17 20:23:30,214 INFO [train.py:996] (2/4) Epoch 1, batch 6450, loss[loss=0.3125, simple_loss=0.3701, pruned_loss=0.1274, over 21819.00 frames. ], tot_loss[loss=0.4007, simple_loss=0.432, pruned_loss=0.1847, over 4272514.55 frames. ], batch size: 118, lr: 3.92e-02, grad_scale: 32.0 2023-06-17 20:24:27,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=38880.0, ans=0.125 2023-06-17 20:25:09,312 INFO [train.py:996] (2/4) Epoch 1, batch 6500, loss[loss=0.3322, simple_loss=0.3653, pruned_loss=0.1495, over 21770.00 frames. ], tot_loss[loss=0.3954, simple_loss=0.4241, pruned_loss=0.1834, over 4265756.49 frames. ], batch size: 102, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 20:25:37,747 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.797e+02 4.922e+02 6.987e+02 1.536e+03, threshold=9.843e+02, percent-clipped=9.0 2023-06-17 20:26:06,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=39180.0, ans=0.0023521739130434784 2023-06-17 20:26:33,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2023-06-17 20:26:38,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=39240.0, ans=0.125 2023-06-17 20:26:48,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=39300.0, ans=0.1 2023-06-17 20:26:49,851 INFO [train.py:996] (2/4) Epoch 1, batch 6550, loss[loss=0.4329, simple_loss=0.445, pruned_loss=0.2104, over 21807.00 frames. ], tot_loss[loss=0.3947, simple_loss=0.4246, pruned_loss=0.1824, over 4271587.74 frames. ], batch size: 332, lr: 3.91e-02, grad_scale: 32.0 2023-06-17 20:27:25,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=39360.0, ans=0.07 2023-06-17 20:27:38,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=39420.0, ans=0.0 2023-06-17 20:28:28,293 INFO [train.py:996] (2/4) Epoch 1, batch 6600, loss[loss=0.4358, simple_loss=0.4227, pruned_loss=0.2244, over 21379.00 frames. ], tot_loss[loss=0.393, simple_loss=0.421, pruned_loss=0.1825, over 4268276.64 frames. ], batch size: 508, lr: 3.90e-02, grad_scale: 16.0 2023-06-17 20:28:39,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. 
limit=22.5 2023-06-17 20:28:57,768 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.389e+02 4.216e+02 5.009e+02 6.284e+02 1.954e+03, threshold=1.002e+03, percent-clipped=7.0 2023-06-17 20:29:07,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=39720.0, ans=0.2 2023-06-17 20:30:17,389 INFO [train.py:996] (2/4) Epoch 1, batch 6650, loss[loss=0.3878, simple_loss=0.4134, pruned_loss=0.1811, over 21847.00 frames. ], tot_loss[loss=0.384, simple_loss=0.4133, pruned_loss=0.1774, over 4272096.48 frames. ], batch size: 373, lr: 3.89e-02, grad_scale: 16.0 2023-06-17 20:30:46,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=39960.0, ans=0.125 2023-06-17 20:31:20,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=40080.0, ans=0.2 2023-06-17 20:31:35,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=40080.0, ans=0.2 2023-06-17 20:31:47,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=40140.0, ans=0.0021434782608695645 2023-06-17 20:31:54,715 INFO [train.py:996] (2/4) Epoch 1, batch 6700, loss[loss=0.3478, simple_loss=0.3565, pruned_loss=0.1696, over 21427.00 frames. ], tot_loss[loss=0.3792, simple_loss=0.4071, pruned_loss=0.1756, over 4273956.54 frames. ], batch size: 212, lr: 3.89e-02, grad_scale: 16.0 2023-06-17 20:32:23,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=40260.0, ans=0.2 2023-06-17 20:32:24,892 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.374e+02 3.778e+02 4.910e+02 6.670e+02 1.888e+03, threshold=9.820e+02, percent-clipped=8.0 2023-06-17 20:32:39,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=15.0 2023-06-17 20:32:40,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff2.min_abs, batch_count=40320.0, ans=0.1 2023-06-17 20:33:26,037 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. limit=10.0 2023-06-17 20:33:38,610 INFO [train.py:996] (2/4) Epoch 1, batch 6750, loss[loss=0.3556, simple_loss=0.3755, pruned_loss=0.1678, over 15228.00 frames. ], tot_loss[loss=0.3779, simple_loss=0.4052, pruned_loss=0.1753, over 4265954.85 frames. ], batch size: 60, lr: 3.88e-02, grad_scale: 16.0 2023-06-17 20:35:21,759 INFO [train.py:996] (2/4) Epoch 1, batch 6800, loss[loss=0.4258, simple_loss=0.41, pruned_loss=0.2208, over 21415.00 frames. ], tot_loss[loss=0.3824, simple_loss=0.4064, pruned_loss=0.1792, over 4266997.56 frames. 
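
The scaling.py:962 lines are whitening diagnostics: each module reports a metric against a limit (for example metric=4.49 vs. limit=10.0 just above), and a penalty is presumably applied only while the metric exceeds the limit. One plausible metric with that behaviour, equal to 1.0 for a perfectly "white" (isotropic) channel covariance and growing with the eigenvalue spread, is sketched below; the exact formula used by the Whiten module is an assumption here.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    # Eigenvalue-spread statistic of the channel covariance of x (frames x channels):
    # group_size * sum(cov**2) / trace(cov)**2, averaged over channel groups.
    # 1.0 means perfectly white; larger values mean a more lopsided spectrum.
    n, c = x.shape
    group_size = c // num_groups
    metrics = []
    for g in range(num_groups):
        xg = x[:, g * group_size:(g + 1) * group_size]
        xg = xg - xg.mean(dim=0, keepdim=True)
        cov = xg.t() @ xg / n
        metrics.append(group_size * (cov ** 2).sum() / cov.trace() ** 2)
    return float(torch.stack(metrics).mean())

# Decorrelated, equal-variance channels give a metric near 1:
torch.manual_seed(0)
print(whitening_metric(torch.randn(10000, 256)))   # ~1.03
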
], batch size: 508, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 20:35:38,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=40800.0, ans=0.05 2023-06-17 20:35:50,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.963e+02 4.135e+02 5.210e+02 7.018e+02 1.112e+03, threshold=1.042e+03, percent-clipped=5.0 2023-06-17 20:36:23,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=40980.0, ans=0.125 2023-06-17 20:36:43,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.85 vs. limit=15.0 2023-06-17 20:36:50,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=41040.0, ans=0.0 2023-06-17 20:37:01,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=41100.0, ans=0.0 2023-06-17 20:37:02,295 INFO [train.py:996] (2/4) Epoch 1, batch 6850, loss[loss=0.3819, simple_loss=0.391, pruned_loss=0.1864, over 21507.00 frames. ], tot_loss[loss=0.3806, simple_loss=0.4021, pruned_loss=0.1795, over 4271584.56 frames. ], batch size: 194, lr: 3.87e-02, grad_scale: 32.0 2023-06-17 20:37:10,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=41100.0, ans=0.125 2023-06-17 20:37:49,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=41220.0, ans=0.1 2023-06-17 20:38:18,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=41280.0, ans=0.125 2023-06-17 20:38:44,862 INFO [train.py:996] (2/4) Epoch 1, batch 6900, loss[loss=0.3625, simple_loss=0.3929, pruned_loss=0.1661, over 21336.00 frames. ], tot_loss[loss=0.3822, simple_loss=0.4031, pruned_loss=0.1806, over 4273243.47 frames. ], batch size: 159, lr: 3.86e-02, grad_scale: 32.0 2023-06-17 20:38:48,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=41400.0, ans=0.125 2023-06-17 20:39:15,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 4.043e+02 5.136e+02 6.723e+02 1.147e+03, threshold=1.027e+03, percent-clipped=4.0 2023-06-17 20:40:33,702 INFO [train.py:996] (2/4) Epoch 1, batch 6950, loss[loss=0.372, simple_loss=0.4094, pruned_loss=0.1673, over 21092.00 frames. ], tot_loss[loss=0.3753, simple_loss=0.4023, pruned_loss=0.1741, over 4269113.64 frames. ], batch size: 607, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 20:41:40,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=41880.0, ans=0.125 2023-06-17 20:41:52,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-17 20:41:59,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-17 20:42:15,409 INFO [train.py:996] (2/4) Epoch 1, batch 7000, loss[loss=0.4164, simple_loss=0.4154, pruned_loss=0.2087, over 21440.00 frames. 
], tot_loss[loss=0.3837, simple_loss=0.4077, pruned_loss=0.1798, over 4272419.05 frames. ], batch size: 389, lr: 3.85e-02, grad_scale: 32.0 2023-06-17 20:42:25,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.92 vs. limit=15.0 2023-06-17 20:42:40,008 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.856e+02 5.678e+02 7.793e+02 1.284e+03, threshold=1.136e+03, percent-clipped=9.0 2023-06-17 20:43:46,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-17 20:43:58,217 INFO [train.py:996] (2/4) Epoch 1, batch 7050, loss[loss=0.3675, simple_loss=0.4006, pruned_loss=0.1672, over 21626.00 frames. ], tot_loss[loss=0.3815, simple_loss=0.4074, pruned_loss=0.1778, over 4264026.22 frames. ], batch size: 230, lr: 3.84e-02, grad_scale: 32.0 2023-06-17 20:44:21,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=42360.0, ans=10.0 2023-06-17 20:44:47,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=42420.0, ans=10.0 2023-06-17 20:44:57,442 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 20:45:08,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=42480.0, ans=0.0 2023-06-17 20:45:38,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=42540.0, ans=0.125 2023-06-17 20:45:41,232 INFO [train.py:996] (2/4) Epoch 1, batch 7100, loss[loss=0.4238, simple_loss=0.4464, pruned_loss=0.2005, over 21669.00 frames. ], tot_loss[loss=0.3885, simple_loss=0.4162, pruned_loss=0.1804, over 4259896.41 frames. ], batch size: 441, lr: 3.83e-02, grad_scale: 16.0 2023-06-17 20:46:06,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=42660.0, ans=0.025 2023-06-17 20:46:23,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.465e+02 3.499e+02 4.765e+02 6.343e+02 1.936e+03, threshold=9.530e+02, percent-clipped=5.0 2023-06-17 20:46:37,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=42720.0, ans=0.1 2023-06-17 20:46:55,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=42780.0, ans=0.125 2023-06-17 20:47:00,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=42780.0, ans=0.2 2023-06-17 20:47:04,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=42780.0, ans=0.0 2023-06-17 20:47:23,730 INFO [train.py:996] (2/4) Epoch 1, batch 7150, loss[loss=0.5083, simple_loss=0.4992, pruned_loss=0.2587, over 21418.00 frames. ], tot_loss[loss=0.3836, simple_loss=0.414, pruned_loss=0.1766, over 4252046.85 frames. 
], batch size: 471, lr: 3.83e-02, grad_scale: 16.0 2023-06-17 20:48:09,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=42960.0, ans=10.0 2023-06-17 20:48:34,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=43080.0, ans=0.0 2023-06-17 20:49:07,080 INFO [train.py:996] (2/4) Epoch 1, batch 7200, loss[loss=0.4349, simple_loss=0.4499, pruned_loss=0.21, over 21381.00 frames. ], tot_loss[loss=0.3869, simple_loss=0.4151, pruned_loss=0.1793, over 4257125.38 frames. ], batch size: 549, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 20:49:48,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.531e+02 4.300e+02 5.251e+02 6.410e+02 9.416e+02, threshold=1.050e+03, percent-clipped=0.0 2023-06-17 20:50:16,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=43380.0, ans=0.0014391304347826095 2023-06-17 20:50:16,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=43380.0, ans=0.0 2023-06-17 20:50:37,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=43440.0, ans=0.125 2023-06-17 20:50:49,425 INFO [train.py:996] (2/4) Epoch 1, batch 7250, loss[loss=0.3372, simple_loss=0.3554, pruned_loss=0.1595, over 21616.00 frames. ], tot_loss[loss=0.3833, simple_loss=0.409, pruned_loss=0.1788, over 4266009.40 frames. ], batch size: 282, lr: 3.82e-02, grad_scale: 32.0 2023-06-17 20:51:20,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=43560.0, ans=0.0 2023-06-17 20:51:28,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.10 vs. limit=22.5 2023-06-17 20:51:57,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=43680.0, ans=0.05 2023-06-17 20:52:07,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-17 20:52:31,524 INFO [train.py:996] (2/4) Epoch 1, batch 7300, loss[loss=0.3121, simple_loss=0.3379, pruned_loss=0.1431, over 21675.00 frames. ], tot_loss[loss=0.3772, simple_loss=0.4013, pruned_loss=0.1766, over 4253806.89 frames. ], batch size: 333, lr: 3.81e-02, grad_scale: 32.0 2023-06-17 20:52:58,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=43860.0, ans=0.2 2023-06-17 20:53:07,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.808e+02 5.144e+02 6.713e+02 1.157e+03, threshold=1.029e+03, percent-clipped=4.0 2023-06-17 20:53:33,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=43920.0, ans=0.125 2023-06-17 20:54:24,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=44100.0, ans=0.04949747468305833 2023-06-17 20:54:25,394 INFO [train.py:996] (2/4) Epoch 1, batch 7350, loss[loss=0.5804, simple_loss=0.5343, pruned_loss=0.3133, over 21336.00 frames. ], tot_loss[loss=0.3776, simple_loss=0.3995, pruned_loss=0.1778, over 4250178.88 frames. 
], batch size: 507, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 20:54:54,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=44160.0, ans=0.0 2023-06-17 20:55:19,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=44280.0, ans=0.0 2023-06-17 20:56:05,316 INFO [train.py:996] (2/4) Epoch 1, batch 7400, loss[loss=0.3941, simple_loss=0.3959, pruned_loss=0.1961, over 21218.00 frames. ], tot_loss[loss=0.3868, simple_loss=0.4081, pruned_loss=0.1828, over 4256730.24 frames. ], batch size: 176, lr: 3.80e-02, grad_scale: 32.0 2023-06-17 20:56:32,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=44460.0, ans=0.125 2023-06-17 20:56:36,855 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.553e+02 4.273e+02 5.813e+02 7.639e+02 1.411e+03, threshold=1.163e+03, percent-clipped=7.0 2023-06-17 20:56:52,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=44520.0, ans=0.125 2023-06-17 20:57:13,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=44580.0, ans=0.125 2023-06-17 20:57:42,574 INFO [train.py:996] (2/4) Epoch 1, batch 7450, loss[loss=0.3559, simple_loss=0.3723, pruned_loss=0.1697, over 21323.00 frames. ], tot_loss[loss=0.3854, simple_loss=0.4066, pruned_loss=0.182, over 4251201.77 frames. ], batch size: 131, lr: 3.79e-02, grad_scale: 32.0 2023-06-17 20:57:44,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=44700.0, ans=0.0 2023-06-17 20:58:03,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=44760.0, ans=0.0011391304347826078 2023-06-17 20:58:08,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=44760.0, ans=0.2 2023-06-17 20:58:27,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=44820.0, ans=0.2 2023-06-17 20:58:37,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=44880.0, ans=0.125 2023-06-17 20:59:34,489 INFO [train.py:996] (2/4) Epoch 1, batch 7500, loss[loss=0.3675, simple_loss=0.4353, pruned_loss=0.1499, over 21703.00 frames. ], tot_loss[loss=0.3877, simple_loss=0.4106, pruned_loss=0.1824, over 4256730.54 frames. ], batch size: 298, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 21:00:01,014 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.011e+02 4.419e+02 5.234e+02 7.057e+02 1.215e+03, threshold=1.047e+03, percent-clipped=2.0 2023-06-17 21:00:09,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=45120.0, ans=0.125 2023-06-17 21:01:07,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=45240.0, ans=0.05 2023-06-17 21:01:10,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=45240.0, ans=0.2 2023-06-17 21:01:13,106 INFO [train.py:996] (2/4) Epoch 1, batch 7550, loss[loss=0.4274, simple_loss=0.4284, pruned_loss=0.2131, over 21119.00 frames. 
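
grad_scale in the batch summaries is presumably the mixed-precision loss-scaling factor; it moves between powers of two (8.0, 16.0 and 32.0 over this stretch), the signature of a dynamic scaler that doubles the scale after a run of overflow-free steps and halves it when an overflow is hit. A minimal sketch of the standard torch.cuda.amp pattern follows; model, optimizer, loss_fn and batch are placeholders, and the actual training script wraps this in its own logic.

import torch

scaler = torch.cuda.amp.GradScaler()   # its current scale is presumably what the log calls grad_scale

def train_step(model, optimizer, loss_fn, batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():     # forward pass in reduced precision
        loss = loss_fn(model(batch))
    scaler.scale(loss).backward()       # scale the loss so fp16 grads do not underflow
    scaler.step(optimizer)              # unscales grads; skips the step on inf/nan
    scaler.update()                     # grows the scale periodically, shrinks it on overflow
    return loss.detach()
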
], tot_loss[loss=0.3868, simple_loss=0.4164, pruned_loss=0.1786, over 4265314.55 frames. ], batch size: 608, lr: 3.78e-02, grad_scale: 32.0 2023-06-17 21:01:18,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=45300.0, ans=0.0 2023-06-17 21:01:43,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=45360.0, ans=0.2 2023-06-17 21:01:56,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=45420.0, ans=0.125 2023-06-17 21:02:53,635 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:02:56,475 INFO [train.py:996] (2/4) Epoch 1, batch 7600, loss[loss=0.4094, simple_loss=0.4208, pruned_loss=0.199, over 21787.00 frames. ], tot_loss[loss=0.3831, simple_loss=0.4147, pruned_loss=0.1757, over 4269717.27 frames. ], batch size: 247, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 21:03:07,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=19.91 vs. limit=15.0 2023-06-17 21:03:13,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-17 21:03:14,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=45660.0, ans=0.125 2023-06-17 21:03:22,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.546e+02 4.998e+02 6.623e+02 1.459e+03, threshold=9.996e+02, percent-clipped=5.0 2023-06-17 21:04:37,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.97 vs. limit=15.0 2023-06-17 21:04:38,393 INFO [train.py:996] (2/4) Epoch 1, batch 7650, loss[loss=0.384, simple_loss=0.396, pruned_loss=0.186, over 21604.00 frames. ], tot_loss[loss=0.3863, simple_loss=0.4142, pruned_loss=0.1792, over 4275861.45 frames. ], batch size: 212, lr: 3.77e-02, grad_scale: 32.0 2023-06-17 21:04:55,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=45960.0, ans=0.0 2023-06-17 21:04:58,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=45960.0, ans=0.125 2023-06-17 21:05:05,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=45960.0, ans=0.0 2023-06-17 21:05:20,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=46020.0, ans=0.125 2023-06-17 21:06:24,111 INFO [train.py:996] (2/4) Epoch 1, batch 7700, loss[loss=0.3879, simple_loss=0.423, pruned_loss=0.1764, over 21555.00 frames. ], tot_loss[loss=0.3935, simple_loss=0.4193, pruned_loss=0.1838, over 4279332.00 frames. 
], batch size: 230, lr: 3.76e-02, grad_scale: 32.0 2023-06-17 21:06:49,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=46260.0, ans=0.125 2023-06-17 21:06:50,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.909e+02 4.180e+02 5.539e+02 6.663e+02 1.200e+03, threshold=1.108e+03, percent-clipped=4.0 2023-06-17 21:06:53,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=46260.0, ans=0.09899494936611666 2023-06-17 21:08:00,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=46440.0, ans=0.07 2023-06-17 21:08:09,043 INFO [train.py:996] (2/4) Epoch 1, batch 7750, loss[loss=0.48, simple_loss=0.5223, pruned_loss=0.2188, over 21768.00 frames. ], tot_loss[loss=0.4, simple_loss=0.4271, pruned_loss=0.1864, over 4273540.09 frames. ], batch size: 332, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 21:09:30,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=46680.0, ans=0.125 2023-06-17 21:09:50,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-17 21:09:53,170 INFO [train.py:996] (2/4) Epoch 1, batch 7800, loss[loss=0.3678, simple_loss=0.3994, pruned_loss=0.1681, over 21783.00 frames. ], tot_loss[loss=0.4015, simple_loss=0.4298, pruned_loss=0.1866, over 4273567.53 frames. ], batch size: 282, lr: 3.75e-02, grad_scale: 32.0 2023-06-17 21:10:03,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-17 21:10:21,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=46860.0, ans=0.0 2023-06-17 21:10:29,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.732e+02 4.420e+02 5.608e+02 7.244e+02 1.529e+03, threshold=1.122e+03, percent-clipped=4.0 2023-06-17 21:11:01,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=46920.0, ans=0.0 2023-06-17 21:11:34,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=47100.0, ans=0.125 2023-06-17 21:11:34,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=47100.0, ans=0.125 2023-06-17 21:11:35,304 INFO [train.py:996] (2/4) Epoch 1, batch 7850, loss[loss=0.3407, simple_loss=0.3668, pruned_loss=0.1573, over 21532.00 frames. ], tot_loss[loss=0.3954, simple_loss=0.4217, pruned_loss=0.1845, over 4263027.21 frames. ], batch size: 230, lr: 3.74e-02, grad_scale: 32.0 2023-06-17 21:11:39,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=47100.0, ans=0.125 2023-06-17 21:11:39,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.57 vs. limit=6.0 2023-06-17 21:11:49,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.96 vs. 
limit=15.0 2023-06-17 21:11:52,168 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 21:13:18,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=47400.0, ans=0.1 2023-06-17 21:13:19,400 INFO [train.py:996] (2/4) Epoch 1, batch 7900, loss[loss=0.3472, simple_loss=0.3641, pruned_loss=0.1651, over 21114.00 frames. ], tot_loss[loss=0.3915, simple_loss=0.417, pruned_loss=0.183, over 4266158.29 frames. ], batch size: 143, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 21:13:56,674 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.667e+02 5.315e+02 6.476e+02 7.892e+02 1.492e+03, threshold=1.295e+03, percent-clipped=7.0 2023-06-17 21:14:16,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=47520.0, ans=0.125 2023-06-17 21:14:30,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=47580.0, ans=0.2 2023-06-17 21:14:58,006 INFO [train.py:996] (2/4) Epoch 1, batch 7950, loss[loss=0.4195, simple_loss=0.4569, pruned_loss=0.191, over 21321.00 frames. ], tot_loss[loss=0.3907, simple_loss=0.4203, pruned_loss=0.1806, over 4268372.98 frames. ], batch size: 548, lr: 3.73e-02, grad_scale: 32.0 2023-06-17 21:16:06,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=47820.0, ans=0.125 2023-06-17 21:16:50,558 INFO [train.py:996] (2/4) Epoch 1, batch 8000, loss[loss=0.3942, simple_loss=0.4451, pruned_loss=0.1717, over 21869.00 frames. ], tot_loss[loss=0.3998, simple_loss=0.4268, pruned_loss=0.1864, over 4268126.90 frames. ], batch size: 372, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 21:17:28,901 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.757e+02 4.328e+02 5.465e+02 6.460e+02 1.072e+03, threshold=1.093e+03, percent-clipped=0.0 2023-06-17 21:17:48,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=48120.0, ans=0.125 2023-06-17 21:18:09,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48180.0, ans=0.1 2023-06-17 21:18:14,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=48240.0, ans=0.125 2023-06-17 21:18:43,558 INFO [train.py:996] (2/4) Epoch 1, batch 8050, loss[loss=0.4106, simple_loss=0.4759, pruned_loss=0.1726, over 21240.00 frames. ], tot_loss[loss=0.4021, simple_loss=0.4302, pruned_loss=0.187, over 4264126.83 frames. ], batch size: 548, lr: 3.72e-02, grad_scale: 32.0 2023-06-17 21:18:45,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=48300.0, ans=0.125 2023-06-17 21:19:55,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-17 21:20:09,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=48540.0, ans=0.0003173913043478249 2023-06-17 21:20:27,382 INFO [train.py:996] (2/4) Epoch 1, batch 8100, loss[loss=0.3815, simple_loss=0.4019, pruned_loss=0.1805, over 21830.00 frames. 
], tot_loss[loss=0.3989, simple_loss=0.4271, pruned_loss=0.1853, over 4264919.37 frames. ], batch size: 282, lr: 3.71e-02, grad_scale: 32.0 2023-06-17 21:20:36,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=48600.0, ans=0.125 2023-06-17 21:20:54,436 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 4.589e+02 6.477e+02 8.462e+02 1.426e+03, threshold=1.295e+03, percent-clipped=5.0 2023-06-17 21:20:55,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=48660.0, ans=0.125 2023-06-17 21:21:08,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=48720.0, ans=0.125 2023-06-17 21:21:20,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.22 vs. limit=15.0 2023-06-17 21:21:27,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=48720.0, ans=15.0 2023-06-17 21:22:12,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=48900.0, ans=0.0 2023-06-17 21:22:14,038 INFO [train.py:996] (2/4) Epoch 1, batch 8150, loss[loss=0.338, simple_loss=0.4235, pruned_loss=0.1262, over 21713.00 frames. ], tot_loss[loss=0.4017, simple_loss=0.434, pruned_loss=0.1847, over 4274683.50 frames. ], batch size: 351, lr: 3.70e-02, grad_scale: 16.0 2023-06-17 21:22:16,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=48900.0, ans=0.125 2023-06-17 21:22:26,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=48900.0, ans=0.1 2023-06-17 21:23:39,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=49080.0, ans=0.125 2023-06-17 21:23:43,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=24.55 vs. limit=15.0 2023-06-17 21:23:47,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-17 21:23:58,529 INFO [train.py:996] (2/4) Epoch 1, batch 8200, loss[loss=0.399, simple_loss=0.3981, pruned_loss=0.2, over 21466.00 frames. ], tot_loss[loss=0.394, simple_loss=0.4258, pruned_loss=0.1811, over 4262383.47 frames. ], batch size: 441, lr: 3.70e-02, grad_scale: 16.0 2023-06-17 21:24:29,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=49260.0, ans=0.2 2023-06-17 21:24:32,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=49260.0, ans=0.125 2023-06-17 21:24:36,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.823e+02 4.911e+02 6.054e+02 7.943e+02 1.649e+03, threshold=1.211e+03, percent-clipped=3.0 2023-06-17 21:25:42,211 INFO [train.py:996] (2/4) Epoch 1, batch 8250, loss[loss=0.3826, simple_loss=0.4336, pruned_loss=0.1659, over 21826.00 frames. 
], tot_loss[loss=0.3934, simple_loss=0.4244, pruned_loss=0.1812, over 4264210.38 frames. ], batch size: 371, lr: 3.69e-02, grad_scale: 16.0 2023-06-17 21:26:50,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.57 vs. limit=15.0 2023-06-17 21:26:55,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=49680.0, ans=0.0 2023-06-17 21:27:25,188 INFO [train.py:996] (2/4) Epoch 1, batch 8300, loss[loss=0.2926, simple_loss=0.3424, pruned_loss=0.1214, over 21791.00 frames. ], tot_loss[loss=0.3873, simple_loss=0.4202, pruned_loss=0.1772, over 4262477.06 frames. ], batch size: 124, lr: 3.68e-02, grad_scale: 16.0 2023-06-17 21:28:03,866 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 3.951e+02 4.948e+02 6.196e+02 1.080e+03, threshold=9.896e+02, percent-clipped=0.0 2023-06-17 21:28:21,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=49920.0, ans=0.125 2023-06-17 21:29:09,836 INFO [train.py:996] (2/4) Epoch 1, batch 8350, loss[loss=0.3703, simple_loss=0.3586, pruned_loss=0.191, over 20111.00 frames. ], tot_loss[loss=0.3813, simple_loss=0.416, pruned_loss=0.1733, over 4267999.78 frames. ], batch size: 704, lr: 3.68e-02, grad_scale: 16.0 2023-06-17 21:30:08,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=50220.0, ans=0.2 2023-06-17 21:30:10,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=27.59 vs. limit=22.5 2023-06-17 21:30:35,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=50340.0, ans=0.5 2023-06-17 21:30:53,942 INFO [train.py:996] (2/4) Epoch 1, batch 8400, loss[loss=0.3687, simple_loss=0.4154, pruned_loss=0.161, over 21682.00 frames. ], tot_loss[loss=0.3788, simple_loss=0.4158, pruned_loss=0.1709, over 4273943.98 frames. ], batch size: 441, lr: 3.67e-02, grad_scale: 32.0 2023-06-17 21:31:01,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=50400.0, ans=0.125 2023-06-17 21:31:05,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.91 vs. limit=22.5 2023-06-17 21:31:32,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.584e+02 3.530e+02 4.836e+02 6.875e+02 1.901e+03, threshold=9.672e+02, percent-clipped=8.0 2023-06-17 21:32:14,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=50640.0, ans=0.025 2023-06-17 21:32:41,511 INFO [train.py:996] (2/4) Epoch 1, batch 8450, loss[loss=0.4074, simple_loss=0.4201, pruned_loss=0.1973, over 21771.00 frames. ], tot_loss[loss=0.3822, simple_loss=0.4168, pruned_loss=0.1738, over 4283109.85 frames. ], batch size: 298, lr: 3.67e-02, grad_scale: 16.0 2023-06-17 21:33:53,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.55 vs. 
limit=15.0 2023-06-17 21:33:57,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-17 21:33:58,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.60 vs. limit=22.5 2023-06-17 21:34:13,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.37 vs. limit=15.0 2023-06-17 21:34:13,646 INFO [train.py:996] (2/4) Epoch 1, batch 8500, loss[loss=0.4588, simple_loss=0.4621, pruned_loss=0.2277, over 21509.00 frames. ], tot_loss[loss=0.381, simple_loss=0.4121, pruned_loss=0.175, over 4279204.86 frames. ], batch size: 389, lr: 3.66e-02, grad_scale: 16.0 2023-06-17 21:34:55,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=51060.0, ans=0.1 2023-06-17 21:35:00,164 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.023e+02 4.457e+02 5.550e+02 7.009e+02 1.801e+03, threshold=1.110e+03, percent-clipped=10.0 2023-06-17 21:35:21,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=51120.0, ans=10.0 2023-06-17 21:35:31,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51180.0, ans=0.1 2023-06-17 21:35:33,318 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-17 21:35:59,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=51240.0, ans=0.125 2023-06-17 21:36:04,108 INFO [train.py:996] (2/4) Epoch 1, batch 8550, loss[loss=0.3442, simple_loss=0.3926, pruned_loss=0.1479, over 21427.00 frames. ], tot_loss[loss=0.388, simple_loss=0.4179, pruned_loss=0.1791, over 4271211.74 frames. ], batch size: 211, lr: 3.65e-02, grad_scale: 16.0 2023-06-17 21:37:07,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=51420.0, ans=0.125 2023-06-17 21:37:28,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-17 21:37:56,926 INFO [train.py:996] (2/4) Epoch 1, batch 8600, loss[loss=0.4417, simple_loss=0.4667, pruned_loss=0.2083, over 21882.00 frames. ], tot_loss[loss=0.3922, simple_loss=0.4241, pruned_loss=0.1802, over 4272964.96 frames. ], batch size: 371, lr: 3.65e-02, grad_scale: 16.0 2023-06-17 21:38:38,370 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.423e+02 4.881e+02 5.851e+02 7.697e+02 1.206e+03, threshold=1.170e+03, percent-clipped=2.0 2023-06-17 21:38:58,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=51780.0, ans=10.0 2023-06-17 21:39:30,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=51840.0, ans=0.1 2023-06-17 21:39:47,597 INFO [train.py:996] (2/4) Epoch 1, batch 8650, loss[loss=0.2826, simple_loss=0.3516, pruned_loss=0.1069, over 21714.00 frames. 
], tot_loss[loss=0.3953, simple_loss=0.4293, pruned_loss=0.1807, over 4272702.14 frames. ], batch size: 282, lr: 3.64e-02, grad_scale: 16.0 2023-06-17 21:40:30,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=52020.0, ans=0.125 2023-06-17 21:40:30,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=52020.0, ans=0.02 2023-06-17 21:40:50,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=52080.0, ans=0.125 2023-06-17 21:40:51,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.69 vs. limit=22.5 2023-06-17 21:41:15,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-17 21:41:20,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=52140.0, ans=0.125 2023-06-17 21:41:24,331 INFO [train.py:996] (2/4) Epoch 1, batch 8700, loss[loss=0.3258, simple_loss=0.3659, pruned_loss=0.1428, over 21630.00 frames. ], tot_loss[loss=0.3818, simple_loss=0.417, pruned_loss=0.1733, over 4272757.88 frames. ], batch size: 263, lr: 3.64e-02, grad_scale: 16.0 2023-06-17 21:41:32,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=52200.0, ans=0.0 2023-06-17 21:42:04,521 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.951e+02 4.948e+02 6.720e+02 1.137e+03, threshold=9.897e+02, percent-clipped=0.0 2023-06-17 21:42:32,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=52380.0, ans=0.0 2023-06-17 21:42:53,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=52440.0, ans=0.0 2023-06-17 21:42:57,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=52440.0, ans=0.0 2023-06-17 21:43:13,274 INFO [train.py:996] (2/4) Epoch 1, batch 8750, loss[loss=0.5016, simple_loss=0.5509, pruned_loss=0.2262, over 20870.00 frames. ], tot_loss[loss=0.3824, simple_loss=0.4133, pruned_loss=0.1758, over 4278634.18 frames. ], batch size: 607, lr: 3.63e-02, grad_scale: 16.0 2023-06-17 21:44:06,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-17 21:44:37,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=52740.0, ans=0.0 2023-06-17 21:45:03,175 INFO [train.py:996] (2/4) Epoch 1, batch 8800, loss[loss=0.5555, simple_loss=0.5673, pruned_loss=0.2718, over 21483.00 frames. ], tot_loss[loss=0.3934, simple_loss=0.4239, pruned_loss=0.1815, over 4282530.95 frames. 
], batch size: 471, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 21:45:32,802 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 5.221e+02 6.385e+02 9.121e+02 2.025e+03, threshold=1.277e+03, percent-clipped=20.0 2023-06-17 21:46:17,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=52980.0, ans=0.95 2023-06-17 21:46:49,111 INFO [train.py:996] (2/4) Epoch 1, batch 8850, loss[loss=0.3807, simple_loss=0.4139, pruned_loss=0.1737, over 21446.00 frames. ], tot_loss[loss=0.4008, simple_loss=0.4327, pruned_loss=0.1844, over 4285690.82 frames. ], batch size: 211, lr: 3.62e-02, grad_scale: 32.0 2023-06-17 21:47:00,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=53100.0, ans=0.125 2023-06-17 21:47:24,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=53220.0, ans=0.125 2023-06-17 21:48:07,150 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=22.5 2023-06-17 21:48:33,217 INFO [train.py:996] (2/4) Epoch 1, batch 8900, loss[loss=0.325, simple_loss=0.3577, pruned_loss=0.1462, over 21201.00 frames. ], tot_loss[loss=0.3962, simple_loss=0.4274, pruned_loss=0.1826, over 4286169.56 frames. ], batch size: 176, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 21:48:44,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=53400.0, ans=0.125 2023-06-17 21:49:10,057 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.740e+02 3.829e+02 5.153e+02 6.429e+02 1.062e+03, threshold=1.031e+03, percent-clipped=0.0 2023-06-17 21:49:29,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-17 21:49:30,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=53520.0, ans=0.1 2023-06-17 21:49:54,887 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=15.0 2023-06-17 21:50:19,534 INFO [train.py:996] (2/4) Epoch 1, batch 8950, loss[loss=0.33, simple_loss=0.3546, pruned_loss=0.1527, over 21332.00 frames. ], tot_loss[loss=0.3948, simple_loss=0.4279, pruned_loss=0.1809, over 4277957.27 frames. ], batch size: 131, lr: 3.61e-02, grad_scale: 32.0 2023-06-17 21:51:12,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=53820.0, ans=0.125 2023-06-17 21:51:29,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=53880.0, ans=0.2 2023-06-17 21:51:38,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=53880.0, ans=0.0 2023-06-17 21:52:04,327 INFO [train.py:996] (2/4) Epoch 1, batch 9000, loss[loss=0.3853, simple_loss=0.4131, pruned_loss=0.1787, over 21708.00 frames. ], tot_loss[loss=0.3875, simple_loss=0.4176, pruned_loss=0.1787, over 4279596.45 frames. 
], batch size: 282, lr: 3.60e-02, grad_scale: 32.0 2023-06-17 21:52:04,327 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-17 21:52:23,533 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3404, simple_loss=0.4251, pruned_loss=0.1278, over 1796401.00 frames. 2023-06-17 21:52:23,534 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-17 21:52:39,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=54060.0, ans=10.0 2023-06-17 21:53:06,934 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 4.196e+02 5.638e+02 6.877e+02 1.385e+03, threshold=1.128e+03, percent-clipped=3.0 2023-06-17 21:53:50,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=54240.0, ans=0.0 2023-06-17 21:54:04,860 INFO [train.py:996] (2/4) Epoch 1, batch 9050, loss[loss=0.4211, simple_loss=0.4457, pruned_loss=0.1983, over 21607.00 frames. ], tot_loss[loss=0.3811, simple_loss=0.4154, pruned_loss=0.1734, over 4274885.23 frames. ], batch size: 389, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 21:54:14,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=54300.0, ans=0.0 2023-06-17 21:54:59,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54420.0, ans=0.1 2023-06-17 21:55:02,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=54420.0, ans=0.125 2023-06-17 21:55:39,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=54540.0, ans=0.125 2023-06-17 21:55:41,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=54540.0, ans=0.1 2023-06-17 21:55:45,378 INFO [train.py:996] (2/4) Epoch 1, batch 9100, loss[loss=0.3269, simple_loss=0.3968, pruned_loss=0.1285, over 21783.00 frames. ], tot_loss[loss=0.3899, simple_loss=0.4235, pruned_loss=0.1782, over 4266226.17 frames. ], batch size: 282, lr: 3.59e-02, grad_scale: 32.0 2023-06-17 21:56:31,625 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.753e+02 4.940e+02 6.814e+02 2.174e+03, threshold=9.881e+02, percent-clipped=7.0 2023-06-17 21:56:33,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=54720.0, ans=0.125 2023-06-17 21:56:38,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=54720.0, ans=0.2 2023-06-17 21:56:58,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=54780.0, ans=0.125 2023-06-17 21:57:35,314 INFO [train.py:996] (2/4) Epoch 1, batch 9150, loss[loss=0.511, simple_loss=0.5269, pruned_loss=0.2476, over 21545.00 frames. ], tot_loss[loss=0.3848, simple_loss=0.4236, pruned_loss=0.173, over 4269835.37 frames. 
], batch size: 508, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 21:57:39,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=54900.0, ans=0.125 2023-06-17 21:57:56,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=54900.0, ans=0.0 2023-06-17 21:58:02,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.76 vs. limit=15.0 2023-06-17 21:58:09,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=54960.0, ans=0.0 2023-06-17 21:58:09,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=54960.0, ans=10.0 2023-06-17 21:58:19,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=55020.0, ans=0.0 2023-06-17 21:58:21,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=55020.0, ans=0.125 2023-06-17 21:58:43,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.62 vs. limit=15.0 2023-06-17 21:59:29,981 INFO [train.py:996] (2/4) Epoch 1, batch 9200, loss[loss=0.376, simple_loss=0.4169, pruned_loss=0.1676, over 21390.00 frames. ], tot_loss[loss=0.3806, simple_loss=0.4218, pruned_loss=0.1697, over 4272378.07 frames. ], batch size: 194, lr: 3.58e-02, grad_scale: 32.0 2023-06-17 21:59:59,132 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.127e+02 5.458e+02 7.694e+02 1.391e+03, threshold=1.092e+03, percent-clipped=9.0 2023-06-17 22:00:21,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. limit=10.0 2023-06-17 22:01:13,167 INFO [train.py:996] (2/4) Epoch 1, batch 9250, loss[loss=0.4022, simple_loss=0.3961, pruned_loss=0.2041, over 21302.00 frames. ], tot_loss[loss=0.3914, simple_loss=0.4276, pruned_loss=0.1776, over 4273841.66 frames. ], batch size: 507, lr: 3.57e-02, grad_scale: 32.0 2023-06-17 22:01:25,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=55500.0, ans=0.125 2023-06-17 22:01:57,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=55620.0, ans=0.1 2023-06-17 22:02:58,750 INFO [train.py:996] (2/4) Epoch 1, batch 9300, loss[loss=0.3645, simple_loss=0.4055, pruned_loss=0.1618, over 21719.00 frames. ], tot_loss[loss=0.3896, simple_loss=0.423, pruned_loss=0.1781, over 4270189.87 frames. ], batch size: 282, lr: 3.57e-02, grad_scale: 32.0 2023-06-17 22:03:28,672 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 4.090e+02 5.143e+02 6.278e+02 1.452e+03, threshold=1.029e+03, percent-clipped=2.0 2023-06-17 22:04:44,726 INFO [train.py:996] (2/4) Epoch 1, batch 9350, loss[loss=0.4046, simple_loss=0.44, pruned_loss=0.1846, over 21863.00 frames. ], tot_loss[loss=0.3955, simple_loss=0.431, pruned_loss=0.18, over 4274015.65 frames. 
], batch size: 118, lr: 3.56e-02, grad_scale: 32.0 2023-06-17 22:05:59,598 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:06:29,238 INFO [train.py:996] (2/4) Epoch 1, batch 9400, loss[loss=0.4076, simple_loss=0.422, pruned_loss=0.1966, over 21498.00 frames. ], tot_loss[loss=0.3988, simple_loss=0.434, pruned_loss=0.1818, over 4276256.70 frames. ], batch size: 389, lr: 3.55e-02, grad_scale: 32.0 2023-06-17 22:06:36,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=56400.0, ans=0.07 2023-06-17 22:07:03,740 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.872e+02 4.762e+02 5.781e+02 7.006e+02 1.289e+03, threshold=1.156e+03, percent-clipped=1.0 2023-06-17 22:07:46,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=56580.0, ans=0.125 2023-06-17 22:07:59,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=56640.0, ans=0.2 2023-06-17 22:08:07,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=56640.0, ans=0.125 2023-06-17 22:08:11,267 INFO [train.py:996] (2/4) Epoch 1, batch 9450, loss[loss=0.3636, simple_loss=0.3867, pruned_loss=0.1702, over 21988.00 frames. ], tot_loss[loss=0.394, simple_loss=0.4258, pruned_loss=0.1811, over 4271659.39 frames. ], batch size: 103, lr: 3.55e-02, grad_scale: 16.0 2023-06-17 22:08:14,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=56700.0, ans=0.0 2023-06-17 22:08:58,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=56820.0, ans=0.125 2023-06-17 22:09:27,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=56880.0, ans=0.125 2023-06-17 22:09:42,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=56940.0, ans=0.0 2023-06-17 22:09:46,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=56940.0, ans=0.1 2023-06-17 22:09:52,854 INFO [train.py:996] (2/4) Epoch 1, batch 9500, loss[loss=0.3545, simple_loss=0.3935, pruned_loss=0.1577, over 21417.00 frames. ], tot_loss[loss=0.3843, simple_loss=0.4149, pruned_loss=0.1769, over 4268050.36 frames. ], batch size: 194, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 22:10:01,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=57000.0, ans=0.125 2023-06-17 22:10:09,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=57060.0, ans=0.0 2023-06-17 22:10:10,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57060.0, ans=0.1 2023-06-17 22:10:35,940 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.522e+02 3.981e+02 4.935e+02 6.509e+02 1.656e+03, threshold=9.871e+02, percent-clipped=4.0 2023-06-17 22:10:55,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.06 vs. 
limit=22.5 2023-06-17 22:11:31,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=57240.0, ans=0.125 2023-06-17 22:11:37,278 INFO [train.py:996] (2/4) Epoch 1, batch 9550, loss[loss=0.4029, simple_loss=0.4602, pruned_loss=0.1727, over 21770.00 frames. ], tot_loss[loss=0.3907, simple_loss=0.4198, pruned_loss=0.1808, over 4265898.42 frames. ], batch size: 332, lr: 3.54e-02, grad_scale: 16.0 2023-06-17 22:13:22,034 INFO [train.py:996] (2/4) Epoch 1, batch 9600, loss[loss=0.3811, simple_loss=0.4038, pruned_loss=0.1792, over 21533.00 frames. ], tot_loss[loss=0.3945, simple_loss=0.4235, pruned_loss=0.1827, over 4269216.76 frames. ], batch size: 548, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 22:13:34,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=57600.0, ans=0.1 2023-06-17 22:13:35,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=57600.0, ans=0.0 2023-06-17 22:13:43,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=57660.0, ans=0.125 2023-06-17 22:14:08,854 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 4.156e+02 5.294e+02 7.045e+02 1.358e+03, threshold=1.059e+03, percent-clipped=6.0 2023-06-17 22:14:46,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=57780.0, ans=0.09899494936611666 2023-06-17 22:15:06,112 INFO [train.py:996] (2/4) Epoch 1, batch 9650, loss[loss=0.3831, simple_loss=0.4095, pruned_loss=0.1783, over 21942.00 frames. ], tot_loss[loss=0.3894, simple_loss=0.419, pruned_loss=0.1799, over 4277107.11 frames. ], batch size: 316, lr: 3.53e-02, grad_scale: 32.0 2023-06-17 22:15:16,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=57900.0, ans=0.0 2023-06-17 22:16:08,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-17 22:16:27,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=58080.0, ans=0.125 2023-06-17 22:16:32,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=58140.0, ans=0.125 2023-06-17 22:16:48,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=58200.0, ans=0.0 2023-06-17 22:16:49,793 INFO [train.py:996] (2/4) Epoch 1, batch 9700, loss[loss=0.3529, simple_loss=0.3757, pruned_loss=0.1651, over 20215.00 frames. ], tot_loss[loss=0.3919, simple_loss=0.4226, pruned_loss=0.1806, over 4276558.96 frames. ], batch size: 703, lr: 3.52e-02, grad_scale: 32.0 2023-06-17 22:17:03,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=58200.0, ans=0.125 2023-06-17 22:17:37,777 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.787e+02 4.137e+02 5.402e+02 6.942e+02 1.239e+03, threshold=1.080e+03, percent-clipped=2.0 2023-06-17 22:18:02,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=7.41 vs. 
limit=6.0 2023-06-17 22:18:16,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=58440.0, ans=0.0 2023-06-17 22:18:33,485 INFO [train.py:996] (2/4) Epoch 1, batch 9750, loss[loss=0.364, simple_loss=0.3871, pruned_loss=0.1705, over 21238.00 frames. ], tot_loss[loss=0.3842, simple_loss=0.4133, pruned_loss=0.1775, over 4264051.90 frames. ], batch size: 159, lr: 3.51e-02, grad_scale: 32.0 2023-06-17 22:19:00,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=58560.0, ans=0.125 2023-06-17 22:19:49,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=58680.0, ans=0.1 2023-06-17 22:20:15,537 INFO [train.py:996] (2/4) Epoch 1, batch 9800, loss[loss=0.3764, simple_loss=0.3984, pruned_loss=0.1772, over 21887.00 frames. ], tot_loss[loss=0.3856, simple_loss=0.4148, pruned_loss=0.1782, over 4257592.33 frames. ], batch size: 107, lr: 3.51e-02, grad_scale: 16.0 2023-06-17 22:20:46,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=58860.0, ans=0.0 2023-06-17 22:20:56,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-17 22:21:02,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=58920.0, ans=0.125 2023-06-17 22:21:03,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.967e+02 4.740e+02 5.847e+02 8.148e+02 2.070e+03, threshold=1.169e+03, percent-clipped=10.0 2023-06-17 22:21:41,315 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:21:52,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=59040.0, ans=0.2 2023-06-17 22:21:56,347 INFO [train.py:996] (2/4) Epoch 1, batch 9850, loss[loss=0.3805, simple_loss=0.4046, pruned_loss=0.1782, over 21914.00 frames. ], tot_loss[loss=0.386, simple_loss=0.415, pruned_loss=0.1785, over 4249885.67 frames. ], batch size: 316, lr: 3.50e-02, grad_scale: 16.0 2023-06-17 22:22:18,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=59160.0, ans=0.125 2023-06-17 22:23:00,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=59280.0, ans=0.2 2023-06-17 22:23:12,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=59280.0, ans=0.04949747468305833 2023-06-17 22:23:31,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=59340.0, ans=0.125 2023-06-17 22:23:33,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=59340.0, ans=0.1 2023-06-17 22:23:39,154 INFO [train.py:996] (2/4) Epoch 1, batch 9900, loss[loss=0.3655, simple_loss=0.3713, pruned_loss=0.1798, over 21339.00 frames. ], tot_loss[loss=0.381, simple_loss=0.4086, pruned_loss=0.1767, over 4248691.45 frames. 
], batch size: 473, lr: 3.50e-02, grad_scale: 16.0 2023-06-17 22:24:08,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-06-17 22:24:28,783 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.114e+02 4.325e+02 5.228e+02 6.725e+02 1.103e+03, threshold=1.046e+03, percent-clipped=0.0 2023-06-17 22:24:35,845 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:24:53,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=59580.0, ans=0.125 2023-06-17 22:25:23,907 INFO [train.py:996] (2/4) Epoch 1, batch 9950, loss[loss=0.4019, simple_loss=0.4109, pruned_loss=0.1965, over 21569.00 frames. ], tot_loss[loss=0.3877, simple_loss=0.4133, pruned_loss=0.1811, over 4258216.33 frames. ], batch size: 415, lr: 3.49e-02, grad_scale: 16.0 2023-06-17 22:26:18,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=23.56 vs. limit=15.0 2023-06-17 22:26:39,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2023-06-17 22:27:12,388 INFO [train.py:996] (2/4) Epoch 1, batch 10000, loss[loss=0.3232, simple_loss=0.3584, pruned_loss=0.144, over 21266.00 frames. ], tot_loss[loss=0.3802, simple_loss=0.4064, pruned_loss=0.177, over 4256407.49 frames. ], batch size: 176, lr: 3.49e-02, grad_scale: 32.0 2023-06-17 22:27:30,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=60000.0, ans=0.0 2023-06-17 22:28:03,646 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.788e+02 4.452e+02 5.196e+02 6.727e+02 1.360e+03, threshold=1.039e+03, percent-clipped=5.0 2023-06-17 22:28:22,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=60180.0, ans=0.2 2023-06-17 22:29:04,517 INFO [train.py:996] (2/4) Epoch 1, batch 10050, loss[loss=0.351, simple_loss=0.3835, pruned_loss=0.1593, over 21703.00 frames. ], tot_loss[loss=0.3796, simple_loss=0.4065, pruned_loss=0.1763, over 4266044.99 frames. ], batch size: 298, lr: 3.48e-02, grad_scale: 32.0 2023-06-17 22:29:08,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=60300.0, ans=0.2 2023-06-17 22:29:20,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=60300.0, ans=0.125 2023-06-17 22:29:20,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. 
limit=10.0 2023-06-17 22:29:33,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=60360.0, ans=0.125 2023-06-17 22:29:56,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=60420.0, ans=0.125 2023-06-17 22:30:38,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=60540.0, ans=0.125 2023-06-17 22:30:41,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=60540.0, ans=0.125 2023-06-17 22:30:41,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=60540.0, ans=0.2 2023-06-17 22:30:54,298 INFO [train.py:996] (2/4) Epoch 1, batch 10100, loss[loss=0.4248, simple_loss=0.449, pruned_loss=0.2003, over 21796.00 frames. ], tot_loss[loss=0.3747, simple_loss=0.4046, pruned_loss=0.1724, over 4267965.43 frames. ], batch size: 332, lr: 3.47e-02, grad_scale: 32.0 2023-06-17 22:30:57,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=60600.0, ans=0.5 2023-06-17 22:31:32,509 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.854e+02 4.188e+02 5.288e+02 6.297e+02 1.348e+03, threshold=1.058e+03, percent-clipped=5.0 2023-06-17 22:31:32,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=60720.0, ans=0.125 2023-06-17 22:31:57,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=60780.0, ans=0.95 2023-06-17 22:32:03,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-17 22:32:20,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=60840.0, ans=0.125 2023-06-17 22:32:37,664 INFO [train.py:996] (2/4) Epoch 1, batch 10150, loss[loss=0.4127, simple_loss=0.4168, pruned_loss=0.2042, over 21831.00 frames. ], tot_loss[loss=0.3816, simple_loss=0.4109, pruned_loss=0.1762, over 4262687.78 frames. ], batch size: 107, lr: 3.47e-02, grad_scale: 32.0 2023-06-17 22:32:46,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=60900.0, ans=0.125 2023-06-17 22:34:17,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=61140.0, ans=0.2 2023-06-17 22:34:22,098 INFO [train.py:996] (2/4) Epoch 1, batch 10200, loss[loss=0.2948, simple_loss=0.353, pruned_loss=0.1183, over 21626.00 frames. ], tot_loss[loss=0.3748, simple_loss=0.4077, pruned_loss=0.1709, over 4266043.40 frames. 
], batch size: 247, lr: 3.46e-02, grad_scale: 32.0 2023-06-17 22:35:01,230 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.111e+02 3.806e+02 4.726e+02 6.535e+02 1.145e+03, threshold=9.453e+02, percent-clipped=1.0 2023-06-17 22:35:17,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=61320.0, ans=0.0 2023-06-17 22:35:34,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-17 22:36:11,364 INFO [train.py:996] (2/4) Epoch 1, batch 10250, loss[loss=0.5159, simple_loss=0.5072, pruned_loss=0.2623, over 21354.00 frames. ], tot_loss[loss=0.3581, simple_loss=0.3979, pruned_loss=0.1592, over 4268490.18 frames. ], batch size: 507, lr: 3.46e-02, grad_scale: 32.0 2023-06-17 22:36:25,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=61500.0, ans=0.0 2023-06-17 22:36:32,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=15.0 2023-06-17 22:36:49,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=61620.0, ans=10.0 2023-06-17 22:36:58,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=61620.0, ans=0.125 2023-06-17 22:37:20,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=61680.0, ans=0.0 2023-06-17 22:37:41,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=61740.0, ans=0.0 2023-06-17 22:37:58,283 INFO [train.py:996] (2/4) Epoch 1, batch 10300, loss[loss=0.4167, simple_loss=0.4679, pruned_loss=0.1828, over 21276.00 frames. ], tot_loss[loss=0.3669, simple_loss=0.4059, pruned_loss=0.1639, over 4268019.19 frames. ], batch size: 549, lr: 3.45e-02, grad_scale: 16.0 2023-06-17 22:38:12,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=61800.0, ans=0.0 2023-06-17 22:38:18,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=61860.0, ans=0.1 2023-06-17 22:38:37,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=61920.0, ans=0.2 2023-06-17 22:38:39,032 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 4.187e+02 5.738e+02 8.381e+02 2.086e+03, threshold=1.148e+03, percent-clipped=17.0 2023-06-17 22:38:46,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=61920.0, ans=0.2 2023-06-17 22:39:34,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=62040.0, ans=0.0 2023-06-17 22:39:43,549 INFO [train.py:996] (2/4) Epoch 1, batch 10350, loss[loss=0.3468, simple_loss=0.388, pruned_loss=0.1528, over 21649.00 frames. ], tot_loss[loss=0.369, simple_loss=0.4077, pruned_loss=0.1651, over 4272558.12 frames. 
], batch size: 351, lr: 3.45e-02, grad_scale: 16.0 2023-06-17 22:40:28,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=62220.0, ans=0.125 2023-06-17 22:41:21,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=12.0 2023-06-17 22:41:29,026 INFO [train.py:996] (2/4) Epoch 1, batch 10400, loss[loss=0.4686, simple_loss=0.4754, pruned_loss=0.2309, over 21441.00 frames. ], tot_loss[loss=0.3564, simple_loss=0.3948, pruned_loss=0.159, over 4270062.00 frames. ], batch size: 507, lr: 3.44e-02, grad_scale: 32.0 2023-06-17 22:42:08,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.30 vs. limit=15.0 2023-06-17 22:42:19,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.646e+02 4.942e+02 6.227e+02 1.303e+03, threshold=9.884e+02, percent-clipped=2.0 2023-06-17 22:42:34,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=62580.0, ans=0.0 2023-06-17 22:42:52,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.05 vs. limit=6.0 2023-06-17 22:42:55,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=62640.0, ans=0.0 2023-06-17 22:43:18,613 INFO [train.py:996] (2/4) Epoch 1, batch 10450, loss[loss=0.399, simple_loss=0.4257, pruned_loss=0.1862, over 21413.00 frames. ], tot_loss[loss=0.3654, simple_loss=0.4014, pruned_loss=0.1647, over 4274890.82 frames. ], batch size: 211, lr: 3.44e-02, grad_scale: 32.0 2023-06-17 22:43:20,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=62700.0, ans=0.125 2023-06-17 22:43:25,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=62700.0, ans=0.05 2023-06-17 22:43:34,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=62760.0, ans=0.0 2023-06-17 22:44:04,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=62820.0, ans=0.0 2023-06-17 22:44:37,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=62880.0, ans=0.1 2023-06-17 22:44:46,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-17 22:45:02,395 INFO [train.py:996] (2/4) Epoch 1, batch 10500, loss[loss=0.3168, simple_loss=0.3417, pruned_loss=0.146, over 21411.00 frames. ], tot_loss[loss=0.3666, simple_loss=0.4033, pruned_loss=0.1649, over 4265090.73 frames. 
], batch size: 194, lr: 3.43e-02, grad_scale: 32.0 2023-06-17 22:45:33,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=63060.0, ans=0.125 2023-06-17 22:45:48,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.889e+02 5.007e+02 6.898e+02 1.631e+03, threshold=1.001e+03, percent-clipped=5.0 2023-06-17 22:46:45,948 INFO [train.py:996] (2/4) Epoch 1, batch 10550, loss[loss=0.3424, simple_loss=0.3685, pruned_loss=0.1582, over 21806.00 frames. ], tot_loss[loss=0.3647, simple_loss=0.3985, pruned_loss=0.1654, over 4264279.03 frames. ], batch size: 352, lr: 3.43e-02, grad_scale: 32.0 2023-06-17 22:46:48,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-17 22:47:16,999 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.766e-03 2023-06-17 22:47:38,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=63420.0, ans=0.05 2023-06-17 22:47:38,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=63420.0, ans=0.1 2023-06-17 22:47:53,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=63480.0, ans=0.0 2023-06-17 22:48:08,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=63540.0, ans=0.0 2023-06-17 22:48:25,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=63540.0, ans=0.125 2023-06-17 22:48:29,745 INFO [train.py:996] (2/4) Epoch 1, batch 10600, loss[loss=0.3268, simple_loss=0.3922, pruned_loss=0.1307, over 21748.00 frames. ], tot_loss[loss=0.3585, simple_loss=0.3919, pruned_loss=0.1626, over 4251407.13 frames. ], batch size: 332, lr: 3.42e-02, grad_scale: 32.0 2023-06-17 22:48:58,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=63660.0, ans=0.5 2023-06-17 22:49:00,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.20 vs. limit=10.0 2023-06-17 22:49:22,785 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.854e+02 4.619e+02 6.310e+02 1.881e+03, threshold=9.238e+02, percent-clipped=9.0 2023-06-17 22:50:02,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-17 22:50:28,337 INFO [train.py:996] (2/4) Epoch 1, batch 10650, loss[loss=0.4077, simple_loss=0.4498, pruned_loss=0.1828, over 19890.00 frames. ], tot_loss[loss=0.3589, simple_loss=0.3947, pruned_loss=0.1615, over 4239137.51 frames. ], batch size: 702, lr: 3.41e-02, grad_scale: 32.0 2023-06-17 22:50:39,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-17 22:50:42,795 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.42 vs. 
limit=15.0 2023-06-17 22:50:55,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=63960.0, ans=0.125 2023-06-17 22:51:51,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=64140.0, ans=0.125 2023-06-17 22:52:06,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=64140.0, ans=0.125 2023-06-17 22:52:14,295 INFO [train.py:996] (2/4) Epoch 1, batch 10700, loss[loss=0.3322, simple_loss=0.3624, pruned_loss=0.151, over 21555.00 frames. ], tot_loss[loss=0.3604, simple_loss=0.3951, pruned_loss=0.1628, over 4249347.13 frames. ], batch size: 263, lr: 3.41e-02, grad_scale: 32.0 2023-06-17 22:52:35,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=64260.0, ans=0.2 2023-06-17 22:52:55,087 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.608e+02 4.131e+02 5.113e+02 6.555e+02 1.006e+03, threshold=1.023e+03, percent-clipped=2.0 2023-06-17 22:53:07,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=64320.0, ans=0.0 2023-06-17 22:53:46,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=64440.0, ans=0.125 2023-06-17 22:53:58,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=64500.0, ans=0.1 2023-06-17 22:53:59,515 INFO [train.py:996] (2/4) Epoch 1, batch 10750, loss[loss=0.383, simple_loss=0.4429, pruned_loss=0.1615, over 21873.00 frames. ], tot_loss[loss=0.3734, simple_loss=0.4077, pruned_loss=0.1696, over 4249137.01 frames. ], batch size: 316, lr: 3.40e-02, grad_scale: 32.0 2023-06-17 22:54:16,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=64500.0, ans=0.0 2023-06-17 22:54:55,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=64620.0, ans=0.0 2023-06-17 22:55:49,321 INFO [train.py:996] (2/4) Epoch 1, batch 10800, loss[loss=0.4737, simple_loss=0.481, pruned_loss=0.2332, over 21694.00 frames. ], tot_loss[loss=0.378, simple_loss=0.4135, pruned_loss=0.1712, over 4264074.94 frames. ], batch size: 351, lr: 3.40e-02, grad_scale: 32.0 2023-06-17 22:56:27,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=64920.0, ans=0.125 2023-06-17 22:56:30,678 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.758e+02 4.502e+02 5.308e+02 7.377e+02 1.430e+03, threshold=1.062e+03, percent-clipped=5.0 2023-06-17 22:56:34,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=64920.0, ans=0.0 2023-06-17 22:56:47,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=64920.0, ans=0.125 2023-06-17 22:57:33,931 INFO [train.py:996] (2/4) Epoch 1, batch 10850, loss[loss=0.37, simple_loss=0.3986, pruned_loss=0.1707, over 21561.00 frames. ], tot_loss[loss=0.3811, simple_loss=0.4162, pruned_loss=0.1731, over 4260330.89 frames. 
], batch size: 441, lr: 3.39e-02, grad_scale: 32.0 2023-06-17 22:57:40,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=65100.0, ans=0.125 2023-06-17 22:57:44,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=12.0 2023-06-17 22:57:55,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=65160.0, ans=0.125 2023-06-17 22:58:25,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=65220.0, ans=0.125 2023-06-17 22:58:27,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=65220.0, ans=0.0 2023-06-17 22:59:08,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=65340.0, ans=0.0 2023-06-17 22:59:10,241 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 22:59:17,703 INFO [train.py:996] (2/4) Epoch 1, batch 10900, loss[loss=0.3655, simple_loss=0.3903, pruned_loss=0.1703, over 21175.00 frames. ], tot_loss[loss=0.3726, simple_loss=0.4075, pruned_loss=0.1688, over 4269055.22 frames. ], batch size: 143, lr: 3.39e-02, grad_scale: 32.0 2023-06-17 22:59:59,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.489e+02 3.764e+02 4.430e+02 5.513e+02 1.224e+03, threshold=8.861e+02, percent-clipped=2.0 2023-06-17 23:00:29,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=65580.0, ans=0.125 2023-06-17 23:01:01,582 INFO [train.py:996] (2/4) Epoch 1, batch 10950, loss[loss=0.4153, simple_loss=0.443, pruned_loss=0.1938, over 20656.00 frames. ], tot_loss[loss=0.3671, simple_loss=0.402, pruned_loss=0.1661, over 4264052.21 frames. ], batch size: 607, lr: 3.38e-02, grad_scale: 32.0 2023-06-17 23:01:08,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=65700.0, ans=0.025 2023-06-17 23:01:40,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0 2023-06-17 23:02:09,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=65880.0, ans=0.2 2023-06-17 23:02:44,533 INFO [train.py:996] (2/4) Epoch 1, batch 11000, loss[loss=0.3729, simple_loss=0.3957, pruned_loss=0.175, over 21939.00 frames. ], tot_loss[loss=0.3687, simple_loss=0.4017, pruned_loss=0.1678, over 4267600.11 frames. ], batch size: 316, lr: 3.38e-02, grad_scale: 32.0 2023-06-17 23:03:24,875 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:03:24,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=66120.0, ans=0.125 2023-06-17 23:03:25,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.79 vs. 
limit=15.0 2023-06-17 23:03:25,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.455e+02 4.369e+02 5.427e+02 7.022e+02 1.248e+03, threshold=1.085e+03, percent-clipped=10.0 2023-06-17 23:03:31,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66120.0, ans=0.1 2023-06-17 23:03:50,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.06 vs. limit=12.0 2023-06-17 23:03:56,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=66180.0, ans=0.2 2023-06-17 23:04:24,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=5.98 vs. limit=10.0 2023-06-17 23:04:27,723 INFO [train.py:996] (2/4) Epoch 1, batch 11050, loss[loss=0.3387, simple_loss=0.3785, pruned_loss=0.1495, over 20006.00 frames. ], tot_loss[loss=0.3688, simple_loss=0.3997, pruned_loss=0.1689, over 4261578.01 frames. ], batch size: 703, lr: 3.37e-02, grad_scale: 32.0 2023-06-17 23:04:36,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66300.0, ans=0.1 2023-06-17 23:04:55,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=66360.0, ans=0.0 2023-06-17 23:04:58,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=66360.0, ans=0.2 2023-06-17 23:05:29,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=22.5 2023-06-17 23:05:30,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.57 vs. limit=10.0 2023-06-17 23:05:31,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=66480.0, ans=0.125 2023-06-17 23:05:39,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=15.0 2023-06-17 23:05:43,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-17 23:05:44,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=66480.0, ans=0.125 2023-06-17 23:06:11,023 INFO [train.py:996] (2/4) Epoch 1, batch 11100, loss[loss=0.3591, simple_loss=0.3861, pruned_loss=0.1661, over 21727.00 frames. ], tot_loss[loss=0.3673, simple_loss=0.3963, pruned_loss=0.1692, over 4261926.43 frames. ], batch size: 351, lr: 3.37e-02, grad_scale: 32.0 2023-06-17 23:06:37,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.35 vs. 
limit=22.5 2023-06-17 23:06:58,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.675e+02 3.924e+02 4.981e+02 6.262e+02 1.185e+03, threshold=9.963e+02, percent-clipped=1.0 2023-06-17 23:07:07,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=66720.0, ans=0.125 2023-06-17 23:07:07,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=12.0 2023-06-17 23:07:48,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=66840.0, ans=0.125 2023-06-17 23:07:48,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-17 23:07:55,932 INFO [train.py:996] (2/4) Epoch 1, batch 11150, loss[loss=0.3178, simple_loss=0.3332, pruned_loss=0.1512, over 20310.00 frames. ], tot_loss[loss=0.3638, simple_loss=0.3935, pruned_loss=0.167, over 4254857.44 frames. ], batch size: 703, lr: 3.36e-02, grad_scale: 32.0 2023-06-17 23:08:17,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=66960.0, ans=0.125 2023-06-17 23:08:19,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=66960.0, ans=0.1 2023-06-17 23:08:58,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=67080.0, ans=0.125 2023-06-17 23:09:04,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.00 vs. limit=15.0 2023-06-17 23:09:37,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=67200.0, ans=0.125 2023-06-17 23:09:38,654 INFO [train.py:996] (2/4) Epoch 1, batch 11200, loss[loss=0.3109, simple_loss=0.347, pruned_loss=0.1375, over 15496.00 frames. ], tot_loss[loss=0.3611, simple_loss=0.3912, pruned_loss=0.1655, over 4237845.79 frames. ], batch size: 61, lr: 3.36e-02, grad_scale: 32.0 2023-06-17 23:09:44,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.60 vs. limit=22.5 2023-06-17 23:10:09,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=67260.0, ans=0.2 2023-06-17 23:10:25,343 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 3.969e+02 4.814e+02 6.139e+02 9.199e+02, threshold=9.628e+02, percent-clipped=0.0 2023-06-17 23:11:21,151 INFO [train.py:996] (2/4) Epoch 1, batch 11250, loss[loss=0.3411, simple_loss=0.3961, pruned_loss=0.1431, over 21463.00 frames. ], tot_loss[loss=0.362, simple_loss=0.3913, pruned_loss=0.1663, over 4234815.83 frames. 
], batch size: 131, lr: 3.35e-02, grad_scale: 32.0 2023-06-17 23:12:08,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=67620.0, ans=0.0 2023-06-17 23:12:49,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=67740.0, ans=0.125 2023-06-17 23:12:51,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=67740.0, ans=0.0 2023-06-17 23:13:04,131 INFO [train.py:996] (2/4) Epoch 1, batch 11300, loss[loss=0.4127, simple_loss=0.4442, pruned_loss=0.1906, over 21772.00 frames. ], tot_loss[loss=0.3625, simple_loss=0.3926, pruned_loss=0.1662, over 4241424.34 frames. ], batch size: 414, lr: 3.35e-02, grad_scale: 32.0 2023-06-17 23:13:12,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=67800.0, ans=0.125 2023-06-17 23:13:24,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=67860.0, ans=0.05 2023-06-17 23:13:51,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.668e+02 4.732e+02 6.264e+02 1.219e+03, threshold=9.465e+02, percent-clipped=6.0 2023-06-17 23:13:58,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=67920.0, ans=0.125 2023-06-17 23:14:06,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=67920.0, ans=0.015 2023-06-17 23:14:06,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=67920.0, ans=0.0 2023-06-17 23:14:49,567 INFO [train.py:996] (2/4) Epoch 1, batch 11350, loss[loss=0.4063, simple_loss=0.4692, pruned_loss=0.1718, over 20817.00 frames. ], tot_loss[loss=0.3662, simple_loss=0.3977, pruned_loss=0.1673, over 4251550.85 frames. ], batch size: 608, lr: 3.34e-02, grad_scale: 32.0 2023-06-17 23:15:11,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=68160.0, ans=0.1 2023-06-17 23:15:22,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=68160.0, ans=15.0 2023-06-17 23:16:41,437 INFO [train.py:996] (2/4) Epoch 1, batch 11400, loss[loss=0.3897, simple_loss=0.4344, pruned_loss=0.1725, over 21862.00 frames. ], tot_loss[loss=0.3762, simple_loss=0.4077, pruned_loss=0.1723, over 4262261.42 frames. ], batch size: 372, lr: 3.34e-02, grad_scale: 32.0 2023-06-17 23:17:18,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.81 vs. 
limit=22.5 2023-06-17 23:17:28,742 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 4.138e+02 5.254e+02 6.973e+02 1.408e+03, threshold=1.051e+03, percent-clipped=10.0 2023-06-17 23:17:58,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=68580.0, ans=0.0 2023-06-17 23:18:00,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=68580.0, ans=0.125 2023-06-17 23:18:23,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=68640.0, ans=0.125 2023-06-17 23:18:23,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=68640.0, ans=0.0 2023-06-17 23:18:27,548 INFO [train.py:996] (2/4) Epoch 1, batch 11450, loss[loss=0.5681, simple_loss=0.5402, pruned_loss=0.298, over 21352.00 frames. ], tot_loss[loss=0.3744, simple_loss=0.4075, pruned_loss=0.1707, over 4246851.23 frames. ], batch size: 508, lr: 3.33e-02, grad_scale: 32.0 2023-06-17 23:18:51,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=15.0 2023-06-17 23:19:28,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. limit=15.0 2023-06-17 23:19:30,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=68820.0, ans=0.1 2023-06-17 23:19:30,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=15.0 2023-06-17 23:19:35,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=68880.0, ans=0.0 2023-06-17 23:20:01,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.22 vs. limit=22.5 2023-06-17 23:20:13,506 INFO [train.py:996] (2/4) Epoch 1, batch 11500, loss[loss=0.4076, simple_loss=0.4483, pruned_loss=0.1834, over 19982.00 frames. ], tot_loss[loss=0.3781, simple_loss=0.4114, pruned_loss=0.1724, over 4248250.64 frames. ], batch size: 703, lr: 3.33e-02, grad_scale: 32.0 2023-06-17 23:21:00,431 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 4.282e+02 5.552e+02 6.865e+02 1.531e+03, threshold=1.110e+03, percent-clipped=3.0 2023-06-17 23:21:18,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=69180.0, ans=0.0 2023-06-17 23:21:59,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-06-17 23:22:09,466 INFO [train.py:996] (2/4) Epoch 1, batch 11550, loss[loss=0.5919, simple_loss=0.6169, pruned_loss=0.2834, over 21488.00 frames. ], tot_loss[loss=0.3777, simple_loss=0.4158, pruned_loss=0.1698, over 4262294.74 frames. 
], batch size: 471, lr: 3.32e-02, grad_scale: 32.0 2023-06-17 23:22:32,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=69360.0, ans=0.0 2023-06-17 23:22:42,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=22.10 vs. limit=22.5 2023-06-17 23:23:40,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=69540.0, ans=0.0 2023-06-17 23:23:42,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=69540.0, ans=0.05 2023-06-17 23:23:47,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=69540.0, ans=0.0 2023-06-17 23:23:51,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-17 23:23:54,947 INFO [train.py:996] (2/4) Epoch 1, batch 11600, loss[loss=0.4334, simple_loss=0.5133, pruned_loss=0.1767, over 21261.00 frames. ], tot_loss[loss=0.3899, simple_loss=0.4338, pruned_loss=0.173, over 4263254.47 frames. ], batch size: 549, lr: 3.32e-02, grad_scale: 32.0 2023-06-17 23:24:16,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=69660.0, ans=0.1 2023-06-17 23:24:39,730 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.776e+02 4.538e+02 6.004e+02 8.984e+02 1.767e+03, threshold=1.201e+03, percent-clipped=15.0 2023-06-17 23:24:57,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=69780.0, ans=0.125 2023-06-17 23:25:01,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=69780.0, ans=0.0 2023-06-17 23:25:11,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=69780.0, ans=0.025 2023-06-17 23:25:19,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=69840.0, ans=0.125 2023-06-17 23:25:32,119 INFO [train.py:996] (2/4) Epoch 1, batch 11650, loss[loss=0.4452, simple_loss=0.4608, pruned_loss=0.2149, over 21337.00 frames. ], tot_loss[loss=0.3918, simple_loss=0.4384, pruned_loss=0.1726, over 4260964.50 frames. ], batch size: 471, lr: 3.31e-02, grad_scale: 16.0 2023-06-17 23:26:01,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=69960.0, ans=0.0 2023-06-17 23:26:09,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=70020.0, ans=0.125 2023-06-17 23:26:52,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=70080.0, ans=0.0 2023-06-17 23:27:00,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.59 vs. limit=10.0 2023-06-17 23:27:15,726 INFO [train.py:996] (2/4) Epoch 1, batch 11700, loss[loss=0.3378, simple_loss=0.3641, pruned_loss=0.1558, over 21731.00 frames. 
], tot_loss[loss=0.384, simple_loss=0.4268, pruned_loss=0.1706, over 4261242.90 frames. ], batch size: 317, lr: 3.31e-02, grad_scale: 16.0 2023-06-17 23:27:19,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=70200.0, ans=0.125 2023-06-17 23:27:46,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=70260.0, ans=0.125 2023-06-17 23:28:00,022 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.900e+02 4.129e+02 5.507e+02 7.167e+02 1.590e+03, threshold=1.101e+03, percent-clipped=1.0 2023-06-17 23:28:00,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=70320.0, ans=0.1 2023-06-17 23:28:46,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=70440.0, ans=0.2 2023-06-17 23:28:52,885 INFO [train.py:996] (2/4) Epoch 1, batch 11750, loss[loss=0.4301, simple_loss=0.4472, pruned_loss=0.2064, over 21675.00 frames. ], tot_loss[loss=0.3779, simple_loss=0.4156, pruned_loss=0.17, over 4268467.55 frames. ], batch size: 441, lr: 3.30e-02, grad_scale: 16.0 2023-06-17 23:28:56,640 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-17 23:29:15,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=70560.0, ans=0.125 2023-06-17 23:29:25,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=70560.0, ans=0.125 2023-06-17 23:30:06,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=70680.0, ans=0.125 2023-06-17 23:30:35,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-17 23:30:38,210 INFO [train.py:996] (2/4) Epoch 1, batch 11800, loss[loss=0.3796, simple_loss=0.4525, pruned_loss=0.1533, over 21771.00 frames. ], tot_loss[loss=0.3827, simple_loss=0.4177, pruned_loss=0.1738, over 4268529.18 frames. ], batch size: 282, lr: 3.30e-02, grad_scale: 16.0 2023-06-17 23:30:52,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=70800.0, ans=0.125 2023-06-17 23:31:16,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=70920.0, ans=0.125 2023-06-17 23:31:29,238 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.727e+02 4.871e+02 6.879e+02 1.447e+03, threshold=9.741e+02, percent-clipped=5.0 2023-06-17 23:32:22,096 INFO [train.py:996] (2/4) Epoch 1, batch 11850, loss[loss=0.4125, simple_loss=0.4308, pruned_loss=0.1971, over 20099.00 frames. ], tot_loss[loss=0.3812, simple_loss=0.4184, pruned_loss=0.172, over 4272427.47 frames. ], batch size: 707, lr: 3.29e-02, grad_scale: 16.0 2023-06-17 23:34:12,214 INFO [train.py:996] (2/4) Epoch 1, batch 11900, loss[loss=0.3354, simple_loss=0.4025, pruned_loss=0.1342, over 21674.00 frames. ], tot_loss[loss=0.3768, simple_loss=0.4171, pruned_loss=0.1682, over 4263333.64 frames. 
], batch size: 414, lr: 3.29e-02, grad_scale: 16.0 2023-06-17 23:34:14,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=71400.0, ans=0.2 2023-06-17 23:35:08,168 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.592e+02 4.787e+02 5.877e+02 1.275e+03, threshold=9.575e+02, percent-clipped=4.0 2023-06-17 23:35:56,081 INFO [train.py:996] (2/4) Epoch 1, batch 11950, loss[loss=0.303, simple_loss=0.3689, pruned_loss=0.1186, over 21578.00 frames. ], tot_loss[loss=0.3673, simple_loss=0.4129, pruned_loss=0.1609, over 4272876.13 frames. ], batch size: 230, lr: 3.28e-02, grad_scale: 16.0 2023-06-17 23:36:44,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-17 23:37:39,777 INFO [train.py:996] (2/4) Epoch 1, batch 12000, loss[loss=0.3429, simple_loss=0.3697, pruned_loss=0.1581, over 21874.00 frames. ], tot_loss[loss=0.3625, simple_loss=0.4058, pruned_loss=0.1596, over 4270653.44 frames. ], batch size: 98, lr: 3.28e-02, grad_scale: 32.0 2023-06-17 23:37:39,777 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-17 23:37:57,347 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3348, simple_loss=0.4196, pruned_loss=0.125, over 1796401.00 frames. 2023-06-17 23:37:57,348 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-17 23:38:33,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=72060.0, ans=0.1 2023-06-17 23:38:52,956 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 3.693e+02 4.861e+02 6.052e+02 1.192e+03, threshold=9.721e+02, percent-clipped=3.0 2023-06-17 23:39:03,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=7.06 vs. limit=6.0 2023-06-17 23:39:41,336 INFO [train.py:996] (2/4) Epoch 1, batch 12050, loss[loss=0.3678, simple_loss=0.3838, pruned_loss=0.176, over 21246.00 frames. ], tot_loss[loss=0.3648, simple_loss=0.4034, pruned_loss=0.1631, over 4275905.92 frames. ], batch size: 176, lr: 3.27e-02, grad_scale: 32.0 2023-06-17 23:39:53,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=72300.0, ans=0.125 2023-06-17 23:41:11,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=72540.0, ans=0.0 2023-06-17 23:41:32,455 INFO [train.py:996] (2/4) Epoch 1, batch 12100, loss[loss=0.4043, simple_loss=0.4678, pruned_loss=0.1704, over 19771.00 frames. ], tot_loss[loss=0.3781, simple_loss=0.4146, pruned_loss=0.1708, over 4272182.34 frames. ], batch size: 702, lr: 3.27e-02, grad_scale: 16.0 2023-06-17 23:41:42,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=22.5 2023-06-17 23:41:47,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=22.5 2023-06-17 23:41:47,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.30 vs. 
limit=15.0 2023-06-17 23:42:26,178 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.143e+02 4.439e+02 6.434e+02 8.417e+02 1.460e+03, threshold=1.287e+03, percent-clipped=16.0 2023-06-17 23:43:23,521 INFO [train.py:996] (2/4) Epoch 1, batch 12150, loss[loss=0.3613, simple_loss=0.4296, pruned_loss=0.1465, over 21818.00 frames. ], tot_loss[loss=0.3804, simple_loss=0.4182, pruned_loss=0.1712, over 4273847.52 frames. ], batch size: 371, lr: 3.26e-02, grad_scale: 16.0 2023-06-17 23:44:00,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-17 23:44:03,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=73020.0, ans=0.125 2023-06-17 23:45:00,533 INFO [train.py:996] (2/4) Epoch 1, batch 12200, loss[loss=0.3462, simple_loss=0.3711, pruned_loss=0.1607, over 21849.00 frames. ], tot_loss[loss=0.3753, simple_loss=0.4119, pruned_loss=0.1694, over 4268236.39 frames. ], batch size: 98, lr: 3.26e-02, grad_scale: 16.0 2023-06-17 23:45:01,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=12.0 2023-06-17 23:45:01,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.15 vs. limit=15.0 2023-06-17 23:45:12,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=73200.0, ans=0.1 2023-06-17 23:45:37,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=73320.0, ans=0.125 2023-06-17 23:45:41,555 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=21.49 vs. limit=15.0 2023-06-17 23:45:45,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.575e+02 3.853e+02 4.664e+02 5.869e+02 1.070e+03, threshold=9.327e+02, percent-clipped=0.0 2023-06-17 23:45:54,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=73380.0, ans=0.125 2023-06-17 23:45:56,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=73380.0, ans=0.05 2023-06-17 23:46:25,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=73440.0, ans=0.125 2023-06-17 23:46:26,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=73440.0, ans=0.125 2023-06-17 23:46:42,500 INFO [train.py:996] (2/4) Epoch 1, batch 12250, loss[loss=0.2267, simple_loss=0.2963, pruned_loss=0.07854, over 21516.00 frames. ], tot_loss[loss=0.3643, simple_loss=0.4016, pruned_loss=0.1635, over 4263771.34 frames. ], batch size: 212, lr: 3.25e-02, grad_scale: 16.0 2023-06-17 23:46:48,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=73500.0, ans=0.125 2023-06-17 23:47:27,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.70 vs. 
limit=15.0 2023-06-17 23:47:49,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=73680.0, ans=0.0 2023-06-17 23:48:25,667 INFO [train.py:996] (2/4) Epoch 1, batch 12300, loss[loss=0.3343, simple_loss=0.4064, pruned_loss=0.1311, over 21639.00 frames. ], tot_loss[loss=0.3476, simple_loss=0.3903, pruned_loss=0.1525, over 4271338.31 frames. ], batch size: 389, lr: 3.25e-02, grad_scale: 16.0 2023-06-17 23:49:12,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.740e+02 4.870e+02 6.587e+02 1.091e+03, threshold=9.740e+02, percent-clipped=4.0 2023-06-17 23:49:51,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.93 vs. limit=6.0 2023-06-17 23:50:08,222 INFO [train.py:996] (2/4) Epoch 1, batch 12350, loss[loss=0.4224, simple_loss=0.4405, pruned_loss=0.2022, over 21838.00 frames. ], tot_loss[loss=0.3547, simple_loss=0.3983, pruned_loss=0.1556, over 4276647.24 frames. ], batch size: 332, lr: 3.24e-02, grad_scale: 16.0 2023-06-17 23:50:13,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=74100.0, ans=0.0 2023-06-17 23:50:37,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=74160.0, ans=0.125 2023-06-17 23:50:44,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=74220.0, ans=0.0 2023-06-17 23:51:33,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=74340.0, ans=0.09899494936611666 2023-06-17 23:51:49,167 INFO [train.py:996] (2/4) Epoch 1, batch 12400, loss[loss=0.3905, simple_loss=0.4232, pruned_loss=0.1789, over 21720.00 frames. ], tot_loss[loss=0.36, simple_loss=0.4003, pruned_loss=0.1599, over 4275268.76 frames. ], batch size: 389, lr: 3.24e-02, grad_scale: 32.0 2023-06-17 23:51:57,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=74400.0, ans=0.1 2023-06-17 23:52:15,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=74460.0, ans=0.1 2023-06-17 23:52:34,930 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.026e+02 5.096e+02 6.661e+02 1.103e+03, threshold=1.019e+03, percent-clipped=2.0 2023-06-17 23:52:45,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=74520.0, ans=0.125 2023-06-17 23:53:08,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=74580.0, ans=0.0 2023-06-17 23:53:10,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=74580.0, ans=0.1 2023-06-17 23:53:21,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=74640.0, ans=0.125 2023-06-17 23:53:31,334 INFO [train.py:996] (2/4) Epoch 1, batch 12450, loss[loss=0.4137, simple_loss=0.4462, pruned_loss=0.1906, over 21874.00 frames. ], tot_loss[loss=0.3685, simple_loss=0.4061, pruned_loss=0.1655, over 4284512.97 frames. 
], batch size: 371, lr: 3.23e-02, grad_scale: 32.0 2023-06-17 23:53:31,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=74700.0, ans=0.125 2023-06-17 23:53:58,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=74760.0, ans=0.125 2023-06-17 23:54:01,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=74760.0, ans=0.2 2023-06-17 23:54:04,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=74760.0, ans=0.125 2023-06-17 23:54:26,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=74820.0, ans=0.2 2023-06-17 23:54:39,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=74880.0, ans=0.1 2023-06-17 23:54:47,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=74880.0, ans=0.125 2023-06-17 23:55:16,052 INFO [train.py:996] (2/4) Epoch 1, batch 12500, loss[loss=0.4043, simple_loss=0.4558, pruned_loss=0.1765, over 21390.00 frames. ], tot_loss[loss=0.3832, simple_loss=0.4204, pruned_loss=0.173, over 4288268.70 frames. ], batch size: 194, lr: 3.23e-02, grad_scale: 32.0 2023-06-17 23:55:31,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=75060.0, ans=0.05 2023-06-17 23:55:48,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=75060.0, ans=0.125 2023-06-17 23:56:14,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.070e+02 4.603e+02 5.505e+02 7.191e+02 1.270e+03, threshold=1.101e+03, percent-clipped=4.0 2023-06-17 23:56:29,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75180.0, ans=0.1 2023-06-17 23:56:45,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75240.0, ans=0.1 2023-06-17 23:56:57,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=75240.0, ans=0.2 2023-06-17 23:57:02,631 INFO [train.py:996] (2/4) Epoch 1, batch 12550, loss[loss=0.4956, simple_loss=0.5042, pruned_loss=0.2435, over 21353.00 frames. ], tot_loss[loss=0.3897, simple_loss=0.4265, pruned_loss=0.1764, over 4284142.21 frames. ], batch size: 507, lr: 3.22e-02, grad_scale: 32.0 2023-06-17 23:57:12,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=75300.0, ans=0.0 2023-06-17 23:57:36,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=75360.0, ans=0.0 2023-06-17 23:57:41,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.11 vs. 
limit=22.5 2023-06-17 23:58:33,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=75540.0, ans=0.2 2023-06-17 23:58:38,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=75540.0, ans=0.125 2023-06-17 23:58:44,740 INFO [train.py:996] (2/4) Epoch 1, batch 12600, loss[loss=0.3412, simple_loss=0.4051, pruned_loss=0.1386, over 21659.00 frames. ], tot_loss[loss=0.3823, simple_loss=0.4219, pruned_loss=0.1713, over 4283406.99 frames. ], batch size: 414, lr: 3.22e-02, grad_scale: 32.0 2023-06-17 23:59:41,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.279e+02 3.697e+02 4.571e+02 5.714e+02 1.241e+03, threshold=9.141e+02, percent-clipped=1.0 2023-06-18 00:00:21,828 INFO [train.py:996] (2/4) Epoch 1, batch 12650, loss[loss=0.333, simple_loss=0.3672, pruned_loss=0.1494, over 21685.00 frames. ], tot_loss[loss=0.3695, simple_loss=0.4121, pruned_loss=0.1635, over 4290356.47 frames. ], batch size: 230, lr: 3.21e-02, grad_scale: 32.0 2023-06-18 00:00:42,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=75900.0, ans=12.0 2023-06-18 00:00:43,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=75900.0, ans=0.1 2023-06-18 00:00:45,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=75900.0, ans=0.125 2023-06-18 00:00:46,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=75960.0, ans=0.125 2023-06-18 00:01:13,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=76020.0, ans=0.1 2023-06-18 00:01:21,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-18 00:01:37,761 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:02:03,338 INFO [train.py:996] (2/4) Epoch 1, batch 12700, loss[loss=0.3654, simple_loss=0.3987, pruned_loss=0.166, over 21261.00 frames. ], tot_loss[loss=0.3733, simple_loss=0.4121, pruned_loss=0.1673, over 4293164.30 frames. ], batch size: 176, lr: 3.21e-02, grad_scale: 32.0 2023-06-18 00:02:25,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=76260.0, ans=0.125 2023-06-18 00:02:54,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.709e+02 4.032e+02 5.068e+02 7.064e+02 1.461e+03, threshold=1.014e+03, percent-clipped=9.0 2023-06-18 00:02:55,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=76320.0, ans=0.0 2023-06-18 00:03:40,099 INFO [train.py:996] (2/4) Epoch 1, batch 12750, loss[loss=0.3648, simple_loss=0.3944, pruned_loss=0.1676, over 21894.00 frames. ], tot_loss[loss=0.3741, simple_loss=0.4134, pruned_loss=0.1674, over 4286113.46 frames. 
], batch size: 118, lr: 3.20e-02, grad_scale: 32.0 2023-06-18 00:03:40,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=76500.0, ans=0.1 2023-06-18 00:04:10,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=76560.0, ans=0.125 2023-06-18 00:04:36,241 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=22.5 2023-06-18 00:05:03,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=76740.0, ans=0.125 2023-06-18 00:05:06,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=76740.0, ans=0.125 2023-06-18 00:05:32,636 INFO [train.py:996] (2/4) Epoch 1, batch 12800, loss[loss=0.3289, simple_loss=0.3767, pruned_loss=0.1405, over 21436.00 frames. ], tot_loss[loss=0.374, simple_loss=0.412, pruned_loss=0.168, over 4288513.37 frames. ], batch size: 211, lr: 3.20e-02, grad_scale: 32.0 2023-06-18 00:05:40,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=76800.0, ans=10.0 2023-06-18 00:05:53,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=76860.0, ans=0.025 2023-06-18 00:06:20,453 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 3.969e+02 4.961e+02 6.426e+02 1.503e+03, threshold=9.923e+02, percent-clipped=9.0 2023-06-18 00:06:32,710 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:06:34,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=76980.0, ans=0.2 2023-06-18 00:07:01,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=77040.0, ans=0.125 2023-06-18 00:07:12,475 INFO [train.py:996] (2/4) Epoch 1, batch 12850, loss[loss=0.4432, simple_loss=0.4972, pruned_loss=0.1946, over 19837.00 frames. ], tot_loss[loss=0.378, simple_loss=0.4155, pruned_loss=0.1702, over 4287980.84 frames. ], batch size: 703, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:07:13,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=77100.0, ans=0.1 2023-06-18 00:07:29,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=77100.0, ans=0.0 2023-06-18 00:07:44,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-18 00:08:39,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=77340.0, ans=0.04949747468305833 2023-06-18 00:09:00,995 INFO [train.py:996] (2/4) Epoch 1, batch 12900, loss[loss=0.3497, simple_loss=0.4064, pruned_loss=0.1465, over 21829.00 frames. ], tot_loss[loss=0.3678, simple_loss=0.4092, pruned_loss=0.1632, over 4284318.92 frames. 
], batch size: 333, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:09:15,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=77460.0, ans=10.0 2023-06-18 00:09:25,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=77460.0, ans=0.0 2023-06-18 00:09:39,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=77520.0, ans=0.0 2023-06-18 00:09:46,055 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.853e+02 4.882e+02 6.013e+02 9.581e+02, threshold=9.764e+02, percent-clipped=0.0 2023-06-18 00:10:43,808 INFO [train.py:996] (2/4) Epoch 1, batch 12950, loss[loss=0.397, simple_loss=0.4719, pruned_loss=0.1611, over 19794.00 frames. ], tot_loss[loss=0.3619, simple_loss=0.405, pruned_loss=0.1594, over 4277162.34 frames. ], batch size: 703, lr: 3.19e-02, grad_scale: 32.0 2023-06-18 00:11:18,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=77760.0, ans=0.125 2023-06-18 00:11:48,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=77880.0, ans=0.125 2023-06-18 00:11:52,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=77880.0, ans=0.125 2023-06-18 00:11:57,533 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-18 00:12:06,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=77880.0, ans=0.1 2023-06-18 00:12:14,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=77940.0, ans=0.0 2023-06-18 00:12:28,720 INFO [train.py:996] (2/4) Epoch 1, batch 13000, loss[loss=0.3752, simple_loss=0.4165, pruned_loss=0.1669, over 21500.00 frames. ], tot_loss[loss=0.3647, simple_loss=0.4071, pruned_loss=0.1612, over 4274839.07 frames. ], batch size: 471, lr: 3.18e-02, grad_scale: 16.0 2023-06-18 00:12:45,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=78060.0, ans=0.05 2023-06-18 00:13:04,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=12.0 2023-06-18 00:13:20,480 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 4.234e+02 5.570e+02 6.916e+02 1.204e+03, threshold=1.114e+03, percent-clipped=4.0 2023-06-18 00:14:05,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=78240.0, ans=0.125 2023-06-18 00:14:09,921 INFO [train.py:996] (2/4) Epoch 1, batch 13050, loss[loss=0.384, simple_loss=0.4101, pruned_loss=0.179, over 21438.00 frames. ], tot_loss[loss=0.3592, simple_loss=0.4025, pruned_loss=0.158, over 4279682.63 frames. 
], batch size: 131, lr: 3.18e-02, grad_scale: 16.0 2023-06-18 00:15:04,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=78420.0, ans=0.0 2023-06-18 00:15:14,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-18 00:15:42,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-18 00:15:55,024 INFO [train.py:996] (2/4) Epoch 1, batch 13100, loss[loss=0.3457, simple_loss=0.3998, pruned_loss=0.1458, over 21807.00 frames. ], tot_loss[loss=0.3608, simple_loss=0.4046, pruned_loss=0.1585, over 4280497.42 frames. ], batch size: 282, lr: 3.17e-02, grad_scale: 16.0 2023-06-18 00:15:56,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.72 vs. limit=15.0 2023-06-18 00:16:53,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=78720.0, ans=0.125 2023-06-18 00:16:54,431 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.991e+02 4.696e+02 5.724e+02 7.991e+02 1.405e+03, threshold=1.145e+03, percent-clipped=4.0 2023-06-18 00:17:45,916 INFO [train.py:996] (2/4) Epoch 1, batch 13150, loss[loss=0.3631, simple_loss=0.3878, pruned_loss=0.1692, over 21795.00 frames. ], tot_loss[loss=0.3677, simple_loss=0.4081, pruned_loss=0.1637, over 4282706.12 frames. ], batch size: 124, lr: 3.17e-02, grad_scale: 16.0 2023-06-18 00:17:47,137 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.03 vs. limit=6.0 2023-06-18 00:18:37,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=79020.0, ans=0.0 2023-06-18 00:18:50,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=79080.0, ans=0.0 2023-06-18 00:18:59,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.50 vs. limit=10.0 2023-06-18 00:19:00,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=79080.0, ans=0.1 2023-06-18 00:19:00,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=79080.0, ans=0.0 2023-06-18 00:19:02,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=79080.0, ans=0.125 2023-06-18 00:19:30,343 INFO [train.py:996] (2/4) Epoch 1, batch 13200, loss[loss=0.3796, simple_loss=0.4131, pruned_loss=0.173, over 21799.00 frames. ], tot_loss[loss=0.3699, simple_loss=0.4089, pruned_loss=0.1654, over 4278719.76 frames. 
], batch size: 247, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 00:19:40,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=79200.0, ans=0.125 2023-06-18 00:20:05,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=79260.0, ans=0.125 2023-06-18 00:20:05,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.77 vs. limit=15.0 2023-06-18 00:20:27,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.628e+02 3.872e+02 4.776e+02 6.394e+02 8.489e+02, threshold=9.552e+02, percent-clipped=0.0 2023-06-18 00:20:48,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=79380.0, ans=0.0 2023-06-18 00:21:01,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=79440.0, ans=0.125 2023-06-18 00:21:09,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-18 00:21:18,114 INFO [train.py:996] (2/4) Epoch 1, batch 13250, loss[loss=0.3605, simple_loss=0.3959, pruned_loss=0.1626, over 21822.00 frames. ], tot_loss[loss=0.3723, simple_loss=0.4095, pruned_loss=0.1676, over 4279987.06 frames. ], batch size: 107, lr: 3.16e-02, grad_scale: 32.0 2023-06-18 00:21:25,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-06-18 00:21:32,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=79500.0, ans=0.0 2023-06-18 00:22:10,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=79620.0, ans=0.125 2023-06-18 00:22:37,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=79680.0, ans=0.0 2023-06-18 00:23:07,131 INFO [train.py:996] (2/4) Epoch 1, batch 13300, loss[loss=0.3157, simple_loss=0.4321, pruned_loss=0.09967, over 19807.00 frames. ], tot_loss[loss=0.3739, simple_loss=0.4127, pruned_loss=0.1675, over 4276384.53 frames. ], batch size: 702, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 00:23:11,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-18 00:23:19,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.21 vs. limit=15.0 2023-06-18 00:23:29,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=79860.0, ans=0.125 2023-06-18 00:23:55,798 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.809e+02 3.934e+02 5.014e+02 6.811e+02 1.186e+03, threshold=1.003e+03, percent-clipped=5.0 2023-06-18 00:24:51,653 INFO [train.py:996] (2/4) Epoch 1, batch 13350, loss[loss=0.3949, simple_loss=0.4805, pruned_loss=0.1547, over 19662.00 frames. ], tot_loss[loss=0.3816, simple_loss=0.4198, pruned_loss=0.1717, over 4273349.29 frames. 
], batch size: 702, lr: 3.15e-02, grad_scale: 32.0 2023-06-18 00:24:52,037 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:25:03,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=80100.0, ans=0.0 2023-06-18 00:25:06,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=80100.0, ans=0.0 2023-06-18 00:25:14,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-18 00:25:17,481 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:25:57,979 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:25:59,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=80280.0, ans=0.2 2023-06-18 00:26:30,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=80340.0, ans=0.125 2023-06-18 00:26:40,435 INFO [train.py:996] (2/4) Epoch 1, batch 13400, loss[loss=0.4205, simple_loss=0.4427, pruned_loss=0.1992, over 21857.00 frames. ], tot_loss[loss=0.3864, simple_loss=0.4222, pruned_loss=0.1753, over 4273823.79 frames. ], batch size: 371, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:26:43,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.19 vs. limit=22.5 2023-06-18 00:26:44,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=80400.0, ans=0.0 2023-06-18 00:27:24,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=80520.0, ans=0.125 2023-06-18 00:27:27,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.979e+02 4.393e+02 5.548e+02 7.060e+02 1.249e+03, threshold=1.110e+03, percent-clipped=4.0 2023-06-18 00:27:27,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=80520.0, ans=0.0 2023-06-18 00:28:23,674 INFO [train.py:996] (2/4) Epoch 1, batch 13450, loss[loss=0.3114, simple_loss=0.3144, pruned_loss=0.1542, over 16710.00 frames. ], tot_loss[loss=0.3891, simple_loss=0.422, pruned_loss=0.1781, over 4271779.29 frames. 
], batch size: 60, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:28:48,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=80760.0, ans=0.125 2023-06-18 00:28:48,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=80760.0, ans=0.0 2023-06-18 00:29:34,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=80880.0, ans=0.1 2023-06-18 00:29:37,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=80880.0, ans=0.125 2023-06-18 00:29:46,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=80940.0, ans=0.0 2023-06-18 00:29:59,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=80940.0, ans=0.0 2023-06-18 00:30:08,372 INFO [train.py:996] (2/4) Epoch 1, batch 13500, loss[loss=0.3208, simple_loss=0.3751, pruned_loss=0.1333, over 21880.00 frames. ], tot_loss[loss=0.3765, simple_loss=0.4093, pruned_loss=0.1718, over 4265946.24 frames. ], batch size: 317, lr: 3.14e-02, grad_scale: 32.0 2023-06-18 00:30:17,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=81000.0, ans=0.0 2023-06-18 00:30:22,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.11 vs. limit=22.5 2023-06-18 00:30:36,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=15.0 2023-06-18 00:30:44,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=81060.0, ans=0.125 2023-06-18 00:31:07,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.719e+02 4.002e+02 4.680e+02 6.090e+02 1.151e+03, threshold=9.360e+02, percent-clipped=1.0 2023-06-18 00:31:21,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=81180.0, ans=0.125 2023-06-18 00:31:52,232 INFO [train.py:996] (2/4) Epoch 1, batch 13550, loss[loss=0.3397, simple_loss=0.3701, pruned_loss=0.1546, over 21727.00 frames. ], tot_loss[loss=0.3767, simple_loss=0.4129, pruned_loss=0.1702, over 4252488.08 frames. ], batch size: 112, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 00:32:02,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=81300.0, ans=0.0 2023-06-18 00:32:22,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=81360.0, ans=0.125 2023-06-18 00:32:44,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=81420.0, ans=0.125 2023-06-18 00:33:00,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=81480.0, ans=0.125 2023-06-18 00:33:01,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.81 vs. 
limit=15.0 2023-06-18 00:33:34,668 INFO [train.py:996] (2/4) Epoch 1, batch 13600, loss[loss=0.3757, simple_loss=0.4101, pruned_loss=0.1707, over 21899.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.4163, pruned_loss=0.1718, over 4261986.38 frames. ], batch size: 351, lr: 3.13e-02, grad_scale: 32.0 2023-06-18 00:34:16,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=81720.0, ans=0.0 2023-06-18 00:34:27,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 4.484e+02 6.125e+02 7.575e+02 1.688e+03, threshold=1.225e+03, percent-clipped=13.0 2023-06-18 00:34:29,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=81720.0, ans=0.125 2023-06-18 00:34:35,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=81780.0, ans=0.0 2023-06-18 00:34:36,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.58 vs. limit=8.0 2023-06-18 00:34:56,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.96 vs. limit=6.0 2023-06-18 00:35:11,063 INFO [train.py:996] (2/4) Epoch 1, batch 13650, loss[loss=0.3091, simple_loss=0.3416, pruned_loss=0.1383, over 21540.00 frames. ], tot_loss[loss=0.3699, simple_loss=0.4081, pruned_loss=0.1659, over 4260783.16 frames. ], batch size: 263, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 00:36:12,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=82020.0, ans=0.125 2023-06-18 00:36:33,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=82080.0, ans=0.125 2023-06-18 00:36:33,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=82080.0, ans=0.05 2023-06-18 00:36:59,458 INFO [train.py:996] (2/4) Epoch 1, batch 13700, loss[loss=0.338, simple_loss=0.3795, pruned_loss=0.1483, over 21773.00 frames. ], tot_loss[loss=0.3646, simple_loss=0.4007, pruned_loss=0.1643, over 4269082.33 frames. 
], batch size: 316, lr: 3.12e-02, grad_scale: 32.0 2023-06-18 00:37:07,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=82200.0, ans=0.0 2023-06-18 00:37:10,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=82200.0, ans=0.0 2023-06-18 00:37:25,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=82260.0, ans=0.0 2023-06-18 00:37:40,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=82260.0, ans=0.125 2023-06-18 00:37:50,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=82320.0, ans=0.125 2023-06-18 00:37:53,347 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.729e+02 3.850e+02 5.196e+02 6.756e+02 1.127e+03, threshold=1.039e+03, percent-clipped=0.0 2023-06-18 00:38:01,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=82320.0, ans=0.125 2023-06-18 00:38:12,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=82380.0, ans=0.1 2023-06-18 00:38:43,667 INFO [train.py:996] (2/4) Epoch 1, batch 13750, loss[loss=0.3359, simple_loss=0.4009, pruned_loss=0.1354, over 21175.00 frames. ], tot_loss[loss=0.3546, simple_loss=0.3923, pruned_loss=0.1585, over 4264012.70 frames. ], batch size: 548, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:39:55,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=22.5 2023-06-18 00:40:26,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=82740.0, ans=0.0 2023-06-18 00:40:32,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=22.5 2023-06-18 00:40:40,340 INFO [train.py:996] (2/4) Epoch 1, batch 13800, loss[loss=0.3901, simple_loss=0.4636, pruned_loss=0.1583, over 21827.00 frames. ], tot_loss[loss=0.3575, simple_loss=0.3989, pruned_loss=0.1581, over 4264395.76 frames. ], batch size: 371, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:40:40,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=82800.0, ans=0.1 2023-06-18 00:40:40,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82800.0, ans=0.1 2023-06-18 00:40:57,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=82860.0, ans=0.2 2023-06-18 00:41:07,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=82860.0, ans=0.1 2023-06-18 00:41:22,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.36 vs. 
limit=15.0 2023-06-18 00:41:23,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=82920.0, ans=0.125 2023-06-18 00:41:28,047 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.821e+02 3.900e+02 5.256e+02 6.721e+02 1.169e+03, threshold=1.051e+03, percent-clipped=1.0 2023-06-18 00:41:48,540 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.334e-02 2023-06-18 00:42:22,665 INFO [train.py:996] (2/4) Epoch 1, batch 13850, loss[loss=0.2829, simple_loss=0.3432, pruned_loss=0.1113, over 21867.00 frames. ], tot_loss[loss=0.3658, simple_loss=0.4087, pruned_loss=0.1615, over 4267183.26 frames. ], batch size: 107, lr: 3.11e-02, grad_scale: 32.0 2023-06-18 00:42:27,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=83100.0, ans=0.125 2023-06-18 00:43:55,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=83340.0, ans=0.0 2023-06-18 00:44:06,039 INFO [train.py:996] (2/4) Epoch 1, batch 13900, loss[loss=0.4686, simple_loss=0.4652, pruned_loss=0.236, over 21742.00 frames. ], tot_loss[loss=0.3768, simple_loss=0.4156, pruned_loss=0.169, over 4268108.62 frames. ], batch size: 441, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 00:44:09,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=83400.0, ans=0.015 2023-06-18 00:44:21,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=83400.0, ans=0.125 2023-06-18 00:44:58,293 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.781e+02 4.127e+02 5.100e+02 6.768e+02 1.105e+03, threshold=1.020e+03, percent-clipped=2.0 2023-06-18 00:45:48,150 INFO [train.py:996] (2/4) Epoch 1, batch 13950, loss[loss=0.3669, simple_loss=0.4068, pruned_loss=0.1635, over 21851.00 frames. ], tot_loss[loss=0.3796, simple_loss=0.4156, pruned_loss=0.1718, over 4283841.23 frames. ], batch size: 332, lr: 3.10e-02, grad_scale: 32.0 2023-06-18 00:46:06,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. limit=6.0 2023-06-18 00:46:12,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=83760.0, ans=0.025 2023-06-18 00:46:43,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=83820.0, ans=0.2 2023-06-18 00:46:46,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.18 vs. limit=22.5 2023-06-18 00:46:54,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=83880.0, ans=0.125 2023-06-18 00:47:30,517 INFO [train.py:996] (2/4) Epoch 1, batch 14000, loss[loss=0.2454, simple_loss=0.3039, pruned_loss=0.09342, over 21675.00 frames. ], tot_loss[loss=0.3698, simple_loss=0.4079, pruned_loss=0.1658, over 4279589.37 frames. 
], batch size: 263, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 00:47:43,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=84000.0, ans=0.2 2023-06-18 00:48:28,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.708e+02 4.933e+02 6.099e+02 9.890e+02, threshold=9.866e+02, percent-clipped=0.0 2023-06-18 00:48:29,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=84120.0, ans=0.015 2023-06-18 00:48:29,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=84120.0, ans=10.0 2023-06-18 00:48:33,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=84180.0, ans=0.125 2023-06-18 00:49:04,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=84240.0, ans=0.2 2023-06-18 00:49:10,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=84240.0, ans=0.0 2023-06-18 00:49:18,922 INFO [train.py:996] (2/4) Epoch 1, batch 14050, loss[loss=0.3125, simple_loss=0.3493, pruned_loss=0.1378, over 21167.00 frames. ], tot_loss[loss=0.3605, simple_loss=0.402, pruned_loss=0.1595, over 4282636.25 frames. ], batch size: 548, lr: 3.09e-02, grad_scale: 32.0 2023-06-18 00:50:13,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=84420.0, ans=0.0 2023-06-18 00:50:20,513 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 00:50:34,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.03 vs. limit=15.0 2023-06-18 00:51:01,840 INFO [train.py:996] (2/4) Epoch 1, batch 14100, loss[loss=0.4124, simple_loss=0.4309, pruned_loss=0.197, over 21327.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.3965, pruned_loss=0.1593, over 4283021.36 frames. ], batch size: 549, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:51:03,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.54 vs. limit=22.5 2023-06-18 00:51:54,170 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 4.143e+02 4.965e+02 6.574e+02 1.166e+03, threshold=9.930e+02, percent-clipped=2.0 2023-06-18 00:51:54,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=84720.0, ans=0.1 2023-06-18 00:52:32,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.85 vs. limit=6.0 2023-06-18 00:52:37,619 INFO [train.py:996] (2/4) Epoch 1, batch 14150, loss[loss=0.3631, simple_loss=0.407, pruned_loss=0.1596, over 21199.00 frames. ], tot_loss[loss=0.3618, simple_loss=0.4009, pruned_loss=0.1613, over 4279402.20 frames. 
], batch size: 143, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:53:23,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=85020.0, ans=0.125 2023-06-18 00:54:01,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=85140.0, ans=0.125 2023-06-18 00:54:06,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=85140.0, ans=0.1 2023-06-18 00:54:17,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=85200.0, ans=0.125 2023-06-18 00:54:18,340 INFO [train.py:996] (2/4) Epoch 1, batch 14200, loss[loss=0.2898, simple_loss=0.3506, pruned_loss=0.1145, over 21487.00 frames. ], tot_loss[loss=0.3576, simple_loss=0.3974, pruned_loss=0.1589, over 4269455.60 frames. ], batch size: 194, lr: 3.08e-02, grad_scale: 32.0 2023-06-18 00:54:22,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=85200.0, ans=0.0 2023-06-18 00:54:24,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=85200.0, ans=0.125 2023-06-18 00:54:53,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-18 00:55:09,995 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 4.023e+02 4.862e+02 6.439e+02 1.166e+03, threshold=9.724e+02, percent-clipped=3.0 2023-06-18 00:55:20,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=85380.0, ans=0.125 2023-06-18 00:55:53,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.28 vs. limit=10.0 2023-06-18 00:55:59,283 INFO [train.py:996] (2/4) Epoch 1, batch 14250, loss[loss=0.3376, simple_loss=0.3903, pruned_loss=0.1424, over 21638.00 frames. ], tot_loss[loss=0.3531, simple_loss=0.3922, pruned_loss=0.157, over 4253939.66 frames. ], batch size: 391, lr: 3.07e-02, grad_scale: 32.0 2023-06-18 00:56:37,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=85620.0, ans=0.125 2023-06-18 00:57:01,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=85620.0, ans=0.125 2023-06-18 00:57:43,781 INFO [train.py:996] (2/4) Epoch 1, batch 14300, loss[loss=0.298, simple_loss=0.3248, pruned_loss=0.1356, over 20710.00 frames. ], tot_loss[loss=0.3539, simple_loss=0.394, pruned_loss=0.1569, over 4245178.20 frames. ], batch size: 607, lr: 3.07e-02, grad_scale: 32.0 2023-06-18 00:57:50,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.64 vs. 
limit=10.0 2023-06-18 00:58:11,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=85860.0, ans=0.125 2023-06-18 00:58:20,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=85860.0, ans=6.0 2023-06-18 00:58:31,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.74 vs. limit=22.5 2023-06-18 00:58:38,747 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.854e+02 5.533e+02 8.207e+02 1.409e+03, threshold=1.107e+03, percent-clipped=13.0 2023-06-18 00:58:41,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.35 vs. limit=6.0 2023-06-18 00:59:13,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=86040.0, ans=0.035 2023-06-18 00:59:26,788 INFO [train.py:996] (2/4) Epoch 1, batch 14350, loss[loss=0.3314, simple_loss=0.3712, pruned_loss=0.1458, over 21833.00 frames. ], tot_loss[loss=0.3559, simple_loss=0.3975, pruned_loss=0.1572, over 4244818.75 frames. ], batch size: 282, lr: 3.06e-02, grad_scale: 16.0 2023-06-18 00:59:39,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-18 00:59:47,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=86160.0, ans=0.125 2023-06-18 01:00:37,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=86280.0, ans=0.1 2023-06-18 01:01:08,533 INFO [train.py:996] (2/4) Epoch 1, batch 14400, loss[loss=0.363, simple_loss=0.3895, pruned_loss=0.1682, over 21315.00 frames. ], tot_loss[loss=0.3558, simple_loss=0.3953, pruned_loss=0.1582, over 4251201.60 frames. ], batch size: 176, lr: 3.06e-02, grad_scale: 32.0 2023-06-18 01:01:41,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=86460.0, ans=0.125 2023-06-18 01:02:08,114 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.871e+02 4.703e+02 5.738e+02 1.217e+03, threshold=9.407e+02, percent-clipped=2.0 2023-06-18 01:02:12,321 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-18 01:02:50,420 INFO [train.py:996] (2/4) Epoch 1, batch 14450, loss[loss=0.4231, simple_loss=0.4088, pruned_loss=0.2187, over 21589.00 frames. ], tot_loss[loss=0.3525, simple_loss=0.3889, pruned_loss=0.1581, over 4256658.16 frames. ], batch size: 508, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 01:02:55,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=86700.0, ans=0.95 2023-06-18 01:03:22,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=86760.0, ans=0.2 2023-06-18 01:03:25,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. 
limit=22.5 2023-06-18 01:04:25,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=15.0 2023-06-18 01:04:33,717 INFO [train.py:996] (2/4) Epoch 1, batch 14500, loss[loss=0.3214, simple_loss=0.3759, pruned_loss=0.1335, over 21790.00 frames. ], tot_loss[loss=0.3496, simple_loss=0.3859, pruned_loss=0.1566, over 4270096.54 frames. ], batch size: 371, lr: 3.05e-02, grad_scale: 32.0 2023-06-18 01:04:39,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=87000.0, ans=0.0 2023-06-18 01:04:55,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-18 01:04:56,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=87060.0, ans=0.125 2023-06-18 01:04:58,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=87060.0, ans=0.125 2023-06-18 01:05:36,055 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.944e+02 5.039e+02 7.491e+02 1.788e+03, threshold=1.008e+03, percent-clipped=13.0 2023-06-18 01:06:07,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=87240.0, ans=0.0 2023-06-18 01:06:17,597 INFO [train.py:996] (2/4) Epoch 1, batch 14550, loss[loss=0.5107, simple_loss=0.4997, pruned_loss=0.2609, over 21307.00 frames. ], tot_loss[loss=0.3606, simple_loss=0.3958, pruned_loss=0.1627, over 4267563.67 frames. ], batch size: 507, lr: 3.05e-02, grad_scale: 16.0 2023-06-18 01:07:54,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-18 01:08:01,712 INFO [train.py:996] (2/4) Epoch 1, batch 14600, loss[loss=0.3513, simple_loss=0.417, pruned_loss=0.1429, over 21699.00 frames. ], tot_loss[loss=0.3714, simple_loss=0.4056, pruned_loss=0.1686, over 4270628.56 frames. ], batch size: 263, lr: 3.04e-02, grad_scale: 16.0 2023-06-18 01:08:18,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-18 01:08:19,975 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=22.5 2023-06-18 01:09:02,467 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 4.017e+02 5.030e+02 6.430e+02 1.157e+03, threshold=1.006e+03, percent-clipped=2.0 2023-06-18 01:09:43,536 INFO [train.py:996] (2/4) Epoch 1, batch 14650, loss[loss=0.3604, simple_loss=0.4153, pruned_loss=0.1528, over 21837.00 frames. ], tot_loss[loss=0.3691, simple_loss=0.4052, pruned_loss=0.1665, over 4259795.45 frames. ], batch size: 371, lr: 3.04e-02, grad_scale: 16.0 2023-06-18 01:10:34,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-18 01:11:30,894 INFO [train.py:996] (2/4) Epoch 1, batch 14700, loss[loss=0.4531, simple_loss=0.4912, pruned_loss=0.2075, over 21671.00 frames. 
], tot_loss[loss=0.3557, simple_loss=0.3972, pruned_loss=0.1571, over 4258696.76 frames. ], batch size: 441, lr: 3.03e-02, grad_scale: 16.0 2023-06-18 01:11:50,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=88260.0, ans=10.0 2023-06-18 01:12:09,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=88260.0, ans=0.125 2023-06-18 01:12:19,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=88320.0, ans=0.125 2023-06-18 01:12:20,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=88320.0, ans=0.125 2023-06-18 01:12:32,567 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 3.580e+02 4.552e+02 5.267e+02 1.016e+03, threshold=9.103e+02, percent-clipped=1.0 2023-06-18 01:13:05,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=88440.0, ans=0.2 2023-06-18 01:13:08,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=88440.0, ans=0.125 2023-06-18 01:13:12,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.58 vs. limit=10.0 2023-06-18 01:13:14,656 INFO [train.py:996] (2/4) Epoch 1, batch 14750, loss[loss=0.4012, simple_loss=0.4411, pruned_loss=0.1807, over 21269.00 frames. ], tot_loss[loss=0.3654, simple_loss=0.4061, pruned_loss=0.1623, over 4269994.30 frames. ], batch size: 548, lr: 3.03e-02, grad_scale: 16.0 2023-06-18 01:14:16,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=88620.0, ans=0.125 2023-06-18 01:14:31,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=88680.0, ans=0.0 2023-06-18 01:14:59,912 INFO [train.py:996] (2/4) Epoch 1, batch 14800, loss[loss=0.3816, simple_loss=0.4102, pruned_loss=0.1765, over 20042.00 frames. ], tot_loss[loss=0.3799, simple_loss=0.4192, pruned_loss=0.1703, over 4267183.55 frames. ], batch size: 702, lr: 3.03e-02, grad_scale: 32.0 2023-06-18 01:15:44,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=88860.0, ans=0.2 2023-06-18 01:16:02,070 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.486e+02 4.489e+02 5.229e+02 7.110e+02 1.407e+03, threshold=1.046e+03, percent-clipped=11.0 2023-06-18 01:16:55,423 INFO [train.py:996] (2/4) Epoch 1, batch 14850, loss[loss=0.3808, simple_loss=0.4194, pruned_loss=0.1711, over 21728.00 frames. ], tot_loss[loss=0.3752, simple_loss=0.4117, pruned_loss=0.1694, over 4270129.48 frames. 
], batch size: 332, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 01:17:23,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=89160.0, ans=0.0 2023-06-18 01:17:27,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=89160.0, ans=0.125 2023-06-18 01:18:09,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=89280.0, ans=0.0 2023-06-18 01:18:41,328 INFO [train.py:996] (2/4) Epoch 1, batch 14900, loss[loss=0.3782, simple_loss=0.4122, pruned_loss=0.1721, over 21586.00 frames. ], tot_loss[loss=0.3794, simple_loss=0.4157, pruned_loss=0.1716, over 4278224.74 frames. ], batch size: 230, lr: 3.02e-02, grad_scale: 32.0 2023-06-18 01:19:33,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.31 vs. limit=6.0 2023-06-18 01:19:34,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.655e+02 4.060e+02 5.286e+02 6.323e+02 1.154e+03, threshold=1.057e+03, percent-clipped=2.0 2023-06-18 01:20:21,538 INFO [train.py:996] (2/4) Epoch 1, batch 14950, loss[loss=0.3529, simple_loss=0.3619, pruned_loss=0.1719, over 20685.00 frames. ], tot_loss[loss=0.3797, simple_loss=0.4174, pruned_loss=0.1711, over 4271289.77 frames. ], batch size: 607, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:20:25,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=89700.0, ans=0.0 2023-06-18 01:21:18,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=89820.0, ans=0.0 2023-06-18 01:21:33,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-18 01:21:58,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=89940.0, ans=0.04949747468305833 2023-06-18 01:22:04,801 INFO [train.py:996] (2/4) Epoch 1, batch 15000, loss[loss=0.3693, simple_loss=0.3958, pruned_loss=0.1714, over 21497.00 frames. ], tot_loss[loss=0.3833, simple_loss=0.4203, pruned_loss=0.1732, over 4276611.55 frames. ], batch size: 194, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:22:04,801 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 01:22:23,161 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3215, simple_loss=0.4085, pruned_loss=0.1173, over 1796401.00 frames. 
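The recurring "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." lines emitted by optim.py report the min/25%/50%/75%/max of recently observed gradient norms together with the clipping threshold in effect; in this run the logged threshold consistently works out to clipping_scale times the median quartile (for example 2.0 × 5.286e+02 ≈ 1.057e+03 in the entry above). The snippet below is only a minimal illustrative sketch of how such a report could be produced under that assumption; the helper name and the synthetic gradient norms are hypothetical and are not the project's actual optim.py code.

```python
import numpy as np

def clipping_report(grad_norms, clipping_scale=2.0):
    """Summarise recent gradient norms in the style of the log lines above.

    Assumption (inferred from the logged numbers, not from the source code):
    threshold = clipping_scale * median of the recent gradient norms.
    """
    norms = np.asarray(grad_norms, dtype=np.float64)
    # min, 25%, 50%, 75%, max -- the five "grad-norm quartiles" values in the log
    qs = np.quantile(norms, [0.0, 0.25, 0.5, 0.75, 1.0])
    threshold = clipping_scale * qs[2]
    percent_clipped = 100.0 * np.mean(norms > threshold)
    return qs, threshold, percent_clipped

# Synthetic norms roughly matching the magnitudes seen in this log (hundreds).
qs, thr, pc = clipping_report(np.random.lognormal(mean=6.2, sigma=0.35, size=128))
print(f"grad-norm quartiles {qs}, threshold={thr:.3e}, percent-clipped={pc:.1f}")
```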
2023-06-18 01:22:23,162 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 01:22:23,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=90000.0, ans=0.2 2023-06-18 01:22:53,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=90060.0, ans=0.125 2023-06-18 01:23:22,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=90120.0, ans=0.2 2023-06-18 01:23:25,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.727e+02 3.992e+02 4.836e+02 5.829e+02 8.010e+02, threshold=9.672e+02, percent-clipped=0.0 2023-06-18 01:23:51,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-18 01:24:11,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=90300.0, ans=0.1 2023-06-18 01:24:12,199 INFO [train.py:996] (2/4) Epoch 1, batch 15050, loss[loss=0.4133, simple_loss=0.4747, pruned_loss=0.176, over 21256.00 frames. ], tot_loss[loss=0.3855, simple_loss=0.4214, pruned_loss=0.1748, over 4279628.36 frames. ], batch size: 548, lr: 3.01e-02, grad_scale: 32.0 2023-06-18 01:24:27,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=90360.0, ans=0.2 2023-06-18 01:25:37,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=90540.0, ans=0.125 2023-06-18 01:25:48,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=90540.0, ans=0.0 2023-06-18 01:25:55,104 INFO [train.py:996] (2/4) Epoch 1, batch 15100, loss[loss=0.4275, simple_loss=0.4448, pruned_loss=0.2051, over 21908.00 frames. ], tot_loss[loss=0.3862, simple_loss=0.4234, pruned_loss=0.1745, over 4278509.84 frames. ], batch size: 316, lr: 3.00e-02, grad_scale: 32.0 2023-06-18 01:26:50,650 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 4.036e+02 5.408e+02 6.449e+02 1.241e+03, threshold=1.082e+03, percent-clipped=5.0 2023-06-18 01:27:05,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-18 01:27:32,772 INFO [train.py:996] (2/4) Epoch 1, batch 15150, loss[loss=0.3236, simple_loss=0.3547, pruned_loss=0.1463, over 21730.00 frames. ], tot_loss[loss=0.3848, simple_loss=0.42, pruned_loss=0.1748, over 4279362.51 frames. ], batch size: 124, lr: 3.00e-02, grad_scale: 32.0 2023-06-18 01:29:11,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=91140.0, ans=0.125 2023-06-18 01:29:11,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-18 01:29:15,458 INFO [train.py:996] (2/4) Epoch 1, batch 15200, loss[loss=0.2735, simple_loss=0.3122, pruned_loss=0.1174, over 21802.00 frames. ], tot_loss[loss=0.3738, simple_loss=0.411, pruned_loss=0.1684, over 4279756.02 frames. 
], batch size: 112, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:30:15,294 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.952e+02 4.981e+02 6.119e+02 1.167e+03, threshold=9.963e+02, percent-clipped=1.0 2023-06-18 01:30:15,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=91320.0, ans=0.125 2023-06-18 01:30:32,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=91380.0, ans=0.035 2023-06-18 01:30:36,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.07 vs. limit=15.0 2023-06-18 01:30:56,065 INFO [train.py:996] (2/4) Epoch 1, batch 15250, loss[loss=0.3639, simple_loss=0.3851, pruned_loss=0.1714, over 21417.00 frames. ], tot_loss[loss=0.3671, simple_loss=0.4043, pruned_loss=0.1649, over 4275124.60 frames. ], batch size: 194, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:30:58,810 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.04 vs. limit=15.0 2023-06-18 01:31:58,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=91620.0, ans=0.125 2023-06-18 01:32:21,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=22.5 2023-06-18 01:32:36,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=91740.0, ans=0.1 2023-06-18 01:32:38,853 INFO [train.py:996] (2/4) Epoch 1, batch 15300, loss[loss=0.2832, simple_loss=0.3094, pruned_loss=0.1285, over 20743.00 frames. ], tot_loss[loss=0.3709, simple_loss=0.4059, pruned_loss=0.168, over 4269428.66 frames. ], batch size: 609, lr: 2.99e-02, grad_scale: 32.0 2023-06-18 01:33:10,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=91860.0, ans=0.2 2023-06-18 01:33:28,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=91920.0, ans=0.1 2023-06-18 01:33:46,564 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.623e+02 4.254e+02 5.015e+02 5.905e+02 1.167e+03, threshold=1.003e+03, percent-clipped=1.0 2023-06-18 01:34:28,066 INFO [train.py:996] (2/4) Epoch 1, batch 15350, loss[loss=0.3817, simple_loss=0.414, pruned_loss=0.1747, over 21890.00 frames. ], tot_loss[loss=0.3765, simple_loss=0.411, pruned_loss=0.171, over 4269874.44 frames. ], batch size: 371, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 01:35:11,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=92220.0, ans=0.125 2023-06-18 01:35:20,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=92220.0, ans=0.1 2023-06-18 01:35:33,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=92280.0, ans=0.125 2023-06-18 01:35:57,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.33 vs. 
limit=15.0 2023-06-18 01:36:04,754 INFO [train.py:996] (2/4) Epoch 1, batch 15400, loss[loss=0.336, simple_loss=0.376, pruned_loss=0.148, over 21702.00 frames. ], tot_loss[loss=0.3726, simple_loss=0.4098, pruned_loss=0.1676, over 4275154.40 frames. ], batch size: 230, lr: 2.98e-02, grad_scale: 32.0 2023-06-18 01:36:11,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=92400.0, ans=0.125 2023-06-18 01:36:22,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-18 01:37:10,694 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.998e+02 4.934e+02 5.907e+02 9.449e+02, threshold=9.868e+02, percent-clipped=0.0 2023-06-18 01:37:14,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=92580.0, ans=0.125 2023-06-18 01:37:26,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=92580.0, ans=0.2 2023-06-18 01:37:42,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=92640.0, ans=0.05 2023-06-18 01:37:46,514 INFO [train.py:996] (2/4) Epoch 1, batch 15450, loss[loss=0.34, simple_loss=0.4059, pruned_loss=0.1371, over 21824.00 frames. ], tot_loss[loss=0.3678, simple_loss=0.4057, pruned_loss=0.165, over 4273489.37 frames. ], batch size: 351, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 01:37:50,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-18 01:38:58,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=92880.0, ans=0.125 2023-06-18 01:39:19,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=92940.0, ans=0.2 2023-06-18 01:39:35,855 INFO [train.py:996] (2/4) Epoch 1, batch 15500, loss[loss=0.4979, simple_loss=0.4984, pruned_loss=0.2487, over 21345.00 frames. ], tot_loss[loss=0.368, simple_loss=0.4077, pruned_loss=0.1642, over 4248680.58 frames. ], batch size: 507, lr: 2.97e-02, grad_scale: 32.0 2023-06-18 01:40:37,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=93120.0, ans=0.0 2023-06-18 01:40:39,760 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.584e+02 4.777e+02 6.158e+02 1.272e+03, threshold=9.553e+02, percent-clipped=7.0 2023-06-18 01:41:20,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=93240.0, ans=0.125 2023-06-18 01:41:29,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=93300.0, ans=0.125 2023-06-18 01:41:30,368 INFO [train.py:996] (2/4) Epoch 1, batch 15550, loss[loss=0.3152, simple_loss=0.3568, pruned_loss=0.1368, over 21359.00 frames. ], tot_loss[loss=0.3627, simple_loss=0.4056, pruned_loss=0.1598, over 4258329.17 frames. 
], batch size: 131, lr: 2.97e-02, grad_scale: 16.0 2023-06-18 01:41:42,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=93300.0, ans=0.0 2023-06-18 01:41:59,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=93360.0, ans=0.2 2023-06-18 01:42:07,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=93420.0, ans=0.125 2023-06-18 01:42:11,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=93420.0, ans=0.07 2023-06-18 01:42:31,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=93480.0, ans=0.2 2023-06-18 01:43:13,886 INFO [train.py:996] (2/4) Epoch 1, batch 15600, loss[loss=0.4447, simple_loss=0.4487, pruned_loss=0.2203, over 21367.00 frames. ], tot_loss[loss=0.3573, simple_loss=0.3985, pruned_loss=0.158, over 4259440.12 frames. ], batch size: 508, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 01:43:18,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.31 vs. limit=15.0 2023-06-18 01:43:25,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=93600.0, ans=0.2 2023-06-18 01:43:49,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=93660.0, ans=0.0 2023-06-18 01:43:58,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=15.0 2023-06-18 01:44:06,396 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.574e+02 3.769e+02 4.800e+02 6.009e+02 1.224e+03, threshold=9.599e+02, percent-clipped=5.0 2023-06-18 01:44:12,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=93780.0, ans=0.95 2023-06-18 01:44:56,286 INFO [train.py:996] (2/4) Epoch 1, batch 15650, loss[loss=0.3181, simple_loss=0.3587, pruned_loss=0.1387, over 21762.00 frames. ], tot_loss[loss=0.3553, simple_loss=0.3972, pruned_loss=0.1567, over 4253669.14 frames. ], batch size: 112, lr: 2.96e-02, grad_scale: 32.0 2023-06-18 01:45:31,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=93960.0, ans=0.0 2023-06-18 01:45:53,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=94080.0, ans=15.0 2023-06-18 01:46:15,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=94140.0, ans=0.125 2023-06-18 01:46:20,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=94140.0, ans=0.125 2023-06-18 01:46:27,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=94140.0, ans=0.0 2023-06-18 01:46:36,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. 
limit=15.0 2023-06-18 01:46:38,863 INFO [train.py:996] (2/4) Epoch 1, batch 15700, loss[loss=0.3124, simple_loss=0.354, pruned_loss=0.1354, over 20777.00 frames. ], tot_loss[loss=0.3516, simple_loss=0.392, pruned_loss=0.1555, over 4249587.24 frames. ], batch size: 608, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:47:27,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=94320.0, ans=0.1 2023-06-18 01:47:31,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.759e+02 5.241e+02 6.627e+02 1.144e+03, threshold=1.048e+03, percent-clipped=4.0 2023-06-18 01:47:41,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=94380.0, ans=0.125 2023-06-18 01:48:21,253 INFO [train.py:996] (2/4) Epoch 1, batch 15750, loss[loss=0.299, simple_loss=0.3483, pruned_loss=0.1249, over 21688.00 frames. ], tot_loss[loss=0.3475, simple_loss=0.386, pruned_loss=0.1544, over 4254441.55 frames. ], batch size: 316, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:49:01,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-18 01:50:03,636 INFO [train.py:996] (2/4) Epoch 1, batch 15800, loss[loss=0.3717, simple_loss=0.3974, pruned_loss=0.173, over 21302.00 frames. ], tot_loss[loss=0.3444, simple_loss=0.3805, pruned_loss=0.1542, over 4257334.41 frames. ], batch size: 159, lr: 2.95e-02, grad_scale: 32.0 2023-06-18 01:50:37,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=94860.0, ans=0.125 2023-06-18 01:50:47,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=94920.0, ans=0.07 2023-06-18 01:50:52,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=94920.0, ans=0.125 2023-06-18 01:51:07,133 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.489e+02 4.310e+02 5.461e+02 1.002e+03, threshold=8.621e+02, percent-clipped=0.0 2023-06-18 01:51:12,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=94980.0, ans=0.125 2023-06-18 01:51:43,435 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-18 01:51:47,045 INFO [train.py:996] (2/4) Epoch 1, batch 15850, loss[loss=0.4821, simple_loss=0.4798, pruned_loss=0.2422, over 21742.00 frames. ], tot_loss[loss=0.3495, simple_loss=0.3836, pruned_loss=0.1577, over 4259030.93 frames. ], batch size: 441, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:51:47,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=95100.0, ans=0.0 2023-06-18 01:52:06,543 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.55 vs. 
limit=15.0 2023-06-18 01:53:08,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=95340.0, ans=0.125 2023-06-18 01:53:27,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-18 01:53:31,129 INFO [train.py:996] (2/4) Epoch 1, batch 15900, loss[loss=0.3247, simple_loss=0.3584, pruned_loss=0.1455, over 21873.00 frames. ], tot_loss[loss=0.3515, simple_loss=0.3841, pruned_loss=0.1595, over 4264452.23 frames. ], batch size: 118, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:54:01,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=95460.0, ans=0.125 2023-06-18 01:54:13,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.77 vs. limit=22.5 2023-06-18 01:54:28,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.552e+02 4.278e+02 5.211e+02 7.153e+02 1.346e+03, threshold=1.042e+03, percent-clipped=13.0 2023-06-18 01:54:45,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.04 vs. limit=15.0 2023-06-18 01:55:06,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=95700.0, ans=0.125 2023-06-18 01:55:07,513 INFO [train.py:996] (2/4) Epoch 1, batch 15950, loss[loss=0.3438, simple_loss=0.3873, pruned_loss=0.1502, over 21578.00 frames. ], tot_loss[loss=0.3485, simple_loss=0.3841, pruned_loss=0.1564, over 4250790.71 frames. ], batch size: 263, lr: 2.94e-02, grad_scale: 32.0 2023-06-18 01:55:40,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95760.0, ans=0.1 2023-06-18 01:55:41,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.96 vs. limit=15.0 2023-06-18 01:55:51,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-18 01:55:55,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=95820.0, ans=0.1 2023-06-18 01:56:38,460 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 01:56:48,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-06-18 01:56:54,412 INFO [train.py:996] (2/4) Epoch 1, batch 16000, loss[loss=0.3583, simple_loss=0.4154, pruned_loss=0.1506, over 21665.00 frames. ], tot_loss[loss=0.3452, simple_loss=0.3852, pruned_loss=0.1526, over 4264423.09 frames. 
], batch size: 389, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 01:56:54,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=96000.0, ans=0.1 2023-06-18 01:57:17,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=96000.0, ans=0.125 2023-06-18 01:57:19,491 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-18 01:57:20,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=96060.0, ans=0.05 2023-06-18 01:57:22,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=96060.0, ans=0.125 2023-06-18 01:57:43,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=96120.0, ans=0.0 2023-06-18 01:57:52,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 3.630e+02 4.249e+02 5.497e+02 1.232e+03, threshold=8.498e+02, percent-clipped=2.0 2023-06-18 01:57:57,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96180.0, ans=0.1 2023-06-18 01:58:30,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=96240.0, ans=0.125 2023-06-18 01:58:31,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=96240.0, ans=0.125 2023-06-18 01:58:35,943 INFO [train.py:996] (2/4) Epoch 1, batch 16050, loss[loss=0.3411, simple_loss=0.4064, pruned_loss=0.1379, over 21384.00 frames. ], tot_loss[loss=0.3474, simple_loss=0.3911, pruned_loss=0.1519, over 4260150.55 frames. ], batch size: 211, lr: 2.93e-02, grad_scale: 32.0 2023-06-18 01:59:11,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=96360.0, ans=0.1 2023-06-18 01:59:13,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=96360.0, ans=0.07 2023-06-18 01:59:18,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=18.60 vs. limit=22.5 2023-06-18 01:59:37,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=96480.0, ans=0.125 2023-06-18 02:00:02,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=96540.0, ans=0.125 2023-06-18 02:00:17,300 INFO [train.py:996] (2/4) Epoch 1, batch 16100, loss[loss=0.3648, simple_loss=0.4007, pruned_loss=0.1644, over 21871.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3968, pruned_loss=0.1539, over 4262952.28 frames. 
], batch size: 124, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:00:25,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_ff3.min_abs, batch_count=96600.0, ans=0.2 2023-06-18 02:01:14,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.462e+02 3.670e+02 4.719e+02 5.843e+02 1.104e+03, threshold=9.438e+02, percent-clipped=5.0 2023-06-18 02:01:59,172 INFO [train.py:996] (2/4) Epoch 1, batch 16150, loss[loss=0.3835, simple_loss=0.4741, pruned_loss=0.1464, over 20853.00 frames. ], tot_loss[loss=0.3565, simple_loss=0.3985, pruned_loss=0.1573, over 4273755.94 frames. ], batch size: 608, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:02:06,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=96900.0, ans=0.125 2023-06-18 02:02:47,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=97020.0, ans=0.0 2023-06-18 02:02:47,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=97020.0, ans=0.125 2023-06-18 02:03:04,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=97080.0, ans=0.2 2023-06-18 02:03:36,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-18 02:03:45,785 INFO [train.py:996] (2/4) Epoch 1, batch 16200, loss[loss=0.4114, simple_loss=0.4383, pruned_loss=0.1922, over 21338.00 frames. ], tot_loss[loss=0.362, simple_loss=0.4032, pruned_loss=0.1604, over 4279250.65 frames. ], batch size: 548, lr: 2.92e-02, grad_scale: 32.0 2023-06-18 02:04:30,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=22.5 2023-06-18 02:04:33,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.26 vs. limit=22.5 2023-06-18 02:04:34,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=97320.0, ans=0.125 2023-06-18 02:04:49,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.675e+02 4.045e+02 5.128e+02 6.271e+02 1.195e+03, threshold=1.026e+03, percent-clipped=3.0 2023-06-18 02:05:34,474 INFO [train.py:996] (2/4) Epoch 1, batch 16250, loss[loss=0.3098, simple_loss=0.3457, pruned_loss=0.137, over 21443.00 frames. ], tot_loss[loss=0.3586, simple_loss=0.3995, pruned_loss=0.1589, over 4277352.57 frames. ], batch size: 194, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:06:34,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=97680.0, ans=0.2 2023-06-18 02:06:36,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=97680.0, ans=0.95 2023-06-18 02:06:45,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.84 vs. 
limit=22.5 2023-06-18 02:06:48,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=97680.0, ans=0.1 2023-06-18 02:07:23,965 INFO [train.py:996] (2/4) Epoch 1, batch 16300, loss[loss=0.321, simple_loss=0.3686, pruned_loss=0.1367, over 21257.00 frames. ], tot_loss[loss=0.3484, simple_loss=0.3907, pruned_loss=0.153, over 4263867.77 frames. ], batch size: 549, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:07:26,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=97800.0, ans=0.0 2023-06-18 02:08:17,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.881e+02 3.435e+02 4.309e+02 5.263e+02 1.274e+03, threshold=8.618e+02, percent-clipped=4.0 2023-06-18 02:08:39,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=98040.0, ans=0.09899494936611666 2023-06-18 02:09:08,889 INFO [train.py:996] (2/4) Epoch 1, batch 16350, loss[loss=0.2992, simple_loss=0.3477, pruned_loss=0.1254, over 21605.00 frames. ], tot_loss[loss=0.3478, simple_loss=0.3898, pruned_loss=0.1529, over 4258077.76 frames. ], batch size: 263, lr: 2.91e-02, grad_scale: 32.0 2023-06-18 02:09:10,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.82 vs. limit=15.0 2023-06-18 02:09:46,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-18 02:10:22,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=98340.0, ans=0.125 2023-06-18 02:10:43,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-06-18 02:10:51,217 INFO [train.py:996] (2/4) Epoch 1, batch 16400, loss[loss=0.3862, simple_loss=0.4358, pruned_loss=0.1683, over 19944.00 frames. ], tot_loss[loss=0.3542, simple_loss=0.3957, pruned_loss=0.1563, over 4260906.62 frames. ], batch size: 703, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 02:11:07,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=98460.0, ans=0.04949747468305833 2023-06-18 02:11:25,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=98520.0, ans=0.125 2023-06-18 02:11:31,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98520.0, ans=0.1 2023-06-18 02:11:44,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-18 02:11:47,766 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 3.743e+02 5.331e+02 6.637e+02 1.239e+03, threshold=1.066e+03, percent-clipped=10.0 2023-06-18 02:11:58,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=98580.0, ans=0.0 2023-06-18 02:12:00,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. 
limit=15.0 2023-06-18 02:12:17,535 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-18 02:12:33,627 INFO [train.py:996] (2/4) Epoch 1, batch 16450, loss[loss=0.3339, simple_loss=0.3728, pruned_loss=0.1475, over 21870.00 frames. ], tot_loss[loss=0.3554, simple_loss=0.3962, pruned_loss=0.1574, over 4263510.88 frames. ], batch size: 282, lr: 2.90e-02, grad_scale: 32.0 2023-06-18 02:12:34,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=98700.0, ans=0.125 2023-06-18 02:12:45,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98700.0, ans=0.1 2023-06-18 02:12:52,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=98760.0, ans=0.5 2023-06-18 02:13:17,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.35 vs. limit=10.0 2023-06-18 02:13:32,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=98880.0, ans=0.1 2023-06-18 02:14:17,027 INFO [train.py:996] (2/4) Epoch 1, batch 16500, loss[loss=0.3436, simple_loss=0.3827, pruned_loss=0.1522, over 21716.00 frames. ], tot_loss[loss=0.3549, simple_loss=0.3959, pruned_loss=0.157, over 4268905.45 frames. ], batch size: 298, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:14:19,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=99000.0, ans=0.125 2023-06-18 02:14:33,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=99060.0, ans=0.125 2023-06-18 02:14:37,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=17.20 vs. limit=22.5 2023-06-18 02:15:16,396 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.870e+02 4.027e+02 4.822e+02 5.863e+02 1.078e+03, threshold=9.645e+02, percent-clipped=1.0 2023-06-18 02:15:18,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=99180.0, ans=0.125 2023-06-18 02:15:43,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=99240.0, ans=0.0 2023-06-18 02:15:48,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=99240.0, ans=0.1 2023-06-18 02:15:48,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=99240.0, ans=0.2 2023-06-18 02:16:01,048 INFO [train.py:996] (2/4) Epoch 1, batch 16550, loss[loss=0.3424, simple_loss=0.3778, pruned_loss=0.1535, over 21798.00 frames. ], tot_loss[loss=0.3487, simple_loss=0.3919, pruned_loss=0.1527, over 4272291.92 frames. 
], batch size: 124, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:16:04,798 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:17:29,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=99540.0, ans=0.125 2023-06-18 02:17:29,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=99540.0, ans=0.125 2023-06-18 02:17:45,760 INFO [train.py:996] (2/4) Epoch 1, batch 16600, loss[loss=0.387, simple_loss=0.4488, pruned_loss=0.1626, over 21274.00 frames. ], tot_loss[loss=0.3623, simple_loss=0.4051, pruned_loss=0.1597, over 4276730.19 frames. ], batch size: 548, lr: 2.89e-02, grad_scale: 32.0 2023-06-18 02:17:46,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=99600.0, ans=0.0 2023-06-18 02:18:31,005 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-18 02:18:44,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=99720.0, ans=0.125 2023-06-18 02:18:49,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=99720.0, ans=0.2 2023-06-18 02:18:52,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=99720.0, ans=0.0 2023-06-18 02:18:55,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.830e+02 4.483e+02 5.513e+02 7.810e+02 1.353e+03, threshold=1.103e+03, percent-clipped=10.0 2023-06-18 02:19:40,962 INFO [train.py:996] (2/4) Epoch 1, batch 16650, loss[loss=0.3631, simple_loss=0.4085, pruned_loss=0.1588, over 21799.00 frames. ], tot_loss[loss=0.3704, simple_loss=0.4153, pruned_loss=0.1628, over 4281198.79 frames. ], batch size: 247, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:20:27,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=100020.0, ans=0.125 2023-06-18 02:21:15,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=22.5 2023-06-18 02:21:28,231 INFO [train.py:996] (2/4) Epoch 1, batch 16700, loss[loss=0.2687, simple_loss=0.3163, pruned_loss=0.1105, over 21248.00 frames. ], tot_loss[loss=0.371, simple_loss=0.4157, pruned_loss=0.1632, over 4274173.77 frames. ], batch size: 176, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:21:40,897 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:22:00,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.67 vs. limit=15.0 2023-06-18 02:22:34,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 4.065e+02 5.046e+02 6.706e+02 1.129e+03, threshold=1.009e+03, percent-clipped=1.0 2023-06-18 02:23:09,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. 
limit=15.0 2023-06-18 02:23:27,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2023-06-18 02:23:27,866 INFO [train.py:996] (2/4) Epoch 1, batch 16750, loss[loss=0.4587, simple_loss=0.4936, pruned_loss=0.2119, over 21492.00 frames. ], tot_loss[loss=0.3763, simple_loss=0.4188, pruned_loss=0.1669, over 4274427.62 frames. ], batch size: 471, lr: 2.88e-02, grad_scale: 32.0 2023-06-18 02:23:45,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=22.5 2023-06-18 02:24:13,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=100620.0, ans=0.0 2023-06-18 02:24:24,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=100620.0, ans=0.1 2023-06-18 02:25:05,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=100740.0, ans=10.0 2023-06-18 02:25:11,930 INFO [train.py:996] (2/4) Epoch 1, batch 16800, loss[loss=0.3655, simple_loss=0.3921, pruned_loss=0.1694, over 21333.00 frames. ], tot_loss[loss=0.3775, simple_loss=0.4225, pruned_loss=0.1663, over 4269884.99 frames. ], batch size: 159, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:25:13,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=100800.0, ans=0.125 2023-06-18 02:26:08,493 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 4.332e+02 5.464e+02 7.061e+02 1.204e+03, threshold=1.093e+03, percent-clipped=8.0 2023-06-18 02:26:51,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=101040.0, ans=0.125 2023-06-18 02:26:54,560 INFO [train.py:996] (2/4) Epoch 1, batch 16850, loss[loss=0.3688, simple_loss=0.3875, pruned_loss=0.1751, over 21648.00 frames. ], tot_loss[loss=0.3768, simple_loss=0.4194, pruned_loss=0.1671, over 4275610.19 frames. ], batch size: 230, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:27:27,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=101160.0, ans=0.2 2023-06-18 02:27:42,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=101220.0, ans=0.07 2023-06-18 02:27:49,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=12.0 2023-06-18 02:28:37,456 INFO [train.py:996] (2/4) Epoch 1, batch 16900, loss[loss=0.3873, simple_loss=0.407, pruned_loss=0.1838, over 21485.00 frames. ], tot_loss[loss=0.3705, simple_loss=0.4116, pruned_loss=0.1647, over 4286086.59 frames. ], batch size: 508, lr: 2.87e-02, grad_scale: 32.0 2023-06-18 02:28:45,008 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=11.74 vs. 
limit=15.0 2023-06-18 02:28:47,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=101400.0, ans=0.1 2023-06-18 02:29:39,427 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.564e+02 4.278e+02 5.386e+02 1.254e+03, threshold=8.556e+02, percent-clipped=1.0 2023-06-18 02:30:02,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=101640.0, ans=0.125 2023-06-18 02:30:19,204 INFO [train.py:996] (2/4) Epoch 1, batch 16950, loss[loss=0.3584, simple_loss=0.3864, pruned_loss=0.1652, over 21860.00 frames. ], tot_loss[loss=0.3637, simple_loss=0.4029, pruned_loss=0.1623, over 4275116.28 frames. ], batch size: 371, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:30:24,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=101700.0, ans=0.0 2023-06-18 02:30:53,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=101760.0, ans=0.0 2023-06-18 02:30:57,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=101760.0, ans=0.125 2023-06-18 02:30:57,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=101760.0, ans=0.125 2023-06-18 02:32:00,712 INFO [train.py:996] (2/4) Epoch 1, batch 17000, loss[loss=0.3843, simple_loss=0.4104, pruned_loss=0.1791, over 21840.00 frames. ], tot_loss[loss=0.3613, simple_loss=0.3987, pruned_loss=0.162, over 4285604.11 frames. ], batch size: 107, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:32:33,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=102060.0, ans=0.0 2023-06-18 02:32:55,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=102120.0, ans=0.125 2023-06-18 02:33:09,879 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.976e+02 5.576e+02 8.538e+02 1.340e+03, threshold=1.115e+03, percent-clipped=23.0 2023-06-18 02:33:28,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=102240.0, ans=0.0 2023-06-18 02:33:44,515 INFO [train.py:996] (2/4) Epoch 1, batch 17050, loss[loss=0.3737, simple_loss=0.4118, pruned_loss=0.1678, over 21402.00 frames. ], tot_loss[loss=0.3707, simple_loss=0.4077, pruned_loss=0.1668, over 4294529.49 frames. ], batch size: 131, lr: 2.86e-02, grad_scale: 32.0 2023-06-18 02:34:01,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=102300.0, ans=0.0 2023-06-18 02:34:51,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=102420.0, ans=0.1 2023-06-18 02:34:56,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=102480.0, ans=0.2 2023-06-18 02:35:26,153 INFO [train.py:996] (2/4) Epoch 1, batch 17100, loss[loss=0.3673, simple_loss=0.3948, pruned_loss=0.1699, over 21662.00 frames. ], tot_loss[loss=0.3695, simple_loss=0.4066, pruned_loss=0.1662, over 4292378.89 frames. 
], batch size: 230, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 02:36:08,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=102720.0, ans=0.0 2023-06-18 02:36:28,704 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.900e+02 4.752e+02 6.955e+02 1.664e+03, threshold=9.503e+02, percent-clipped=6.0 2023-06-18 02:36:53,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=102840.0, ans=0.125 2023-06-18 02:37:07,592 INFO [train.py:996] (2/4) Epoch 1, batch 17150, loss[loss=0.3582, simple_loss=0.3815, pruned_loss=0.1674, over 21853.00 frames. ], tot_loss[loss=0.3639, simple_loss=0.4003, pruned_loss=0.1638, over 4298718.57 frames. ], batch size: 351, lr: 2.85e-02, grad_scale: 32.0 2023-06-18 02:37:29,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=102960.0, ans=0.125 2023-06-18 02:38:11,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=103080.0, ans=0.125 2023-06-18 02:38:51,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=103200.0, ans=0.0 2023-06-18 02:38:52,554 INFO [train.py:996] (2/4) Epoch 1, batch 17200, loss[loss=0.4454, simple_loss=0.4564, pruned_loss=0.2172, over 21416.00 frames. ], tot_loss[loss=0.3641, simple_loss=0.4001, pruned_loss=0.164, over 4300154.48 frames. ], batch size: 471, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:39:11,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=103260.0, ans=0.2 2023-06-18 02:39:11,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.78 vs. limit=10.0 2023-06-18 02:39:44,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=103320.0, ans=0.125 2023-06-18 02:39:45,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 3.944e+02 4.838e+02 6.225e+02 9.968e+02, threshold=9.676e+02, percent-clipped=1.0 2023-06-18 02:39:51,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=103380.0, ans=0.0 2023-06-18 02:40:31,383 INFO [train.py:996] (2/4) Epoch 1, batch 17250, loss[loss=0.4116, simple_loss=0.4485, pruned_loss=0.1873, over 21473.00 frames. ], tot_loss[loss=0.3695, simple_loss=0.4056, pruned_loss=0.1667, over 4291893.05 frames. ], batch size: 211, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:40:42,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-18 02:40:56,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=103560.0, ans=0.125 2023-06-18 02:41:09,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=103620.0, ans=0.2 2023-06-18 02:41:31,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.06 vs. 
limit=6.0 2023-06-18 02:41:36,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=103680.0, ans=0.07 2023-06-18 02:42:10,782 INFO [train.py:996] (2/4) Epoch 1, batch 17300, loss[loss=0.4168, simple_loss=0.4467, pruned_loss=0.1934, over 21696.00 frames. ], tot_loss[loss=0.3791, simple_loss=0.4152, pruned_loss=0.1715, over 4290003.99 frames. ], batch size: 351, lr: 2.84e-02, grad_scale: 32.0 2023-06-18 02:42:20,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=103800.0, ans=0.125 2023-06-18 02:42:29,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=103800.0, ans=0.125 2023-06-18 02:42:49,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=103920.0, ans=0.125 2023-06-18 02:43:16,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=103980.0, ans=0.0 2023-06-18 02:43:17,411 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.798e+02 4.894e+02 6.344e+02 1.044e+03, threshold=9.789e+02, percent-clipped=2.0 2023-06-18 02:43:56,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=104040.0, ans=0.0 2023-06-18 02:44:02,188 INFO [train.py:996] (2/4) Epoch 1, batch 17350, loss[loss=0.3669, simple_loss=0.4108, pruned_loss=0.1614, over 21789.00 frames. ], tot_loss[loss=0.3803, simple_loss=0.4173, pruned_loss=0.1716, over 4281785.44 frames. ], batch size: 282, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:45:07,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=104280.0, ans=0.125 2023-06-18 02:45:47,347 INFO [train.py:996] (2/4) Epoch 1, batch 17400, loss[loss=0.2821, simple_loss=0.3186, pruned_loss=0.1228, over 21279.00 frames. ], tot_loss[loss=0.3719, simple_loss=0.4121, pruned_loss=0.1659, over 4280010.10 frames. ], batch size: 159, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:46:02,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=104400.0, ans=0.125 2023-06-18 02:46:04,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=104400.0, ans=0.0 2023-06-18 02:46:07,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=104460.0, ans=0.035 2023-06-18 02:46:42,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=104520.0, ans=0.1 2023-06-18 02:46:46,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. 
limit=15.0 2023-06-18 02:46:47,234 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 3.665e+02 5.087e+02 7.278e+02 1.204e+03, threshold=1.017e+03, percent-clipped=4.0 2023-06-18 02:46:55,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=104580.0, ans=0.0 2023-06-18 02:47:21,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=104640.0, ans=0.125 2023-06-18 02:47:30,818 INFO [train.py:996] (2/4) Epoch 1, batch 17450, loss[loss=0.2818, simple_loss=0.3641, pruned_loss=0.0998, over 21566.00 frames. ], tot_loss[loss=0.3624, simple_loss=0.4048, pruned_loss=0.16, over 4273399.99 frames. ], batch size: 389, lr: 2.83e-02, grad_scale: 16.0 2023-06-18 02:47:35,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-06-18 02:47:54,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=104760.0, ans=0.0 2023-06-18 02:48:19,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=104820.0, ans=0.0 2023-06-18 02:48:36,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=104880.0, ans=0.125 2023-06-18 02:48:51,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=104940.0, ans=0.0 2023-06-18 02:49:07,404 INFO [train.py:996] (2/4) Epoch 1, batch 17500, loss[loss=0.341, simple_loss=0.372, pruned_loss=0.155, over 21535.00 frames. ], tot_loss[loss=0.355, simple_loss=0.3982, pruned_loss=0.1559, over 4277868.87 frames. ], batch size: 548, lr: 2.82e-02, grad_scale: 16.0 2023-06-18 02:50:12,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 3.016e+02 3.973e+02 5.519e+02 1.327e+03, threshold=7.947e+02, percent-clipped=4.0 2023-06-18 02:50:41,946 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:50:42,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=105240.0, ans=0.125 2023-06-18 02:50:49,639 INFO [train.py:996] (2/4) Epoch 1, batch 17550, loss[loss=0.2947, simple_loss=0.3663, pruned_loss=0.1116, over 21622.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3974, pruned_loss=0.1536, over 4264045.66 frames. ], batch size: 230, lr: 2.82e-02, grad_scale: 16.0 2023-06-18 02:51:23,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=105360.0, ans=0.1 2023-06-18 02:52:03,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=105480.0, ans=0.07 2023-06-18 02:52:04,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-18 02:52:06,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=105480.0, ans=0.125 2023-06-18 02:52:32,344 INFO [train.py:996] (2/4) Epoch 1, batch 17600, loss[loss=0.4022, simple_loss=0.443, pruned_loss=0.1806, over 21801.00 frames. 
], tot_loss[loss=0.356, simple_loss=0.4008, pruned_loss=0.1556, over 4265126.48 frames. ], batch size: 124, lr: 2.82e-02, grad_scale: 32.0 2023-06-18 02:52:33,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=105600.0, ans=0.125 2023-06-18 02:52:54,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=105660.0, ans=0.0 2023-06-18 02:53:36,625 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 3.619e+02 4.865e+02 6.950e+02 1.496e+03, threshold=9.730e+02, percent-clipped=22.0 2023-06-18 02:53:38,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=105780.0, ans=0.0 2023-06-18 02:54:19,867 INFO [train.py:996] (2/4) Epoch 1, batch 17650, loss[loss=0.2684, simple_loss=0.3223, pruned_loss=0.1073, over 21760.00 frames. ], tot_loss[loss=0.3538, simple_loss=0.3974, pruned_loss=0.1551, over 4264829.38 frames. ], batch size: 282, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:55:03,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=106020.0, ans=0.05 2023-06-18 02:55:24,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-18 02:55:26,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=106080.0, ans=0.0 2023-06-18 02:55:40,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106140.0, ans=0.1 2023-06-18 02:55:44,880 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 02:56:02,796 INFO [train.py:996] (2/4) Epoch 1, batch 17700, loss[loss=0.2577, simple_loss=0.273, pruned_loss=0.1212, over 16608.00 frames. ], tot_loss[loss=0.3436, simple_loss=0.3889, pruned_loss=0.1492, over 4252697.22 frames. ], batch size: 61, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:56:57,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.59 vs. limit=22.5 2023-06-18 02:57:07,758 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.749e+02 4.413e+02 5.536e+02 1.027e+03, threshold=8.827e+02, percent-clipped=1.0 2023-06-18 02:57:45,709 INFO [train.py:996] (2/4) Epoch 1, batch 17750, loss[loss=0.3849, simple_loss=0.4263, pruned_loss=0.1718, over 21480.00 frames. ], tot_loss[loss=0.357, simple_loss=0.4004, pruned_loss=0.1568, over 4258645.20 frames. 
], batch size: 112, lr: 2.81e-02, grad_scale: 32.0 2023-06-18 02:58:47,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=106620.0, ans=0.0 2023-06-18 02:58:47,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=106620.0, ans=0.125 2023-06-18 02:59:15,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106740.0, ans=0.1 2023-06-18 02:59:27,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=106740.0, ans=0.0 2023-06-18 02:59:30,632 INFO [train.py:996] (2/4) Epoch 1, batch 17800, loss[loss=0.3162, simple_loss=0.3648, pruned_loss=0.1338, over 21637.00 frames. ], tot_loss[loss=0.3551, simple_loss=0.3993, pruned_loss=0.1555, over 4262167.88 frames. ], batch size: 263, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 02:59:50,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=106800.0, ans=0.1 2023-06-18 03:00:11,907 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=9.106e-03 2023-06-18 03:00:41,772 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.616e+02 4.998e+02 5.902e+02 1.082e+03, threshold=9.996e+02, percent-clipped=5.0 2023-06-18 03:00:50,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=106980.0, ans=0.0 2023-06-18 03:01:12,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107040.0, ans=0.1 2023-06-18 03:01:25,753 INFO [train.py:996] (2/4) Epoch 1, batch 17850, loss[loss=0.3874, simple_loss=0.4239, pruned_loss=0.1755, over 21698.00 frames. ], tot_loss[loss=0.3558, simple_loss=0.4006, pruned_loss=0.1555, over 4261592.47 frames. ], batch size: 351, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 03:01:40,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.68 vs. limit=10.0 2023-06-18 03:01:41,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=107160.0, ans=0.125 2023-06-18 03:02:01,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=107160.0, ans=0.02 2023-06-18 03:02:03,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=107220.0, ans=0.04949747468305833 2023-06-18 03:03:02,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=107340.0, ans=0.125 2023-06-18 03:03:11,363 INFO [train.py:996] (2/4) Epoch 1, batch 17900, loss[loss=0.3422, simple_loss=0.4037, pruned_loss=0.1403, over 21223.00 frames. ], tot_loss[loss=0.3631, simple_loss=0.4069, pruned_loss=0.1596, over 4267456.73 frames. 
], batch size: 176, lr: 2.80e-02, grad_scale: 32.0 2023-06-18 03:03:18,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=107400.0, ans=0.0 2023-06-18 03:03:43,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=107460.0, ans=0.0 2023-06-18 03:04:10,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.685e+02 4.154e+02 5.194e+02 6.786e+02 1.159e+03, threshold=1.039e+03, percent-clipped=5.0 2023-06-18 03:04:16,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=107580.0, ans=0.1 2023-06-18 03:04:55,227 INFO [train.py:996] (2/4) Epoch 1, batch 17950, loss[loss=0.2983, simple_loss=0.3628, pruned_loss=0.117, over 21762.00 frames. ], tot_loss[loss=0.3569, simple_loss=0.4053, pruned_loss=0.1543, over 4270945.83 frames. ], batch size: 332, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:05:10,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=107700.0, ans=0.125 2023-06-18 03:05:36,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=107820.0, ans=0.1 2023-06-18 03:06:35,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=107940.0, ans=0.125 2023-06-18 03:06:38,450 INFO [train.py:996] (2/4) Epoch 1, batch 18000, loss[loss=0.3469, simple_loss=0.3744, pruned_loss=0.1597, over 21752.00 frames. ], tot_loss[loss=0.3506, simple_loss=0.3968, pruned_loss=0.1522, over 4268906.48 frames. ], batch size: 371, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:06:38,451 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 03:06:57,892 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3324, simple_loss=0.4216, pruned_loss=0.1216, over 1796401.00 frames. 2023-06-18 03:06:57,893 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 03:07:17,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108000.0, ans=0.1 2023-06-18 03:07:42,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=108120.0, ans=0.125 2023-06-18 03:08:03,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 3.501e+02 4.751e+02 6.240e+02 1.819e+03, threshold=9.502e+02, percent-clipped=6.0 2023-06-18 03:08:41,378 INFO [train.py:996] (2/4) Epoch 1, batch 18050, loss[loss=0.311, simple_loss=0.3531, pruned_loss=0.1344, over 21654.00 frames. ], tot_loss[loss=0.3459, simple_loss=0.3902, pruned_loss=0.1508, over 4257540.53 frames. 
], batch size: 298, lr: 2.79e-02, grad_scale: 32.0 2023-06-18 03:09:25,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=108420.0, ans=0.0 2023-06-18 03:09:42,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=108420.0, ans=0.07 2023-06-18 03:09:57,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=108480.0, ans=0.1 2023-06-18 03:10:15,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=108540.0, ans=0.125 2023-06-18 03:10:19,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=108540.0, ans=0.125 2023-06-18 03:10:33,517 INFO [train.py:996] (2/4) Epoch 1, batch 18100, loss[loss=0.354, simple_loss=0.4079, pruned_loss=0.1501, over 21433.00 frames. ], tot_loss[loss=0.3559, simple_loss=0.3985, pruned_loss=0.1566, over 4252194.95 frames. ], batch size: 131, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:10:38,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=108600.0, ans=0.125 2023-06-18 03:11:30,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=108720.0, ans=0.125 2023-06-18 03:11:33,731 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.544e+02 4.148e+02 5.231e+02 6.853e+02 1.250e+03, threshold=1.046e+03, percent-clipped=5.0 2023-06-18 03:11:46,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=108780.0, ans=0.125 2023-06-18 03:12:07,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=108840.0, ans=0.125 2023-06-18 03:12:17,540 INFO [train.py:996] (2/4) Epoch 1, batch 18150, loss[loss=0.292, simple_loss=0.3321, pruned_loss=0.1259, over 21381.00 frames. ], tot_loss[loss=0.3546, simple_loss=0.3988, pruned_loss=0.1552, over 4254487.74 frames. ], batch size: 131, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:12:17,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=108900.0, ans=0.125 2023-06-18 03:12:55,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-18 03:13:03,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=109020.0, ans=0.125 2023-06-18 03:13:40,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.25 vs. limit=6.0 2023-06-18 03:13:43,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.62 vs. limit=22.5 2023-06-18 03:13:49,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=109140.0, ans=0.2 2023-06-18 03:13:58,936 INFO [train.py:996] (2/4) Epoch 1, batch 18200, loss[loss=0.3079, simple_loss=0.3495, pruned_loss=0.1331, over 21685.00 frames. 
], tot_loss[loss=0.3499, simple_loss=0.3913, pruned_loss=0.1543, over 4254266.71 frames. ], batch size: 282, lr: 2.78e-02, grad_scale: 32.0 2023-06-18 03:14:13,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=109260.0, ans=0.015 2023-06-18 03:14:56,481 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.781e+02 3.703e+02 5.000e+02 6.238e+02 9.945e+02, threshold=1.000e+03, percent-clipped=0.0 2023-06-18 03:15:00,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=109380.0, ans=0.125 2023-06-18 03:15:31,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2023-06-18 03:15:33,226 INFO [train.py:996] (2/4) Epoch 1, batch 18250, loss[loss=0.3121, simple_loss=0.3595, pruned_loss=0.1324, over 21799.00 frames. ], tot_loss[loss=0.3356, simple_loss=0.3782, pruned_loss=0.1465, over 4255567.35 frames. ], batch size: 124, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:16:23,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=109620.0, ans=0.1 2023-06-18 03:16:59,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.65 vs. limit=10.0 2023-06-18 03:17:02,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=109740.0, ans=0.125 2023-06-18 03:17:17,178 INFO [train.py:996] (2/4) Epoch 1, batch 18300, loss[loss=0.4629, simple_loss=0.4749, pruned_loss=0.2255, over 21654.00 frames. ], tot_loss[loss=0.336, simple_loss=0.3778, pruned_loss=0.1472, over 4263752.86 frames. ], batch size: 507, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:17:49,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=109860.0, ans=0.2 2023-06-18 03:18:07,895 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.56 vs. limit=6.0 2023-06-18 03:18:21,688 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 3.385e+02 4.237e+02 5.760e+02 9.388e+02, threshold=8.473e+02, percent-clipped=0.0 2023-06-18 03:19:00,061 INFO [train.py:996] (2/4) Epoch 1, batch 18350, loss[loss=0.258, simple_loss=0.3153, pruned_loss=0.1003, over 17472.00 frames. ], tot_loss[loss=0.3402, simple_loss=0.3847, pruned_loss=0.1479, over 4260783.53 frames. 
], batch size: 67, lr: 2.77e-02, grad_scale: 32.0 2023-06-18 03:19:26,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=110160.0, ans=0.1 2023-06-18 03:19:37,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=110220.0, ans=22.5 2023-06-18 03:20:01,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=110220.0, ans=0.0 2023-06-18 03:20:19,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=110280.0, ans=0.0 2023-06-18 03:20:20,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.85 vs. limit=6.0 2023-06-18 03:20:28,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=110340.0, ans=0.125 2023-06-18 03:20:42,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=110400.0, ans=0.125 2023-06-18 03:20:44,057 INFO [train.py:996] (2/4) Epoch 1, batch 18400, loss[loss=0.3867, simple_loss=0.4334, pruned_loss=0.17, over 21713.00 frames. ], tot_loss[loss=0.3382, simple_loss=0.3817, pruned_loss=0.1473, over 4250320.48 frames. ], batch size: 415, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:21:48,648 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.132e+02 3.866e+02 5.099e+02 1.496e+03, threshold=7.733e+02, percent-clipped=6.0 2023-06-18 03:22:07,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=110580.0, ans=0.125 2023-06-18 03:22:31,366 INFO [train.py:996] (2/4) Epoch 1, batch 18450, loss[loss=0.2824, simple_loss=0.3406, pruned_loss=0.1121, over 20800.00 frames. ], tot_loss[loss=0.33, simple_loss=0.3772, pruned_loss=0.1414, over 4243427.32 frames. ], batch size: 609, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:22:36,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=110700.0, ans=0.0 2023-06-18 03:22:48,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-18 03:23:26,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=110820.0, ans=0.04949747468305833 2023-06-18 03:23:55,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=110940.0, ans=0.07 2023-06-18 03:23:56,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.36 vs. limit=15.0 2023-06-18 03:23:59,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=110940.0, ans=0.0 2023-06-18 03:24:10,271 INFO [train.py:996] (2/4) Epoch 1, batch 18500, loss[loss=0.3002, simple_loss=0.3336, pruned_loss=0.1334, over 21639.00 frames. ], tot_loss[loss=0.3249, simple_loss=0.3719, pruned_loss=0.1389, over 4240003.37 frames. 
], batch size: 263, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:25:14,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 3.536e+02 4.866e+02 6.375e+02 1.291e+03, threshold=9.732e+02, percent-clipped=16.0 2023-06-18 03:25:38,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=111240.0, ans=0.125 2023-06-18 03:25:52,428 INFO [train.py:996] (2/4) Epoch 1, batch 18550, loss[loss=0.3235, simple_loss=0.3607, pruned_loss=0.1431, over 21619.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.3697, pruned_loss=0.1379, over 4228127.75 frames. ], batch size: 332, lr: 2.76e-02, grad_scale: 32.0 2023-06-18 03:27:18,854 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:27:42,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.90 vs. limit=15.0 2023-06-18 03:27:44,762 INFO [train.py:996] (2/4) Epoch 1, batch 18600, loss[loss=0.3027, simple_loss=0.3583, pruned_loss=0.1235, over 21797.00 frames. ], tot_loss[loss=0.3261, simple_loss=0.3708, pruned_loss=0.1407, over 4235754.14 frames. ], batch size: 282, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:28:44,876 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.656e+02 4.245e+02 5.529e+02 8.990e+02, threshold=8.491e+02, percent-clipped=0.0 2023-06-18 03:28:48,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=111780.0, ans=0.04949747468305833 2023-06-18 03:29:07,569 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.33 vs. limit=15.0 2023-06-18 03:29:27,680 INFO [train.py:996] (2/4) Epoch 1, batch 18650, loss[loss=0.2915, simple_loss=0.3399, pruned_loss=0.1215, over 21501.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3695, pruned_loss=0.1411, over 4234703.51 frames. ], batch size: 212, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:30:13,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=112020.0, ans=0.015 2023-06-18 03:30:51,527 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.52 vs. limit=6.0 2023-06-18 03:31:04,858 INFO [train.py:996] (2/4) Epoch 1, batch 18700, loss[loss=0.364, simple_loss=0.389, pruned_loss=0.1695, over 21407.00 frames. ], tot_loss[loss=0.3273, simple_loss=0.3686, pruned_loss=0.143, over 4237799.96 frames. ], batch size: 473, lr: 2.75e-02, grad_scale: 32.0 2023-06-18 03:31:31,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=112260.0, ans=0.0 2023-06-18 03:31:33,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.97 vs. limit=15.0 2023-06-18 03:32:03,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.596e+02 4.888e+02 6.142e+02 1.184e+03, threshold=9.776e+02, percent-clipped=7.0 2023-06-18 03:32:47,397 INFO [train.py:996] (2/4) Epoch 1, batch 18750, loss[loss=0.3124, simple_loss=0.3396, pruned_loss=0.1426, over 20266.00 frames. ], tot_loss[loss=0.3308, simple_loss=0.3702, pruned_loss=0.1457, over 4249824.07 frames. 
], batch size: 703, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:32:49,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-18 03:34:04,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=112680.0, ans=0.0 2023-06-18 03:34:05,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=112680.0, ans=0.0 2023-06-18 03:34:15,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=112740.0, ans=0.125 2023-06-18 03:34:19,476 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=15.0 2023-06-18 03:34:30,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=112800.0, ans=0.0 2023-06-18 03:34:31,345 INFO [train.py:996] (2/4) Epoch 1, batch 18800, loss[loss=0.2357, simple_loss=0.3025, pruned_loss=0.08449, over 21459.00 frames. ], tot_loss[loss=0.3361, simple_loss=0.3766, pruned_loss=0.1478, over 4256061.67 frames. ], batch size: 194, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:34:52,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=112860.0, ans=0.0 2023-06-18 03:35:38,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 3.151e+02 4.208e+02 5.876e+02 1.169e+03, threshold=8.416e+02, percent-clipped=1.0 2023-06-18 03:35:51,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=112980.0, ans=0.2 2023-06-18 03:36:15,774 INFO [train.py:996] (2/4) Epoch 1, batch 18850, loss[loss=0.2872, simple_loss=0.3675, pruned_loss=0.1035, over 21769.00 frames. ], tot_loss[loss=0.32, simple_loss=0.3673, pruned_loss=0.1363, over 4257133.86 frames. ], batch size: 391, lr: 2.74e-02, grad_scale: 32.0 2023-06-18 03:36:41,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=113160.0, ans=0.125 2023-06-18 03:36:46,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.95 vs. limit=15.0 2023-06-18 03:36:54,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=113220.0, ans=0.0 2023-06-18 03:37:04,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=113220.0, ans=0.0 2023-06-18 03:37:04,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=113220.0, ans=0.0 2023-06-18 03:37:45,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=113340.0, ans=0.0 2023-06-18 03:37:55,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=22.5 2023-06-18 03:37:57,714 INFO [train.py:996] (2/4) Epoch 1, batch 18900, loss[loss=0.3878, simple_loss=0.4073, pruned_loss=0.1842, over 21894.00 frames. 
], tot_loss[loss=0.3213, simple_loss=0.3658, pruned_loss=0.1384, over 4253578.14 frames. ], batch size: 351, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:38:05,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-18 03:38:17,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=113460.0, ans=0.2 2023-06-18 03:38:22,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=113460.0, ans=0.125 2023-06-18 03:38:37,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=113520.0, ans=0.125 2023-06-18 03:38:57,660 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.397e+02 4.704e+02 6.198e+02 1.365e+03, threshold=9.409e+02, percent-clipped=10.0 2023-06-18 03:39:05,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=113580.0, ans=0.1 2023-06-18 03:39:48,149 INFO [train.py:996] (2/4) Epoch 1, batch 18950, loss[loss=0.3241, simple_loss=0.3698, pruned_loss=0.1392, over 21659.00 frames. ], tot_loss[loss=0.3252, simple_loss=0.3673, pruned_loss=0.1415, over 4262769.03 frames. ], batch size: 263, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:39:58,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=113700.0, ans=0.125 2023-06-18 03:40:12,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=113760.0, ans=0.125 2023-06-18 03:40:58,093 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.26 vs. limit=12.0 2023-06-18 03:41:05,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=113880.0, ans=0.0 2023-06-18 03:41:31,611 INFO [train.py:996] (2/4) Epoch 1, batch 19000, loss[loss=0.3907, simple_loss=0.4315, pruned_loss=0.175, over 21504.00 frames. ], tot_loss[loss=0.3328, simple_loss=0.3777, pruned_loss=0.144, over 4271560.16 frames. ], batch size: 194, lr: 2.73e-02, grad_scale: 32.0 2023-06-18 03:41:46,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=114060.0, ans=0.125 2023-06-18 03:42:24,419 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:42:37,052 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 4.163e+02 4.936e+02 6.528e+02 1.667e+03, threshold=9.873e+02, percent-clipped=8.0 2023-06-18 03:43:15,075 INFO [train.py:996] (2/4) Epoch 1, batch 19050, loss[loss=0.4154, simple_loss=0.4296, pruned_loss=0.2006, over 21715.00 frames. ], tot_loss[loss=0.3421, simple_loss=0.3849, pruned_loss=0.1496, over 4272775.67 frames. 
], batch size: 475, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:43:47,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=114360.0, ans=0.125 2023-06-18 03:43:47,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=114360.0, ans=0.125 2023-06-18 03:43:48,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=114360.0, ans=0.0 2023-06-18 03:44:58,934 INFO [train.py:996] (2/4) Epoch 1, batch 19100, loss[loss=0.3242, simple_loss=0.3577, pruned_loss=0.1453, over 21178.00 frames. ], tot_loss[loss=0.344, simple_loss=0.3841, pruned_loss=0.1519, over 4268261.36 frames. ], batch size: 608, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:45:04,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=114600.0, ans=0.125 2023-06-18 03:45:18,786 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 03:45:28,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-18 03:45:40,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.15 vs. limit=10.0 2023-06-18 03:46:04,828 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.557e+02 3.849e+02 4.992e+02 6.577e+02 2.048e+03, threshold=9.985e+02, percent-clipped=3.0 2023-06-18 03:46:41,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=114840.0, ans=0.2 2023-06-18 03:46:43,852 INFO [train.py:996] (2/4) Epoch 1, batch 19150, loss[loss=0.3698, simple_loss=0.4352, pruned_loss=0.1522, over 21866.00 frames. ], tot_loss[loss=0.3468, simple_loss=0.3871, pruned_loss=0.1532, over 4273126.21 frames. ], batch size: 317, lr: 2.72e-02, grad_scale: 32.0 2023-06-18 03:46:44,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=114900.0, ans=0.125 2023-06-18 03:46:47,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=114900.0, ans=0.1 2023-06-18 03:47:48,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=115020.0, ans=0.125 2023-06-18 03:47:54,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-18 03:48:13,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.90 vs. limit=10.0 2023-06-18 03:48:29,967 INFO [train.py:996] (2/4) Epoch 1, batch 19200, loss[loss=0.3571, simple_loss=0.4299, pruned_loss=0.1422, over 21754.00 frames. ], tot_loss[loss=0.354, simple_loss=0.3993, pruned_loss=0.1544, over 4278559.76 frames. 
], batch size: 332, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:48:53,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=115260.0, ans=0.125 2023-06-18 03:49:07,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=22.5 2023-06-18 03:49:20,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=115320.0, ans=0.0 2023-06-18 03:49:30,837 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.474e+02 4.215e+02 5.397e+02 9.229e+02, threshold=8.431e+02, percent-clipped=0.0 2023-06-18 03:49:34,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=115380.0, ans=0.05 2023-06-18 03:49:49,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=115380.0, ans=0.125 2023-06-18 03:49:51,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=22.5 2023-06-18 03:49:56,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=115440.0, ans=0.125 2023-06-18 03:50:08,871 INFO [train.py:996] (2/4) Epoch 1, batch 19250, loss[loss=0.342, simple_loss=0.4247, pruned_loss=0.1297, over 19766.00 frames. ], tot_loss[loss=0.3427, simple_loss=0.395, pruned_loss=0.1452, over 4268706.10 frames. ], batch size: 702, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:50:11,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=115500.0, ans=0.1 2023-06-18 03:50:38,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=115560.0, ans=0.125 2023-06-18 03:51:50,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.69 vs. limit=6.0 2023-06-18 03:51:52,359 INFO [train.py:996] (2/4) Epoch 1, batch 19300, loss[loss=0.4011, simple_loss=0.4723, pruned_loss=0.1649, over 19737.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.3913, pruned_loss=0.1438, over 4276626.67 frames. ], batch size: 703, lr: 2.71e-02, grad_scale: 32.0 2023-06-18 03:52:38,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-18 03:53:02,619 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 3.240e+02 4.224e+02 5.313e+02 1.250e+03, threshold=8.447e+02, percent-clipped=7.0 2023-06-18 03:53:18,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=116040.0, ans=0.125 2023-06-18 03:53:41,280 INFO [train.py:996] (2/4) Epoch 1, batch 19350, loss[loss=0.2891, simple_loss=0.3516, pruned_loss=0.1133, over 21758.00 frames. ], tot_loss[loss=0.331, simple_loss=0.3843, pruned_loss=0.1389, over 4269723.05 frames. ], batch size: 282, lr: 2.71e-02, grad_scale: 64.0 2023-06-18 03:54:06,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. 
limit=15.0 2023-06-18 03:54:20,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=116160.0, ans=0.125 2023-06-18 03:54:37,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=116220.0, ans=0.2 2023-06-18 03:55:09,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=116340.0, ans=0.125 2023-06-18 03:55:24,582 INFO [train.py:996] (2/4) Epoch 1, batch 19400, loss[loss=0.4447, simple_loss=0.4425, pruned_loss=0.2235, over 21718.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.3824, pruned_loss=0.1383, over 4275735.27 frames. ], batch size: 508, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:55:56,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=116460.0, ans=0.125 2023-06-18 03:56:03,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=22.5 2023-06-18 03:56:29,466 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.749e+02 4.636e+02 5.829e+02 1.066e+03, threshold=9.272e+02, percent-clipped=6.0 2023-06-18 03:57:05,456 INFO [train.py:996] (2/4) Epoch 1, batch 19450, loss[loss=0.3294, simple_loss=0.351, pruned_loss=0.1539, over 21256.00 frames. ], tot_loss[loss=0.332, simple_loss=0.3801, pruned_loss=0.142, over 4281298.61 frames. ], batch size: 144, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:58:28,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=116940.0, ans=0.1 2023-06-18 03:58:42,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=116940.0, ans=0.2 2023-06-18 03:58:56,319 INFO [train.py:996] (2/4) Epoch 1, batch 19500, loss[loss=0.3609, simple_loss=0.4031, pruned_loss=0.1594, over 21562.00 frames. ], tot_loss[loss=0.3328, simple_loss=0.3767, pruned_loss=0.1444, over 4279959.81 frames. ], batch size: 389, lr: 2.70e-02, grad_scale: 32.0 2023-06-18 03:59:09,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-18 03:59:16,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117060.0, ans=0.1 2023-06-18 03:59:48,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. 
limit=12.0 2023-06-18 03:59:57,085 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.820e+02 4.726e+02 6.793e+02 1.461e+03, threshold=9.451e+02, percent-clipped=7.0 2023-06-18 03:59:57,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=117180.0, ans=0.0 2023-06-18 04:00:07,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=117180.0, ans=0.1 2023-06-18 04:00:25,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=117240.0, ans=0.125 2023-06-18 04:00:27,757 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-06-18 04:00:32,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=12.0 2023-06-18 04:00:32,887 INFO [train.py:996] (2/4) Epoch 1, batch 19550, loss[loss=0.2697, simple_loss=0.3168, pruned_loss=0.1113, over 21388.00 frames. ], tot_loss[loss=0.3248, simple_loss=0.3694, pruned_loss=0.1401, over 4262371.92 frames. ], batch size: 194, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:00:56,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=117300.0, ans=0.0 2023-06-18 04:01:18,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=117420.0, ans=0.125 2023-06-18 04:01:39,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=117480.0, ans=0.125 2023-06-18 04:01:44,062 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.60 vs. limit=22.5 2023-06-18 04:01:57,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=117540.0, ans=0.025 2023-06-18 04:02:13,869 INFO [train.py:996] (2/4) Epoch 1, batch 19600, loss[loss=0.3605, simple_loss=0.3933, pruned_loss=0.1639, over 21860.00 frames. ], tot_loss[loss=0.3295, simple_loss=0.3733, pruned_loss=0.1428, over 4267808.03 frames. ], batch size: 351, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:02:49,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-18 04:02:52,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=117660.0, ans=0.95 2023-06-18 04:03:03,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=117720.0, ans=0.0 2023-06-18 04:03:11,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.70 vs. 
limit=6.0 2023-06-18 04:03:20,213 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.375e+02 3.478e+02 4.292e+02 5.648e+02 1.125e+03, threshold=8.585e+02, percent-clipped=2.0 2023-06-18 04:03:20,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=117780.0, ans=0.125 2023-06-18 04:03:35,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=117840.0, ans=0.1 2023-06-18 04:04:03,584 INFO [train.py:996] (2/4) Epoch 1, batch 19650, loss[loss=0.3475, simple_loss=0.3881, pruned_loss=0.1535, over 20849.00 frames. ], tot_loss[loss=0.3399, simple_loss=0.3806, pruned_loss=0.1495, over 4270614.63 frames. ], batch size: 607, lr: 2.69e-02, grad_scale: 32.0 2023-06-18 04:04:31,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=117960.0, ans=0.125 2023-06-18 04:04:45,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=117960.0, ans=0.0 2023-06-18 04:05:07,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=118080.0, ans=0.125 2023-06-18 04:05:38,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.31 vs. limit=22.5 2023-06-18 04:05:41,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.48 vs. limit=15.0 2023-06-18 04:05:52,072 INFO [train.py:996] (2/4) Epoch 1, batch 19700, loss[loss=0.3217, simple_loss=0.3765, pruned_loss=0.1334, over 21596.00 frames. ], tot_loss[loss=0.3441, simple_loss=0.3852, pruned_loss=0.1515, over 4260382.21 frames. ], batch size: 263, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:05:55,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=118200.0, ans=0.0 2023-06-18 04:06:17,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=118260.0, ans=0.0 2023-06-18 04:06:24,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=118260.0, ans=0.0 2023-06-18 04:06:34,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=118320.0, ans=0.05 2023-06-18 04:06:48,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-18 04:06:58,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=118380.0, ans=0.2 2023-06-18 04:06:59,776 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.705e+02 3.771e+02 4.552e+02 5.763e+02 1.165e+03, threshold=9.104e+02, percent-clipped=3.0 2023-06-18 04:07:11,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=118380.0, ans=0.015 2023-06-18 04:07:30,615 INFO [train.py:996] (2/4) Epoch 1, batch 19750, loss[loss=0.3249, simple_loss=0.387, pruned_loss=0.1314, over 21393.00 frames. ], tot_loss[loss=0.35, simple_loss=0.3947, pruned_loss=0.1527, over 4256144.41 frames. 
], batch size: 176, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:07:55,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=118560.0, ans=0.04949747468305833 2023-06-18 04:08:43,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=118680.0, ans=0.0 2023-06-18 04:08:50,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-06-18 04:08:51,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=118680.0, ans=0.0 2023-06-18 04:09:17,515 INFO [train.py:996] (2/4) Epoch 1, batch 19800, loss[loss=0.2535, simple_loss=0.2807, pruned_loss=0.1132, over 21890.00 frames. ], tot_loss[loss=0.3513, simple_loss=0.3943, pruned_loss=0.1541, over 4272292.78 frames. ], batch size: 98, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:09:20,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-18 04:09:59,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=118920.0, ans=0.125 2023-06-18 04:10:23,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.634e+02 4.450e+02 5.874e+02 9.997e+02, threshold=8.899e+02, percent-clipped=2.0 2023-06-18 04:10:56,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.73 vs. limit=22.5 2023-06-18 04:11:00,160 INFO [train.py:996] (2/4) Epoch 1, batch 19850, loss[loss=0.2899, simple_loss=0.3544, pruned_loss=0.1127, over 21728.00 frames. ], tot_loss[loss=0.3379, simple_loss=0.384, pruned_loss=0.1459, over 4256660.93 frames. ], batch size: 351, lr: 2.68e-02, grad_scale: 32.0 2023-06-18 04:12:01,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=119220.0, ans=10.0 2023-06-18 04:12:08,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=119280.0, ans=0.07 2023-06-18 04:12:45,766 INFO [train.py:996] (2/4) Epoch 1, batch 19900, loss[loss=0.2811, simple_loss=0.3487, pruned_loss=0.1067, over 21170.00 frames. ], tot_loss[loss=0.3362, simple_loss=0.3853, pruned_loss=0.1436, over 4254265.07 frames. ], batch size: 159, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:13:12,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=119460.0, ans=0.125 2023-06-18 04:13:51,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 3.596e+02 4.410e+02 6.393e+02 1.239e+03, threshold=8.821e+02, percent-clipped=7.0 2023-06-18 04:13:56,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=119580.0, ans=0.125 2023-06-18 04:14:27,887 INFO [train.py:996] (2/4) Epoch 1, batch 19950, loss[loss=0.3377, simple_loss=0.3599, pruned_loss=0.1578, over 21859.00 frames. ], tot_loss[loss=0.3331, simple_loss=0.3793, pruned_loss=0.1435, over 4262255.28 frames. 
], batch size: 98, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:14:48,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.84 vs. limit=22.5 2023-06-18 04:15:18,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-18 04:15:50,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=119880.0, ans=0.125 2023-06-18 04:16:12,234 INFO [train.py:996] (2/4) Epoch 1, batch 20000, loss[loss=0.3233, simple_loss=0.3761, pruned_loss=0.1353, over 21669.00 frames. ], tot_loss[loss=0.3342, simple_loss=0.3805, pruned_loss=0.144, over 4262697.60 frames. ], batch size: 263, lr: 2.67e-02, grad_scale: 32.0 2023-06-18 04:16:19,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=120000.0, ans=0.0 2023-06-18 04:16:22,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=120000.0, ans=0.125 2023-06-18 04:16:32,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=120060.0, ans=0.0 2023-06-18 04:17:12,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.26 vs. limit=6.0 2023-06-18 04:17:18,784 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.616e+02 4.426e+02 6.098e+02 1.164e+03, threshold=8.852e+02, percent-clipped=3.0 2023-06-18 04:17:37,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=120240.0, ans=0.5 2023-06-18 04:17:49,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.46 vs. limit=10.0 2023-06-18 04:17:52,867 INFO [train.py:996] (2/4) Epoch 1, batch 20050, loss[loss=0.3746, simple_loss=0.3983, pruned_loss=0.1754, over 20008.00 frames. ], tot_loss[loss=0.3407, simple_loss=0.384, pruned_loss=0.1487, over 4268635.47 frames. ], batch size: 702, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:17:55,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.41 vs. limit=15.0 2023-06-18 04:18:34,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.62 vs. limit=5.0 2023-06-18 04:19:12,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=120480.0, ans=0.035 2023-06-18 04:19:37,711 INFO [train.py:996] (2/4) Epoch 1, batch 20100, loss[loss=0.4479, simple_loss=0.4788, pruned_loss=0.2085, over 21598.00 frames. ], tot_loss[loss=0.3459, simple_loss=0.3872, pruned_loss=0.1523, over 4273401.21 frames. 
], batch size: 471, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:19:59,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=120600.0, ans=0.125 2023-06-18 04:20:30,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=120720.0, ans=0.125 2023-06-18 04:20:45,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=120720.0, ans=10.0 2023-06-18 04:20:51,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.579e+02 3.978e+02 4.839e+02 6.470e+02 1.176e+03, threshold=9.678e+02, percent-clipped=4.0 2023-06-18 04:21:09,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.40 vs. limit=22.5 2023-06-18 04:21:32,586 INFO [train.py:996] (2/4) Epoch 1, batch 20150, loss[loss=0.3707, simple_loss=0.412, pruned_loss=0.1647, over 21322.00 frames. ], tot_loss[loss=0.3559, simple_loss=0.3987, pruned_loss=0.1565, over 4274487.67 frames. ], batch size: 159, lr: 2.66e-02, grad_scale: 32.0 2023-06-18 04:21:52,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=120900.0, ans=0.125 2023-06-18 04:22:13,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=22.5 2023-06-18 04:22:16,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=121020.0, ans=0.125 2023-06-18 04:22:24,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=121020.0, ans=0.0 2023-06-18 04:23:23,950 INFO [train.py:996] (2/4) Epoch 1, batch 20200, loss[loss=0.361, simple_loss=0.4313, pruned_loss=0.1454, over 21827.00 frames. ], tot_loss[loss=0.3612, simple_loss=0.4036, pruned_loss=0.1594, over 4269408.61 frames. ], batch size: 316, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:23:28,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=14.66 vs. limit=15.0 2023-06-18 04:23:36,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=121200.0, ans=0.125 2023-06-18 04:23:44,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=121260.0, ans=0.2 2023-06-18 04:24:26,089 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 4.003e+02 5.201e+02 6.811e+02 1.420e+03, threshold=1.040e+03, percent-clipped=11.0 2023-06-18 04:24:41,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-18 04:25:05,815 INFO [train.py:996] (2/4) Epoch 1, batch 20250, loss[loss=0.3352, simple_loss=0.3845, pruned_loss=0.1429, over 21419.00 frames. ], tot_loss[loss=0.3581, simple_loss=0.4029, pruned_loss=0.1566, over 4274188.64 frames. 
], batch size: 211, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:25:24,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=121500.0, ans=0.125 2023-06-18 04:25:32,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=121560.0, ans=0.5 2023-06-18 04:26:47,845 INFO [train.py:996] (2/4) Epoch 1, batch 20300, loss[loss=0.332, simple_loss=0.3969, pruned_loss=0.1335, over 21600.00 frames. ], tot_loss[loss=0.3506, simple_loss=0.3986, pruned_loss=0.1513, over 4279206.21 frames. ], batch size: 389, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:27:53,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=121980.0, ans=0.125 2023-06-18 04:27:54,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=121980.0, ans=0.02 2023-06-18 04:27:55,668 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 3.171e+02 3.718e+02 4.802e+02 8.828e+02, threshold=7.436e+02, percent-clipped=0.0 2023-06-18 04:28:04,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=121980.0, ans=0.125 2023-06-18 04:28:15,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=122040.0, ans=0.125 2023-06-18 04:28:28,665 INFO [train.py:996] (2/4) Epoch 1, batch 20350, loss[loss=0.3684, simple_loss=0.4004, pruned_loss=0.1682, over 21293.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3999, pruned_loss=0.1524, over 4285216.49 frames. ], batch size: 143, lr: 2.65e-02, grad_scale: 32.0 2023-06-18 04:28:29,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=122100.0, ans=0.125 2023-06-18 04:28:29,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=122100.0, ans=0.0 2023-06-18 04:28:40,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=122100.0, ans=0.0 2023-06-18 04:28:43,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=122100.0, ans=0.125 2023-06-18 04:29:10,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.86 vs. limit=10.0 2023-06-18 04:30:16,415 INFO [train.py:996] (2/4) Epoch 1, batch 20400, loss[loss=0.3346, simple_loss=0.3834, pruned_loss=0.1429, over 21377.00 frames. ], tot_loss[loss=0.3587, simple_loss=0.4043, pruned_loss=0.1566, over 4270018.82 frames. 
], batch size: 131, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:30:21,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=122400.0, ans=0.125 2023-06-18 04:30:30,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=122400.0, ans=0.125 2023-06-18 04:30:30,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=122400.0, ans=0.125 2023-06-18 04:30:52,907 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-18 04:31:13,483 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 4.014e+02 4.909e+02 5.768e+02 1.154e+03, threshold=9.817e+02, percent-clipped=10.0 2023-06-18 04:31:26,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=122580.0, ans=0.2 2023-06-18 04:31:53,073 INFO [train.py:996] (2/4) Epoch 1, batch 20450, loss[loss=0.3639, simple_loss=0.3949, pruned_loss=0.1665, over 21857.00 frames. ], tot_loss[loss=0.3628, simple_loss=0.4059, pruned_loss=0.1599, over 4261958.46 frames. ], batch size: 107, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:31:54,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=122700.0, ans=22.5 2023-06-18 04:32:13,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=122760.0, ans=0.125 2023-06-18 04:32:20,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=122760.0, ans=0.125 2023-06-18 04:33:33,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=123000.0, ans=0.125 2023-06-18 04:33:34,869 INFO [train.py:996] (2/4) Epoch 1, batch 20500, loss[loss=0.361, simple_loss=0.3892, pruned_loss=0.1664, over 21791.00 frames. ], tot_loss[loss=0.36, simple_loss=0.4006, pruned_loss=0.1597, over 4263428.82 frames. ], batch size: 332, lr: 2.64e-02, grad_scale: 32.0 2023-06-18 04:33:43,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=123000.0, ans=0.0 2023-06-18 04:33:44,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=123000.0, ans=0.125 2023-06-18 04:34:43,265 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.920e+02 3.898e+02 4.731e+02 5.915e+02 1.084e+03, threshold=9.462e+02, percent-clipped=4.0 2023-06-18 04:34:54,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=123180.0, ans=0.1 2023-06-18 04:35:07,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=123240.0, ans=0.125 2023-06-18 04:35:23,680 INFO [train.py:996] (2/4) Epoch 1, batch 20550, loss[loss=0.3098, simple_loss=0.3336, pruned_loss=0.143, over 21580.00 frames. ], tot_loss[loss=0.3572, simple_loss=0.3958, pruned_loss=0.1593, over 4258514.29 frames. 
], batch size: 196, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:35:47,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2023-06-18 04:36:14,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=12.0 2023-06-18 04:36:29,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=123480.0, ans=0.125 2023-06-18 04:36:46,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=123540.0, ans=10.0 2023-06-18 04:37:01,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.90 vs. limit=22.5 2023-06-18 04:37:06,864 INFO [train.py:996] (2/4) Epoch 1, batch 20600, loss[loss=0.387, simple_loss=0.4279, pruned_loss=0.1731, over 21740.00 frames. ], tot_loss[loss=0.3528, simple_loss=0.3957, pruned_loss=0.1549, over 4257647.13 frames. ], batch size: 441, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:38:09,153 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.464e+02 4.439e+02 5.514e+02 9.400e+02, threshold=8.878e+02, percent-clipped=0.0 2023-06-18 04:38:38,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=123840.0, ans=0.0 2023-06-18 04:38:48,811 INFO [train.py:996] (2/4) Epoch 1, batch 20650, loss[loss=0.3184, simple_loss=0.3574, pruned_loss=0.1397, over 21889.00 frames. ], tot_loss[loss=0.352, simple_loss=0.3922, pruned_loss=0.1559, over 4259283.21 frames. ], batch size: 107, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:38:58,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=123900.0, ans=0.125 2023-06-18 04:39:12,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.13 vs. limit=15.0 2023-06-18 04:39:50,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=124080.0, ans=0.0 2023-06-18 04:40:31,338 INFO [train.py:996] (2/4) Epoch 1, batch 20700, loss[loss=0.2518, simple_loss=0.3115, pruned_loss=0.09602, over 21432.00 frames. ], tot_loss[loss=0.341, simple_loss=0.3829, pruned_loss=0.1495, over 4243560.64 frames. ], batch size: 194, lr: 2.63e-02, grad_scale: 32.0 2023-06-18 04:40:45,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=124260.0, ans=0.0 2023-06-18 04:40:52,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=124260.0, ans=0.0 2023-06-18 04:41:38,279 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.293e+02 3.859e+02 5.120e+02 8.262e+02, threshold=7.718e+02, percent-clipped=0.0 2023-06-18 04:42:12,831 INFO [train.py:996] (2/4) Epoch 1, batch 20750, loss[loss=0.3317, simple_loss=0.3934, pruned_loss=0.135, over 21345.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.3839, pruned_loss=0.1476, over 4239584.72 frames. 
], batch size: 194, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:42:31,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-18 04:43:28,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-18 04:43:31,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=124680.0, ans=0.0 2023-06-18 04:43:45,705 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.44 vs. limit=15.0 2023-06-18 04:43:56,116 INFO [train.py:996] (2/4) Epoch 1, batch 20800, loss[loss=0.335, simple_loss=0.3789, pruned_loss=0.1455, over 21623.00 frames. ], tot_loss[loss=0.3433, simple_loss=0.3875, pruned_loss=0.1496, over 4249254.69 frames. ], batch size: 332, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:45:08,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 3.772e+02 4.526e+02 5.632e+02 1.034e+03, threshold=9.051e+02, percent-clipped=9.0 2023-06-18 04:45:16,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.65 vs. limit=15.0 2023-06-18 04:45:36,938 INFO [train.py:996] (2/4) Epoch 1, batch 20850, loss[loss=0.3005, simple_loss=0.3448, pruned_loss=0.1281, over 21485.00 frames. ], tot_loss[loss=0.3352, simple_loss=0.3783, pruned_loss=0.1461, over 4257611.00 frames. ], batch size: 194, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:45:44,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-18 04:45:55,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-18 04:46:50,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=125280.0, ans=0.0 2023-06-18 04:46:58,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=125280.0, ans=0.125 2023-06-18 04:47:18,203 INFO [train.py:996] (2/4) Epoch 1, batch 20900, loss[loss=0.3092, simple_loss=0.3534, pruned_loss=0.1325, over 21503.00 frames. ], tot_loss[loss=0.3378, simple_loss=0.3789, pruned_loss=0.1484, over 4252568.71 frames. ], batch size: 212, lr: 2.62e-02, grad_scale: 32.0 2023-06-18 04:47:46,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=125460.0, ans=0.2 2023-06-18 04:47:49,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=125460.0, ans=0.125 2023-06-18 04:48:25,148 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 3.309e+02 3.915e+02 5.105e+02 1.001e+03, threshold=7.830e+02, percent-clipped=2.0 2023-06-18 04:48:27,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=125580.0, ans=0.0 2023-06-18 04:48:40,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.38 vs. 
limit=15.0 2023-06-18 04:48:41,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=125640.0, ans=0.125 2023-06-18 04:48:53,601 INFO [train.py:996] (2/4) Epoch 1, batch 20950, loss[loss=0.2614, simple_loss=0.3076, pruned_loss=0.1075, over 21189.00 frames. ], tot_loss[loss=0.3271, simple_loss=0.3722, pruned_loss=0.141, over 4238815.00 frames. ], batch size: 143, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:49:14,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=125760.0, ans=0.0 2023-06-18 04:50:33,103 INFO [train.py:996] (2/4) Epoch 1, batch 21000, loss[loss=0.3479, simple_loss=0.3725, pruned_loss=0.1616, over 21563.00 frames. ], tot_loss[loss=0.3293, simple_loss=0.3726, pruned_loss=0.143, over 4239027.28 frames. ], batch size: 548, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:50:33,104 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 04:50:50,131 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.3151, simple_loss=0.4075, pruned_loss=0.1114, over 1796401.00 frames. 2023-06-18 04:50:50,132 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 04:51:37,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.11 vs. limit=22.5 2023-06-18 04:52:01,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=126180.0, ans=0.125 2023-06-18 04:52:02,298 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 3.430e+02 4.586e+02 6.344e+02 1.913e+03, threshold=9.172e+02, percent-clipped=11.0 2023-06-18 04:52:06,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=126180.0, ans=0.1 2023-06-18 04:52:30,640 INFO [train.py:996] (2/4) Epoch 1, batch 21050, loss[loss=0.3567, simple_loss=0.3801, pruned_loss=0.1667, over 21415.00 frames. ], tot_loss[loss=0.3298, simple_loss=0.3716, pruned_loss=0.144, over 4243404.89 frames. ], batch size: 389, lr: 2.61e-02, grad_scale: 32.0 2023-06-18 04:52:35,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=126300.0, ans=0.125 2023-06-18 04:52:46,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=126360.0, ans=0.2 2023-06-18 04:54:03,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=126540.0, ans=0.05 2023-06-18 04:54:07,570 INFO [train.py:996] (2/4) Epoch 1, batch 21100, loss[loss=0.295, simple_loss=0.333, pruned_loss=0.1285, over 21585.00 frames. ], tot_loss[loss=0.3251, simple_loss=0.3662, pruned_loss=0.142, over 4231563.37 frames. 
], batch size: 298, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:54:09,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=126600.0, ans=0.0 2023-06-18 04:54:43,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=126720.0, ans=0.125 2023-06-18 04:55:14,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.438e+02 4.271e+02 5.279e+02 9.041e+02, threshold=8.542e+02, percent-clipped=0.0 2023-06-18 04:55:35,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=126840.0, ans=0.0 2023-06-18 04:55:43,342 INFO [train.py:996] (2/4) Epoch 1, batch 21150, loss[loss=0.323, simple_loss=0.3465, pruned_loss=0.1497, over 21638.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3621, pruned_loss=0.1422, over 4223198.44 frames. ], batch size: 282, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:55:53,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=126900.0, ans=0.125 2023-06-18 04:55:54,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=126900.0, ans=0.125 2023-06-18 04:56:03,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=126960.0, ans=0.2 2023-06-18 04:56:18,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=126960.0, ans=0.125 2023-06-18 04:56:24,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127020.0, ans=0.1 2023-06-18 04:56:26,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=127020.0, ans=0.1 2023-06-18 04:56:47,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn2.whiten.whitening_limit, batch_count=127080.0, ans=22.5 2023-06-18 04:57:20,006 INFO [train.py:996] (2/4) Epoch 1, batch 21200, loss[loss=0.2953, simple_loss=0.3406, pruned_loss=0.1251, over 21747.00 frames. ], tot_loss[loss=0.3185, simple_loss=0.3576, pruned_loss=0.1397, over 4234115.51 frames. ], batch size: 371, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:57:22,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-06-18 04:57:24,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=127200.0, ans=0.0 2023-06-18 04:57:43,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=127260.0, ans=0.0 2023-06-18 04:58:33,965 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.663e+02 4.545e+02 5.734e+02 1.350e+03, threshold=9.091e+02, percent-clipped=8.0 2023-06-18 04:58:39,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=127380.0, ans=0.0 2023-06-18 04:59:03,004 INFO [train.py:996] (2/4) Epoch 1, batch 21250, loss[loss=0.3132, simple_loss=0.3547, pruned_loss=0.1358, over 21383.00 frames. ], tot_loss[loss=0.3198, simple_loss=0.3578, pruned_loss=0.1409, over 4237859.78 frames. 
], batch size: 160, lr: 2.60e-02, grad_scale: 32.0 2023-06-18 04:59:08,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=127500.0, ans=0.1 2023-06-18 04:59:23,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=127560.0, ans=0.2 2023-06-18 04:59:30,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=127560.0, ans=0.125 2023-06-18 04:59:40,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-06-18 05:00:17,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=127680.0, ans=0.0 2023-06-18 05:00:31,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=127740.0, ans=0.2 2023-06-18 05:00:32,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=127740.0, ans=0.09899494936611666 2023-06-18 05:00:41,960 INFO [train.py:996] (2/4) Epoch 1, batch 21300, loss[loss=0.3753, simple_loss=0.4115, pruned_loss=0.1695, over 21908.00 frames. ], tot_loss[loss=0.3265, simple_loss=0.3643, pruned_loss=0.1444, over 4252627.83 frames. ], batch size: 415, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:01:48,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=127920.0, ans=0.125 2023-06-18 05:01:48,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=127920.0, ans=0.2 2023-06-18 05:01:54,314 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.515e+02 4.385e+02 5.674e+02 1.308e+03, threshold=8.770e+02, percent-clipped=8.0 2023-06-18 05:02:23,589 INFO [train.py:996] (2/4) Epoch 1, batch 21350, loss[loss=0.272, simple_loss=0.3425, pruned_loss=0.1007, over 21661.00 frames. ], tot_loss[loss=0.3299, simple_loss=0.3688, pruned_loss=0.1454, over 4262337.26 frames. ], batch size: 247, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:03:20,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-18 05:03:24,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=128220.0, ans=0.0 2023-06-18 05:03:58,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=128340.0, ans=0.125 2023-06-18 05:04:06,129 INFO [train.py:996] (2/4) Epoch 1, batch 21400, loss[loss=0.3568, simple_loss=0.3994, pruned_loss=0.1571, over 21754.00 frames. ], tot_loss[loss=0.3326, simple_loss=0.374, pruned_loss=0.1456, over 4260442.66 frames. 
], batch size: 332, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:04:08,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=128400.0, ans=0.2 2023-06-18 05:05:19,000 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.957e+02 3.244e+02 3.990e+02 4.956e+02 1.756e+03, threshold=7.981e+02, percent-clipped=8.0 2023-06-18 05:05:45,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=128640.0, ans=0.025 2023-06-18 05:05:46,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=128700.0, ans=0.0 2023-06-18 05:05:47,934 INFO [train.py:996] (2/4) Epoch 1, batch 21450, loss[loss=0.4444, simple_loss=0.4397, pruned_loss=0.2245, over 21792.00 frames. ], tot_loss[loss=0.3405, simple_loss=0.381, pruned_loss=0.15, over 4267729.50 frames. ], batch size: 507, lr: 2.59e-02, grad_scale: 32.0 2023-06-18 05:05:53,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-18 05:06:21,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5 2023-06-18 05:06:22,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=128760.0, ans=0.0 2023-06-18 05:06:23,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.90 vs. limit=10.0 2023-06-18 05:07:13,111 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:07:19,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=128940.0, ans=0.025 2023-06-18 05:07:22,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=129000.0, ans=0.0 2023-06-18 05:07:23,869 INFO [train.py:996] (2/4) Epoch 1, batch 21500, loss[loss=0.3016, simple_loss=0.3367, pruned_loss=0.1332, over 21576.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3797, pruned_loss=0.1519, over 4264658.93 frames. ], batch size: 247, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:08:35,886 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.287e+02 4.067e+02 5.300e+02 1.405e+03, threshold=8.134e+02, percent-clipped=7.0 2023-06-18 05:09:05,156 INFO [train.py:996] (2/4) Epoch 1, batch 21550, loss[loss=0.2299, simple_loss=0.2823, pruned_loss=0.0888, over 21215.00 frames. ], tot_loss[loss=0.3306, simple_loss=0.3688, pruned_loss=0.1462, over 4260882.16 frames. ], batch size: 176, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:10:05,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=129420.0, ans=0.2 2023-06-18 05:10:53,616 INFO [train.py:996] (2/4) Epoch 1, batch 21600, loss[loss=0.2845, simple_loss=0.3498, pruned_loss=0.1096, over 21580.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.363, pruned_loss=0.1428, over 4257327.39 frames. 
], batch size: 230, lr: 2.58e-02, grad_scale: 32.0 2023-06-18 05:11:39,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=129660.0, ans=0.0 2023-06-18 05:11:40,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=129720.0, ans=0.95 2023-06-18 05:11:50,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=129720.0, ans=0.125 2023-06-18 05:12:01,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.331e+02 4.167e+02 5.142e+02 1.133e+03, threshold=8.334e+02, percent-clipped=4.0 2023-06-18 05:12:03,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=129780.0, ans=0.1 2023-06-18 05:12:19,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.47 vs. limit=22.5 2023-06-18 05:12:34,528 INFO [train.py:996] (2/4) Epoch 1, batch 21650, loss[loss=0.2912, simple_loss=0.3524, pruned_loss=0.115, over 21786.00 frames. ], tot_loss[loss=0.3259, simple_loss=0.3701, pruned_loss=0.1409, over 4256124.21 frames. ], batch size: 112, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:12:43,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=129900.0, ans=0.0 2023-06-18 05:12:46,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=22.5 2023-06-18 05:14:15,286 INFO [train.py:996] (2/4) Epoch 1, batch 21700, loss[loss=0.3314, simple_loss=0.3661, pruned_loss=0.1483, over 21645.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3698, pruned_loss=0.1374, over 4258028.00 frames. ], batch size: 298, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:14:38,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=130260.0, ans=0.125 2023-06-18 05:15:16,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.479e+02 4.448e+02 5.687e+02 1.020e+03, threshold=8.895e+02, percent-clipped=10.0 2023-06-18 05:15:21,257 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-18 05:15:45,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=130440.0, ans=0.125 2023-06-18 05:15:50,777 INFO [train.py:996] (2/4) Epoch 1, batch 21750, loss[loss=0.341, simple_loss=0.3681, pruned_loss=0.1569, over 21968.00 frames. ], tot_loss[loss=0.3193, simple_loss=0.3654, pruned_loss=0.1366, over 4252331.19 frames. 
], batch size: 119, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:15:57,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=130500.0, ans=0.0 2023-06-18 05:16:42,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=130620.0, ans=0.125 2023-06-18 05:17:04,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=130680.0, ans=0.125 2023-06-18 05:17:08,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-18 05:17:34,238 INFO [train.py:996] (2/4) Epoch 1, batch 21800, loss[loss=0.3208, simple_loss=0.3362, pruned_loss=0.1527, over 20653.00 frames. ], tot_loss[loss=0.32, simple_loss=0.3629, pruned_loss=0.1386, over 4253601.43 frames. ], batch size: 608, lr: 2.57e-02, grad_scale: 32.0 2023-06-18 05:18:42,237 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.714e+02 4.449e+02 6.326e+02 1.060e+03, threshold=8.898e+02, percent-clipped=3.0 2023-06-18 05:19:16,664 INFO [train.py:996] (2/4) Epoch 1, batch 21850, loss[loss=0.35, simple_loss=0.3917, pruned_loss=0.1542, over 20746.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3675, pruned_loss=0.1392, over 4261689.10 frames. ], batch size: 609, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 05:20:05,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=131220.0, ans=0.125 2023-06-18 05:20:07,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=131220.0, ans=0.02 2023-06-18 05:20:10,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-18 05:20:11,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=131220.0, ans=0.1 2023-06-18 05:20:22,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=131280.0, ans=0.125 2023-06-18 05:20:26,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=131280.0, ans=0.0 2023-06-18 05:20:33,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=131280.0, ans=0.0 2023-06-18 05:20:58,039 INFO [train.py:996] (2/4) Epoch 1, batch 21900, loss[loss=0.3308, simple_loss=0.3588, pruned_loss=0.1514, over 21630.00 frames. ], tot_loss[loss=0.3266, simple_loss=0.3701, pruned_loss=0.1415, over 4266648.40 frames. ], batch size: 392, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 05:21:10,253 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:21:24,026 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.60 vs. 
limit=15.0 2023-06-18 05:21:56,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=131580.0, ans=0.0 2023-06-18 05:22:04,617 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.542e+02 3.404e+02 4.082e+02 5.077e+02 9.199e+02, threshold=8.164e+02, percent-clipped=1.0 2023-06-18 05:22:38,144 INFO [train.py:996] (2/4) Epoch 1, batch 21950, loss[loss=0.2435, simple_loss=0.2947, pruned_loss=0.09616, over 21757.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3635, pruned_loss=0.1392, over 4275342.76 frames. ], batch size: 118, lr: 2.56e-02, grad_scale: 32.0 2023-06-18 05:23:25,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=131820.0, ans=0.0 2023-06-18 05:23:40,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=131880.0, ans=0.2 2023-06-18 05:23:52,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=131880.0, ans=0.05 2023-06-18 05:24:05,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=131940.0, ans=0.1 2023-06-18 05:24:19,974 INFO [train.py:996] (2/4) Epoch 1, batch 22000, loss[loss=0.2424, simple_loss=0.3071, pruned_loss=0.08885, over 21736.00 frames. ], tot_loss[loss=0.316, simple_loss=0.3587, pruned_loss=0.1366, over 4272885.75 frames. ], batch size: 282, lr: 2.56e-02, grad_scale: 64.0 2023-06-18 05:25:29,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=132180.0, ans=0.125 2023-06-18 05:25:30,649 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.717e+02 4.714e+02 6.490e+02 1.072e+03, threshold=9.428e+02, percent-clipped=6.0 2023-06-18 05:25:32,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=132180.0, ans=0.1 2023-06-18 05:25:37,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=132180.0, ans=0.04949747468305833 2023-06-18 05:25:44,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=132240.0, ans=0.125 2023-06-18 05:26:08,207 INFO [train.py:996] (2/4) Epoch 1, batch 22050, loss[loss=0.2902, simple_loss=0.3432, pruned_loss=0.1186, over 21369.00 frames. ], tot_loss[loss=0.3219, simple_loss=0.3652, pruned_loss=0.1393, over 4272144.54 frames. ], batch size: 194, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:26:20,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-18 05:26:46,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=132420.0, ans=0.0 2023-06-18 05:27:07,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=132480.0, ans=10.0 2023-06-18 05:27:09,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.56 vs. 
limit=22.5 2023-06-18 05:27:19,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=132480.0, ans=0.2 2023-06-18 05:27:21,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=132540.0, ans=0.125 2023-06-18 05:27:26,122 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:27:48,455 INFO [train.py:996] (2/4) Epoch 1, batch 22100, loss[loss=0.3724, simple_loss=0.4114, pruned_loss=0.1667, over 21252.00 frames. ], tot_loss[loss=0.3377, simple_loss=0.3793, pruned_loss=0.1481, over 4271108.67 frames. ], batch size: 176, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:27:48,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=132600.0, ans=0.125 2023-06-18 05:28:09,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=132600.0, ans=0.125 2023-06-18 05:28:54,535 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.580e+02 4.025e+02 4.912e+02 6.450e+02 1.246e+03, threshold=9.825e+02, percent-clipped=3.0 2023-06-18 05:29:24,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=132840.0, ans=0.125 2023-06-18 05:29:32,055 INFO [train.py:996] (2/4) Epoch 1, batch 22150, loss[loss=0.4218, simple_loss=0.447, pruned_loss=0.1983, over 20653.00 frames. ], tot_loss[loss=0.343, simple_loss=0.3845, pruned_loss=0.1508, over 4266698.05 frames. ], batch size: 607, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:29:46,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=132900.0, ans=0.0 2023-06-18 05:29:57,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=132960.0, ans=0.2 2023-06-18 05:30:02,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-18 05:30:12,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=22.5 2023-06-18 05:30:13,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=133020.0, ans=0.2 2023-06-18 05:30:54,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=133140.0, ans=0.125 2023-06-18 05:31:08,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=133140.0, ans=0.125 2023-06-18 05:31:13,359 INFO [train.py:996] (2/4) Epoch 1, batch 22200, loss[loss=0.3815, simple_loss=0.4178, pruned_loss=0.1726, over 21780.00 frames. ], tot_loss[loss=0.3461, simple_loss=0.3869, pruned_loss=0.1526, over 4275926.51 frames. 
], batch size: 112, lr: 2.55e-02, grad_scale: 32.0 2023-06-18 05:31:49,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=133320.0, ans=0.125 2023-06-18 05:32:13,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2023-06-18 05:32:15,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=133380.0, ans=0.125 2023-06-18 05:32:16,577 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.785e+02 4.029e+02 4.889e+02 6.211e+02 1.093e+03, threshold=9.779e+02, percent-clipped=2.0 2023-06-18 05:32:59,499 INFO [train.py:996] (2/4) Epoch 1, batch 22250, loss[loss=0.3691, simple_loss=0.407, pruned_loss=0.1656, over 21606.00 frames. ], tot_loss[loss=0.3514, simple_loss=0.3938, pruned_loss=0.1545, over 4280284.04 frames. ], batch size: 263, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:33:00,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=133500.0, ans=0.2 2023-06-18 05:33:03,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=133500.0, ans=0.95 2023-06-18 05:34:39,753 INFO [train.py:996] (2/4) Epoch 1, batch 22300, loss[loss=0.3534, simple_loss=0.385, pruned_loss=0.161, over 21300.00 frames. ], tot_loss[loss=0.3531, simple_loss=0.3944, pruned_loss=0.1559, over 4281010.80 frames. ], batch size: 143, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:35:01,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=133860.0, ans=0.125 2023-06-18 05:35:01,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=133860.0, ans=0.125 2023-06-18 05:35:26,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-18 05:35:37,691 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 3.602e+02 4.280e+02 5.421e+02 8.254e+02, threshold=8.559e+02, percent-clipped=0.0 2023-06-18 05:36:20,522 INFO [train.py:996] (2/4) Epoch 1, batch 22350, loss[loss=0.3334, simple_loss=0.3699, pruned_loss=0.1484, over 21495.00 frames. ], tot_loss[loss=0.3529, simple_loss=0.3928, pruned_loss=0.1565, over 4285355.83 frames. ], batch size: 194, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:36:56,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-18 05:36:59,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=134220.0, ans=0.125 2023-06-18 05:37:14,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=134280.0, ans=0.1 2023-06-18 05:38:03,428 INFO [train.py:996] (2/4) Epoch 1, batch 22400, loss[loss=0.3543, simple_loss=0.3859, pruned_loss=0.1614, over 21457.00 frames. ], tot_loss[loss=0.3471, simple_loss=0.3895, pruned_loss=0.1524, over 4282004.74 frames. 
], batch size: 389, lr: 2.54e-02, grad_scale: 32.0 2023-06-18 05:38:08,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=134400.0, ans=0.125 2023-06-18 05:39:07,174 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.472e+02 4.159e+02 5.652e+02 9.879e+02, threshold=8.318e+02, percent-clipped=2.0 2023-06-18 05:39:24,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134640.0, ans=0.1 2023-06-18 05:39:39,884 INFO [train.py:996] (2/4) Epoch 1, batch 22450, loss[loss=0.3477, simple_loss=0.3748, pruned_loss=0.1602, over 21810.00 frames. ], tot_loss[loss=0.342, simple_loss=0.3821, pruned_loss=0.151, over 4278913.34 frames. ], batch size: 372, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:39:40,578 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 05:39:52,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.20 vs. limit=15.0 2023-06-18 05:40:00,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=134760.0, ans=0.1 2023-06-18 05:41:26,135 INFO [train.py:996] (2/4) Epoch 1, batch 22500, loss[loss=0.3928, simple_loss=0.4183, pruned_loss=0.1837, over 21357.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3762, pruned_loss=0.1491, over 4279987.19 frames. ], batch size: 507, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:41:31,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=135000.0, ans=0.125 2023-06-18 05:41:41,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=135000.0, ans=0.0 2023-06-18 05:42:12,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=135120.0, ans=0.5 2023-06-18 05:42:40,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.574e+02 4.487e+02 5.410e+02 9.033e+02, threshold=8.975e+02, percent-clipped=2.0 2023-06-18 05:42:48,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=135180.0, ans=0.125 2023-06-18 05:42:50,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.11 vs. limit=6.0 2023-06-18 05:43:06,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=135240.0, ans=0.2 2023-06-18 05:43:08,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=135300.0, ans=0.0 2023-06-18 05:43:09,827 INFO [train.py:996] (2/4) Epoch 1, batch 22550, loss[loss=0.3726, simple_loss=0.3992, pruned_loss=0.173, over 21556.00 frames. ], tot_loss[loss=0.3383, simple_loss=0.3791, pruned_loss=0.1487, over 4276815.15 frames. 
], batch size: 548, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:43:20,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=135300.0, ans=0.05 2023-06-18 05:44:07,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135420.0, ans=0.1 2023-06-18 05:44:08,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=135420.0, ans=0.125 2023-06-18 05:44:53,012 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.41 vs. limit=10.0 2023-06-18 05:44:59,057 INFO [train.py:996] (2/4) Epoch 1, batch 22600, loss[loss=0.3273, simple_loss=0.375, pruned_loss=0.1398, over 21857.00 frames. ], tot_loss[loss=0.3415, simple_loss=0.3833, pruned_loss=0.1498, over 4281793.44 frames. ], batch size: 298, lr: 2.53e-02, grad_scale: 32.0 2023-06-18 05:45:24,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-18 05:45:52,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=135720.0, ans=0.125 2023-06-18 05:46:07,731 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 4.199e+02 5.117e+02 6.564e+02 1.237e+03, threshold=1.023e+03, percent-clipped=4.0 2023-06-18 05:46:29,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=135840.0, ans=0.125 2023-06-18 05:46:37,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=135840.0, ans=0.1 2023-06-18 05:46:39,953 INFO [train.py:996] (2/4) Epoch 1, batch 22650, loss[loss=0.2736, simple_loss=0.3366, pruned_loss=0.1053, over 19885.00 frames. ], tot_loss[loss=0.3379, simple_loss=0.3786, pruned_loss=0.1487, over 4273711.56 frames. ], batch size: 703, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:48:20,096 INFO [train.py:996] (2/4) Epoch 1, batch 22700, loss[loss=0.3679, simple_loss=0.3777, pruned_loss=0.1791, over 21215.00 frames. ], tot_loss[loss=0.333, simple_loss=0.3726, pruned_loss=0.1467, over 4255168.06 frames. ], batch size: 471, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:49:24,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.785e+02 4.714e+02 6.670e+02 1.093e+03, threshold=9.427e+02, percent-clipped=5.0 2023-06-18 05:49:26,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=136380.0, ans=0.125 2023-06-18 05:49:57,057 INFO [train.py:996] (2/4) Epoch 1, batch 22750, loss[loss=0.3512, simple_loss=0.3905, pruned_loss=0.156, over 21849.00 frames. ], tot_loss[loss=0.3371, simple_loss=0.3744, pruned_loss=0.1498, over 4261499.32 frames. 
], batch size: 247, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:50:43,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=136620.0, ans=0.125 2023-06-18 05:51:02,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=136680.0, ans=0.125 2023-06-18 05:51:02,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=136680.0, ans=0.0 2023-06-18 05:51:16,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-18 05:51:39,030 INFO [train.py:996] (2/4) Epoch 1, batch 22800, loss[loss=0.309, simple_loss=0.3557, pruned_loss=0.1312, over 21754.00 frames. ], tot_loss[loss=0.3431, simple_loss=0.3793, pruned_loss=0.1535, over 4273090.89 frames. ], batch size: 298, lr: 2.52e-02, grad_scale: 32.0 2023-06-18 05:52:39,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=136920.0, ans=0.05 2023-06-18 05:52:45,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.54 vs. limit=22.5 2023-06-18 05:52:47,248 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.251e+02 4.256e+02 5.590e+02 8.268e+02 1.334e+03, threshold=1.118e+03, percent-clipped=16.0 2023-06-18 05:52:51,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.71 vs. limit=15.0 2023-06-18 05:53:20,669 INFO [train.py:996] (2/4) Epoch 1, batch 22850, loss[loss=0.3186, simple_loss=0.3536, pruned_loss=0.1418, over 21858.00 frames. ], tot_loss[loss=0.3382, simple_loss=0.3738, pruned_loss=0.1513, over 4264835.02 frames. ], batch size: 118, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:53:21,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=137100.0, ans=0.125 2023-06-18 05:53:23,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=137100.0, ans=0.0 2023-06-18 05:53:53,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0 2023-06-18 05:54:26,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0 2023-06-18 05:54:32,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=137280.0, ans=0.1 2023-06-18 05:55:09,113 INFO [train.py:996] (2/4) Epoch 1, batch 22900, loss[loss=0.3349, simple_loss=0.4179, pruned_loss=0.1259, over 21781.00 frames. ], tot_loss[loss=0.3364, simple_loss=0.3738, pruned_loss=0.1495, over 4271743.62 frames. 
], batch size: 332, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:55:22,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=137400.0, ans=0.0 2023-06-18 05:55:26,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=137460.0, ans=0.125 2023-06-18 05:55:54,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=137520.0, ans=0.0 2023-06-18 05:56:13,237 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.811e+02 3.596e+02 4.279e+02 5.382e+02 9.756e+02, threshold=8.557e+02, percent-clipped=0.0 2023-06-18 05:56:43,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=137640.0, ans=0.1 2023-06-18 05:56:45,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=137640.0, ans=0.125 2023-06-18 05:56:52,902 INFO [train.py:996] (2/4) Epoch 1, batch 22950, loss[loss=0.3292, simple_loss=0.4229, pruned_loss=0.1178, over 21758.00 frames. ], tot_loss[loss=0.3411, simple_loss=0.3883, pruned_loss=0.1469, over 4279128.07 frames. ], batch size: 332, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:57:26,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-18 05:57:36,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=137820.0, ans=0.0 2023-06-18 05:57:47,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=137820.0, ans=0.125 2023-06-18 05:58:14,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.04 vs. limit=12.0 2023-06-18 05:58:33,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=138000.0, ans=0.1 2023-06-18 05:58:34,356 INFO [train.py:996] (2/4) Epoch 1, batch 23000, loss[loss=0.2926, simple_loss=0.3438, pruned_loss=0.1207, over 21801.00 frames. ], tot_loss[loss=0.3381, simple_loss=0.3897, pruned_loss=0.1432, over 4280513.25 frames. ], batch size: 247, lr: 2.51e-02, grad_scale: 32.0 2023-06-18 05:59:07,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=138060.0, ans=0.2 2023-06-18 05:59:14,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=15.0 2023-06-18 05:59:19,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=138120.0, ans=0.2 2023-06-18 05:59:42,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.447e+02 4.093e+02 5.344e+02 1.227e+03, threshold=8.186e+02, percent-clipped=4.0 2023-06-18 06:00:03,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=138240.0, ans=0.125 2023-06-18 06:00:15,839 INFO [train.py:996] (2/4) Epoch 1, batch 23050, loss[loss=0.3825, simple_loss=0.4174, pruned_loss=0.1738, over 21388.00 frames. 
], tot_loss[loss=0.3433, simple_loss=0.3922, pruned_loss=0.1472, over 4285229.57 frames. ], batch size: 159, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:00:17,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=138300.0, ans=0.1 2023-06-18 06:00:41,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=138360.0, ans=0.04949747468305833 2023-06-18 06:00:41,962 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.43 vs. limit=10.0 2023-06-18 06:00:46,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=138360.0, ans=0.2 2023-06-18 06:01:34,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=138480.0, ans=0.0 2023-06-18 06:01:48,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=138540.0, ans=0.125 2023-06-18 06:02:02,491 INFO [train.py:996] (2/4) Epoch 1, batch 23100, loss[loss=0.3343, simple_loss=0.3575, pruned_loss=0.1556, over 21585.00 frames. ], tot_loss[loss=0.343, simple_loss=0.388, pruned_loss=0.1489, over 4273570.01 frames. ], batch size: 415, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:02:34,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=138660.0, ans=0.125 2023-06-18 06:02:43,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.30 vs. limit=6.0 2023-06-18 06:03:11,218 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.259e+02 3.616e+02 4.226e+02 5.778e+02 1.152e+03, threshold=8.452e+02, percent-clipped=7.0 2023-06-18 06:03:22,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=138840.0, ans=0.015 2023-06-18 06:03:37,798 INFO [train.py:996] (2/4) Epoch 1, batch 23150, loss[loss=0.389, simple_loss=0.4197, pruned_loss=0.1791, over 21877.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3803, pruned_loss=0.147, over 4280672.60 frames. ], batch size: 107, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:03:55,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=138900.0, ans=0.0 2023-06-18 06:04:01,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=138900.0, ans=0.2 2023-06-18 06:05:23,915 INFO [train.py:996] (2/4) Epoch 1, batch 23200, loss[loss=0.3995, simple_loss=0.4178, pruned_loss=0.1906, over 21791.00 frames. ], tot_loss[loss=0.339, simple_loss=0.3797, pruned_loss=0.1492, over 4286629.66 frames. 
], batch size: 441, lr: 2.50e-02, grad_scale: 32.0 2023-06-18 06:05:37,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=139200.0, ans=12.0 2023-06-18 06:05:58,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=139260.0, ans=0.125 2023-06-18 06:06:26,400 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.420e+02 3.602e+02 4.092e+02 5.263e+02 8.445e+02, threshold=8.184e+02, percent-clipped=0.0 2023-06-18 06:06:57,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=139500.0, ans=0.125 2023-06-18 06:06:58,378 INFO [train.py:996] (2/4) Epoch 1, batch 23250, loss[loss=0.4344, simple_loss=0.4385, pruned_loss=0.2151, over 21665.00 frames. ], tot_loss[loss=0.3411, simple_loss=0.38, pruned_loss=0.1511, over 4298230.13 frames. ], batch size: 507, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:07:31,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=139560.0, ans=0.04949747468305833 2023-06-18 06:07:50,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=139620.0, ans=0.125 2023-06-18 06:08:27,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=139740.0, ans=0.0 2023-06-18 06:08:29,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=139740.0, ans=15.0 2023-06-18 06:08:52,739 INFO [train.py:996] (2/4) Epoch 1, batch 23300, loss[loss=0.3426, simple_loss=0.4032, pruned_loss=0.141, over 21183.00 frames. ], tot_loss[loss=0.3476, simple_loss=0.3886, pruned_loss=0.1533, over 4298658.11 frames. ], batch size: 159, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:09:01,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=139800.0, ans=0.125 2023-06-18 06:09:58,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.429e+02 3.912e+02 5.511e+02 7.628e+02 1.360e+03, threshold=1.102e+03, percent-clipped=20.0 2023-06-18 06:10:13,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=139980.0, ans=0.0 2023-06-18 06:10:38,109 INFO [train.py:996] (2/4) Epoch 1, batch 23350, loss[loss=0.2761, simple_loss=0.3406, pruned_loss=0.1057, over 21740.00 frames. ], tot_loss[loss=0.3474, simple_loss=0.3915, pruned_loss=0.1516, over 4280015.17 frames. 
], batch size: 371, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:11:00,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=140160.0, ans=0.125 2023-06-18 06:11:38,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=140280.0, ans=0.125 2023-06-18 06:11:44,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=140280.0, ans=0.2 2023-06-18 06:12:06,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=140340.0, ans=0.125 2023-06-18 06:12:12,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=140340.0, ans=0.125 2023-06-18 06:12:19,199 INFO [train.py:996] (2/4) Epoch 1, batch 23400, loss[loss=0.3212, simple_loss=0.3619, pruned_loss=0.1403, over 21259.00 frames. ], tot_loss[loss=0.3329, simple_loss=0.3797, pruned_loss=0.1431, over 4277511.52 frames. ], batch size: 608, lr: 2.49e-02, grad_scale: 32.0 2023-06-18 06:12:35,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=140460.0, ans=0.0 2023-06-18 06:13:00,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=140520.0, ans=10.0 2023-06-18 06:13:27,641 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.226e+02 4.219e+02 5.285e+02 8.873e+02, threshold=8.438e+02, percent-clipped=0.0 2023-06-18 06:13:28,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=140580.0, ans=0.0 2023-06-18 06:13:48,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=140640.0, ans=0.0 2023-06-18 06:14:00,451 INFO [train.py:996] (2/4) Epoch 1, batch 23450, loss[loss=0.3826, simple_loss=0.4252, pruned_loss=0.17, over 21845.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3812, pruned_loss=0.1466, over 4273976.29 frames. ], batch size: 118, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:14:07,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=140700.0, ans=0.2 2023-06-18 06:14:54,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=140820.0, ans=0.125 2023-06-18 06:15:20,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=140880.0, ans=0.125 2023-06-18 06:15:26,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=140940.0, ans=0.2 2023-06-18 06:15:29,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=140940.0, ans=0.125 2023-06-18 06:15:41,729 INFO [train.py:996] (2/4) Epoch 1, batch 23500, loss[loss=0.3518, simple_loss=0.3845, pruned_loss=0.1595, over 21668.00 frames. ], tot_loss[loss=0.3416, simple_loss=0.3834, pruned_loss=0.1499, over 4285572.18 frames. 
], batch size: 263, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:15:42,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=141000.0, ans=0.0 2023-06-18 06:15:46,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=141000.0, ans=0.2 2023-06-18 06:16:49,416 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.488e+02 3.727e+02 4.969e+02 6.081e+02 9.256e+02, threshold=9.939e+02, percent-clipped=2.0 2023-06-18 06:17:14,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=141240.0, ans=0.125 2023-06-18 06:17:22,062 INFO [train.py:996] (2/4) Epoch 1, batch 23550, loss[loss=0.3137, simple_loss=0.353, pruned_loss=0.1372, over 21745.00 frames. ], tot_loss[loss=0.3376, simple_loss=0.3773, pruned_loss=0.1489, over 4287921.93 frames. ], batch size: 112, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:17:53,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.87 vs. limit=15.0 2023-06-18 06:17:54,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=141360.0, ans=0.125 2023-06-18 06:18:24,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=141420.0, ans=0.125 2023-06-18 06:18:44,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=141480.0, ans=0.0 2023-06-18 06:19:05,146 INFO [train.py:996] (2/4) Epoch 1, batch 23600, loss[loss=0.3804, simple_loss=0.4154, pruned_loss=0.1727, over 21710.00 frames. ], tot_loss[loss=0.3379, simple_loss=0.3783, pruned_loss=0.1487, over 4273068.57 frames. ], batch size: 351, lr: 2.48e-02, grad_scale: 32.0 2023-06-18 06:20:21,068 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.501e+02 3.688e+02 4.463e+02 5.931e+02 8.627e+02, threshold=8.927e+02, percent-clipped=0.0 2023-06-18 06:20:30,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=141780.0, ans=0.125 2023-06-18 06:20:38,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141840.0, ans=0.1 2023-06-18 06:20:59,604 INFO [train.py:996] (2/4) Epoch 1, batch 23650, loss[loss=0.1941, simple_loss=0.2492, pruned_loss=0.06951, over 17042.00 frames. ], tot_loss[loss=0.3368, simple_loss=0.3796, pruned_loss=0.147, over 4274979.74 frames. 
], batch size: 61, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:21:10,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=141900.0, ans=0.1 2023-06-18 06:21:18,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=141960.0, ans=0.0 2023-06-18 06:21:56,638 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 06:21:58,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=142080.0, ans=0.125 2023-06-18 06:22:42,922 INFO [train.py:996] (2/4) Epoch 1, batch 23700, loss[loss=0.372, simple_loss=0.4048, pruned_loss=0.1696, over 21295.00 frames. ], tot_loss[loss=0.336, simple_loss=0.3813, pruned_loss=0.1454, over 4274085.22 frames. ], batch size: 143, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:22:45,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.66 vs. limit=22.5 2023-06-18 06:23:03,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=142260.0, ans=0.125 2023-06-18 06:23:03,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=142260.0, ans=0.125 2023-06-18 06:23:10,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=142260.0, ans=0.1 2023-06-18 06:23:10,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-18 06:23:53,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.745e+02 4.445e+02 5.198e+02 9.027e+02, threshold=8.891e+02, percent-clipped=1.0 2023-06-18 06:23:56,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=142380.0, ans=0.035 2023-06-18 06:24:15,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=15.0 2023-06-18 06:24:18,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=142440.0, ans=0.1 2023-06-18 06:24:32,865 INFO [train.py:996] (2/4) Epoch 1, batch 23750, loss[loss=0.2869, simple_loss=0.3707, pruned_loss=0.1015, over 21891.00 frames. ], tot_loss[loss=0.3391, simple_loss=0.3842, pruned_loss=0.147, over 4277242.63 frames. ], batch size: 372, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:25:15,326 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=8.24 vs. 
limit=12.0 2023-06-18 06:25:24,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=142620.0, ans=0.2 2023-06-18 06:25:31,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=142680.0, ans=0.0 2023-06-18 06:25:51,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=142680.0, ans=0.125 2023-06-18 06:25:53,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-18 06:25:59,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=142740.0, ans=0.0 2023-06-18 06:26:17,142 INFO [train.py:996] (2/4) Epoch 1, batch 23800, loss[loss=0.4065, simple_loss=0.4534, pruned_loss=0.1798, over 21745.00 frames. ], tot_loss[loss=0.3345, simple_loss=0.3823, pruned_loss=0.1433, over 4275084.33 frames. ], batch size: 332, lr: 2.47e-02, grad_scale: 32.0 2023-06-18 06:26:23,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.29 vs. limit=6.0 2023-06-18 06:26:33,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=142800.0, ans=0.0 2023-06-18 06:27:27,678 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.344e+02 4.883e+02 6.088e+02 1.077e+03, threshold=9.766e+02, percent-clipped=8.0 2023-06-18 06:28:06,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.74 vs. limit=22.5 2023-06-18 06:28:06,751 INFO [train.py:996] (2/4) Epoch 1, batch 23850, loss[loss=0.3565, simple_loss=0.411, pruned_loss=0.1511, over 21198.00 frames. ], tot_loss[loss=0.3444, simple_loss=0.3932, pruned_loss=0.1478, over 4275850.33 frames. ], batch size: 143, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:28:57,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=143220.0, ans=0.0 2023-06-18 06:29:06,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=143280.0, ans=0.2 2023-06-18 06:29:15,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=143280.0, ans=0.125 2023-06-18 06:29:19,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=143280.0, ans=0.125 2023-06-18 06:29:44,924 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0 2023-06-18 06:29:48,594 INFO [train.py:996] (2/4) Epoch 1, batch 23900, loss[loss=0.2964, simple_loss=0.3735, pruned_loss=0.1097, over 21344.00 frames. ], tot_loss[loss=0.3525, simple_loss=0.4019, pruned_loss=0.1515, over 4277808.32 frames. 
], batch size: 131, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:30:08,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=143460.0, ans=0.04949747468305833 2023-06-18 06:30:13,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=143460.0, ans=0.125 2023-06-18 06:30:13,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=143460.0, ans=0.04949747468305833 2023-06-18 06:30:28,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=143460.0, ans=0.125 2023-06-18 06:30:44,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=143520.0, ans=0.5 2023-06-18 06:30:56,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.496e+02 3.761e+02 4.724e+02 6.134e+02 1.060e+03, threshold=9.448e+02, percent-clipped=2.0 2023-06-18 06:31:30,073 INFO [train.py:996] (2/4) Epoch 1, batch 23950, loss[loss=0.2993, simple_loss=0.3418, pruned_loss=0.1284, over 21641.00 frames. ], tot_loss[loss=0.3479, simple_loss=0.3945, pruned_loss=0.1506, over 4274269.58 frames. ], batch size: 282, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:31:50,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=143760.0, ans=0.125 2023-06-18 06:32:23,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=143820.0, ans=0.0 2023-06-18 06:32:34,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=143880.0, ans=0.05 2023-06-18 06:33:13,949 INFO [train.py:996] (2/4) Epoch 1, batch 24000, loss[loss=0.3826, simple_loss=0.4163, pruned_loss=0.1745, over 21708.00 frames. ], tot_loss[loss=0.3537, simple_loss=0.3971, pruned_loss=0.1551, over 4278336.82 frames. ], batch size: 298, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:33:13,949 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 06:33:36,588 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.32, simple_loss=0.4122, pruned_loss=0.1139, over 1796401.00 frames. 2023-06-18 06:33:36,588 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 06:33:49,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=144000.0, ans=0.0 2023-06-18 06:34:48,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.687e+02 4.611e+02 5.908e+02 1.149e+03, threshold=9.222e+02, percent-clipped=2.0 2023-06-18 06:35:20,191 INFO [train.py:996] (2/4) Epoch 1, batch 24050, loss[loss=0.2369, simple_loss=0.3088, pruned_loss=0.08248, over 21422.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.398, pruned_loss=0.1555, over 4272401.91 frames. 
], batch size: 176, lr: 2.46e-02, grad_scale: 32.0 2023-06-18 06:35:22,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=144300.0, ans=0.125 2023-06-18 06:35:39,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=144300.0, ans=0.125 2023-06-18 06:35:42,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=144360.0, ans=0.125 2023-06-18 06:36:13,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=144420.0, ans=0.0 2023-06-18 06:36:31,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=144480.0, ans=0.125 2023-06-18 06:37:07,464 INFO [train.py:996] (2/4) Epoch 1, batch 24100, loss[loss=0.3267, simple_loss=0.3842, pruned_loss=0.1346, over 21786.00 frames. ], tot_loss[loss=0.35, simple_loss=0.3972, pruned_loss=0.1514, over 4271046.10 frames. ], batch size: 247, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:37:48,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144720.0, ans=0.1 2023-06-18 06:38:12,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 3.321e+02 4.048e+02 5.410e+02 1.299e+03, threshold=8.096e+02, percent-clipped=1.0 2023-06-18 06:38:44,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=144840.0, ans=0.1 2023-06-18 06:38:49,131 INFO [train.py:996] (2/4) Epoch 1, batch 24150, loss[loss=0.345, simple_loss=0.3786, pruned_loss=0.1557, over 21505.00 frames. ], tot_loss[loss=0.3521, simple_loss=0.3967, pruned_loss=0.1537, over 4276897.99 frames. ], batch size: 194, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:39:30,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=144960.0, ans=0.125 2023-06-18 06:39:37,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=145020.0, ans=0.0 2023-06-18 06:39:49,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-18 06:40:03,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.44 vs. limit=10.0 2023-06-18 06:40:04,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=145080.0, ans=0.1 2023-06-18 06:40:13,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-18 06:40:23,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=145140.0, ans=0.0 2023-06-18 06:40:31,831 INFO [train.py:996] (2/4) Epoch 1, batch 24200, loss[loss=0.3232, simple_loss=0.3811, pruned_loss=0.1326, over 21678.00 frames. ], tot_loss[loss=0.3523, simple_loss=0.3969, pruned_loss=0.1539, over 4282905.93 frames. 
], batch size: 247, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:41:49,443 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.922e+02 3.664e+02 4.494e+02 5.781e+02 1.168e+03, threshold=8.988e+02, percent-clipped=4.0 2023-06-18 06:42:10,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=145440.0, ans=0.2 2023-06-18 06:42:21,384 INFO [train.py:996] (2/4) Epoch 1, batch 24250, loss[loss=0.3045, simple_loss=0.3856, pruned_loss=0.1118, over 21693.00 frames. ], tot_loss[loss=0.3395, simple_loss=0.3918, pruned_loss=0.1436, over 4272586.60 frames. ], batch size: 414, lr: 2.45e-02, grad_scale: 32.0 2023-06-18 06:42:28,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=145500.0, ans=0.125 2023-06-18 06:42:50,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=145560.0, ans=0.015 2023-06-18 06:44:01,971 INFO [train.py:996] (2/4) Epoch 1, batch 24300, loss[loss=0.1789, simple_loss=0.2478, pruned_loss=0.05501, over 21128.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3813, pruned_loss=0.1346, over 4273156.49 frames. ], batch size: 143, lr: 2.44e-02, grad_scale: 16.0 2023-06-18 06:45:01,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=145920.0, ans=0.2 2023-06-18 06:45:13,978 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.643e+02 3.046e+02 3.863e+02 5.440e+02 1.504e+03, threshold=7.726e+02, percent-clipped=4.0 2023-06-18 06:45:22,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=146040.0, ans=0.125 2023-06-18 06:45:43,172 INFO [train.py:996] (2/4) Epoch 1, batch 24350, loss[loss=0.3501, simple_loss=0.388, pruned_loss=0.1561, over 21349.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.378, pruned_loss=0.1357, over 4281139.82 frames. ], batch size: 176, lr: 2.44e-02, grad_scale: 16.0 2023-06-18 06:45:49,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=146100.0, ans=0.125 2023-06-18 06:46:24,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-18 06:47:32,121 INFO [train.py:996] (2/4) Epoch 1, batch 24400, loss[loss=0.386, simple_loss=0.4797, pruned_loss=0.1461, over 19757.00 frames. ], tot_loss[loss=0.3349, simple_loss=0.3854, pruned_loss=0.1422, over 4283895.48 frames. 
], batch size: 702, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 06:48:14,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=146520.0, ans=0.125 2023-06-18 06:48:44,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.695e+02 4.218e+02 5.437e+02 7.202e+02 1.402e+03, threshold=1.087e+03, percent-clipped=21.0 2023-06-18 06:48:59,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=146640.0, ans=0.125 2023-06-18 06:49:13,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=146700.0, ans=0.0 2023-06-18 06:49:14,786 INFO [train.py:996] (2/4) Epoch 1, batch 24450, loss[loss=0.3114, simple_loss=0.3724, pruned_loss=0.1252, over 21635.00 frames. ], tot_loss[loss=0.3412, simple_loss=0.3908, pruned_loss=0.1458, over 4281462.24 frames. ], batch size: 247, lr: 2.44e-02, grad_scale: 32.0 2023-06-18 06:49:34,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=146700.0, ans=0.2 2023-06-18 06:49:55,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=146760.0, ans=0.125 2023-06-18 06:50:21,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-18 06:50:55,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=147000.0, ans=0.5 2023-06-18 06:50:56,344 INFO [train.py:996] (2/4) Epoch 1, batch 24500, loss[loss=0.3477, simple_loss=0.3892, pruned_loss=0.1532, over 21921.00 frames. ], tot_loss[loss=0.3399, simple_loss=0.3895, pruned_loss=0.1451, over 4284789.65 frames. ], batch size: 351, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:50:58,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=147000.0, ans=0.0 2023-06-18 06:50:59,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-18 06:51:10,484 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 06:51:40,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.84 vs. 
limit=15.0 2023-06-18 06:51:55,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=147120.0, ans=0.125 2023-06-18 06:51:57,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=147120.0, ans=0.125 2023-06-18 06:52:14,946 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.850e+02 5.028e+02 6.051e+02 9.604e+02, threshold=1.006e+03, percent-clipped=0.0 2023-06-18 06:52:16,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=147180.0, ans=0.2 2023-06-18 06:52:24,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=147240.0, ans=0.0 2023-06-18 06:52:26,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=147240.0, ans=0.125 2023-06-18 06:52:42,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.88 vs. limit=10.0 2023-06-18 06:52:44,343 INFO [train.py:996] (2/4) Epoch 1, batch 24550, loss[loss=0.3996, simple_loss=0.4325, pruned_loss=0.1833, over 21485.00 frames. ], tot_loss[loss=0.3472, simple_loss=0.3941, pruned_loss=0.1501, over 4286336.24 frames. ], batch size: 211, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:53:22,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=147420.0, ans=0.0 2023-06-18 06:53:54,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=147480.0, ans=0.2 2023-06-18 06:54:00,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=147480.0, ans=0.125 2023-06-18 06:54:26,346 INFO [train.py:996] (2/4) Epoch 1, batch 24600, loss[loss=0.3617, simple_loss=0.4, pruned_loss=0.1617, over 21198.00 frames. ], tot_loss[loss=0.3448, simple_loss=0.3886, pruned_loss=0.1505, over 4275599.73 frames. ], batch size: 143, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:54:45,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=147600.0, ans=0.125 2023-06-18 06:54:57,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=147660.0, ans=0.0 2023-06-18 06:55:19,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=147720.0, ans=0.1 2023-06-18 06:55:31,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.92 vs. limit=12.0 2023-06-18 06:55:38,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.603e+02 4.230e+02 5.450e+02 1.074e+03, threshold=8.460e+02, percent-clipped=1.0 2023-06-18 06:56:07,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=147900.0, ans=0.125 2023-06-18 06:56:08,489 INFO [train.py:996] (2/4) Epoch 1, batch 24650, loss[loss=0.269, simple_loss=0.3079, pruned_loss=0.1151, over 21488.00 frames. ], tot_loss[loss=0.336, simple_loss=0.3779, pruned_loss=0.147, over 4262141.93 frames. 
], batch size: 213, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:56:28,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.20 vs. limit=22.5 2023-06-18 06:56:50,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=12.0 2023-06-18 06:57:18,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148080.0, ans=0.1 2023-06-18 06:57:24,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148080.0, ans=0.1 2023-06-18 06:57:50,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-18 06:57:50,938 INFO [train.py:996] (2/4) Epoch 1, batch 24700, loss[loss=0.2875, simple_loss=0.3376, pruned_loss=0.1187, over 21828.00 frames. ], tot_loss[loss=0.3328, simple_loss=0.3765, pruned_loss=0.1445, over 4263666.51 frames. ], batch size: 118, lr: 2.43e-02, grad_scale: 32.0 2023-06-18 06:59:03,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.227e+02 3.816e+02 4.904e+02 7.765e+02, threshold=7.633e+02, percent-clipped=0.0 2023-06-18 06:59:03,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=148380.0, ans=0.0 2023-06-18 06:59:07,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-18 06:59:32,579 INFO [train.py:996] (2/4) Epoch 1, batch 24750, loss[loss=0.2646, simple_loss=0.3084, pruned_loss=0.1104, over 21218.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3679, pruned_loss=0.1392, over 4257895.58 frames. ], batch size: 549, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 06:59:33,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-18 07:00:56,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=148740.0, ans=0.1 2023-06-18 07:00:59,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=148740.0, ans=0.125 2023-06-18 07:01:14,224 INFO [train.py:996] (2/4) Epoch 1, batch 24800, loss[loss=0.333, simple_loss=0.3687, pruned_loss=0.1486, over 21450.00 frames. ], tot_loss[loss=0.3212, simple_loss=0.3644, pruned_loss=0.139, over 4262011.45 frames. 
], batch size: 211, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 07:01:27,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=148800.0, ans=0.125 2023-06-18 07:01:32,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=148800.0, ans=0.125 2023-06-18 07:02:08,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=148920.0, ans=0.2 2023-06-18 07:02:22,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=148980.0, ans=0.125 2023-06-18 07:02:27,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.589e+02 4.591e+02 5.888e+02 8.855e+02, threshold=9.183e+02, percent-clipped=11.0 2023-06-18 07:02:47,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=149040.0, ans=0.2 2023-06-18 07:02:56,558 INFO [train.py:996] (2/4) Epoch 1, batch 24850, loss[loss=0.3845, simple_loss=0.4208, pruned_loss=0.1741, over 21738.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3664, pruned_loss=0.1426, over 4267698.98 frames. ], batch size: 414, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 07:03:12,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=149100.0, ans=0.2 2023-06-18 07:03:21,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149160.0, ans=0.1 2023-06-18 07:04:06,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. limit=15.0 2023-06-18 07:04:39,623 INFO [train.py:996] (2/4) Epoch 1, batch 24900, loss[loss=0.3776, simple_loss=0.418, pruned_loss=0.1686, over 21410.00 frames. ], tot_loss[loss=0.3297, simple_loss=0.3706, pruned_loss=0.1444, over 4266592.06 frames. ], batch size: 143, lr: 2.42e-02, grad_scale: 32.0 2023-06-18 07:05:11,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149460.0, ans=0.1 2023-06-18 07:05:29,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-18 07:05:31,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=149520.0, ans=0.125 2023-06-18 07:05:53,722 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.811e+02 4.758e+02 6.118e+02 1.056e+03, threshold=9.515e+02, percent-clipped=2.0 2023-06-18 07:06:07,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=149640.0, ans=0.2 2023-06-18 07:06:23,758 INFO [train.py:996] (2/4) Epoch 1, batch 24950, loss[loss=0.3981, simple_loss=0.4263, pruned_loss=0.1849, over 21300.00 frames. ], tot_loss[loss=0.3388, simple_loss=0.3793, pruned_loss=0.1492, over 4264211.97 frames. ], batch size: 548, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:07:10,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.46 vs. 
limit=15.0 2023-06-18 07:07:16,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=149820.0, ans=0.0 2023-06-18 07:07:52,421 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:08:00,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=149940.0, ans=0.1 2023-06-18 07:08:08,500 INFO [train.py:996] (2/4) Epoch 1, batch 25000, loss[loss=0.3233, simple_loss=0.3647, pruned_loss=0.141, over 21836.00 frames. ], tot_loss[loss=0.348, simple_loss=0.3889, pruned_loss=0.1536, over 4255541.06 frames. ], batch size: 107, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:08:22,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=150000.0, ans=0.125 2023-06-18 07:08:40,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=150060.0, ans=0.2 2023-06-18 07:08:45,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.62 vs. limit=15.0 2023-06-18 07:09:27,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.86 vs. limit=15.0 2023-06-18 07:09:27,599 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.657e+02 3.412e+02 4.030e+02 5.230e+02 1.013e+03, threshold=8.059e+02, percent-clipped=2.0 2023-06-18 07:09:37,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=150240.0, ans=0.2 2023-06-18 07:09:47,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=150240.0, ans=0.2 2023-06-18 07:09:57,196 INFO [train.py:996] (2/4) Epoch 1, batch 25050, loss[loss=0.2894, simple_loss=0.3322, pruned_loss=0.1233, over 21637.00 frames. ], tot_loss[loss=0.34, simple_loss=0.3796, pruned_loss=0.1502, over 4257602.76 frames. ], batch size: 298, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:10:27,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=150360.0, ans=0.125 2023-06-18 07:11:12,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=150480.0, ans=0.0 2023-06-18 07:11:17,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=150480.0, ans=0.035 2023-06-18 07:11:40,512 INFO [train.py:996] (2/4) Epoch 1, batch 25100, loss[loss=0.2789, simple_loss=0.3271, pruned_loss=0.1153, over 21600.00 frames. ], tot_loss[loss=0.3335, simple_loss=0.3722, pruned_loss=0.1473, over 4265640.46 frames. 
], batch size: 298, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:11:44,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=150600.0, ans=0.0 2023-06-18 07:11:50,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=150600.0, ans=0.0 2023-06-18 07:12:09,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=150660.0, ans=0.0 2023-06-18 07:12:34,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=150720.0, ans=0.125 2023-06-18 07:12:43,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-18 07:12:52,243 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.624e+02 4.936e+02 6.636e+02 1.221e+03, threshold=9.872e+02, percent-clipped=16.0 2023-06-18 07:12:52,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=150780.0, ans=0.1 2023-06-18 07:13:05,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=150840.0, ans=0.2 2023-06-18 07:13:15,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2023-06-18 07:13:16,024 INFO [train.py:996] (2/4) Epoch 1, batch 25150, loss[loss=0.2994, simple_loss=0.3569, pruned_loss=0.121, over 21507.00 frames. ], tot_loss[loss=0.3294, simple_loss=0.3729, pruned_loss=0.143, over 4263467.44 frames. ], batch size: 211, lr: 2.41e-02, grad_scale: 32.0 2023-06-18 07:14:24,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=151080.0, ans=0.125 2023-06-18 07:14:33,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151080.0, ans=0.1 2023-06-18 07:14:35,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-18 07:14:47,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=151140.0, ans=0.2 2023-06-18 07:14:56,933 INFO [train.py:996] (2/4) Epoch 1, batch 25200, loss[loss=0.3426, simple_loss=0.4127, pruned_loss=0.1362, over 21669.00 frames. ], tot_loss[loss=0.3255, simple_loss=0.372, pruned_loss=0.1395, over 4259725.72 frames. ], batch size: 441, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:16:14,343 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.108e+02 3.238e+02 4.132e+02 5.215e+02 8.390e+02, threshold=8.263e+02, percent-clipped=0.0 2023-06-18 07:16:38,457 INFO [train.py:996] (2/4) Epoch 1, batch 25250, loss[loss=0.3282, simple_loss=0.363, pruned_loss=0.1467, over 21583.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3701, pruned_loss=0.138, over 4249759.58 frames. 
], batch size: 415, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:16:50,341 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:17:00,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=151560.0, ans=0.0 2023-06-18 07:17:23,255 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-18 07:17:28,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=151620.0, ans=0.0 2023-06-18 07:17:45,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=151680.0, ans=0.125 2023-06-18 07:18:21,294 INFO [train.py:996] (2/4) Epoch 1, batch 25300, loss[loss=0.316, simple_loss=0.3358, pruned_loss=0.1481, over 20162.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3655, pruned_loss=0.1371, over 4241642.39 frames. ], batch size: 703, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:19:01,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=151860.0, ans=0.1 2023-06-18 07:19:28,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=151980.0, ans=0.025 2023-06-18 07:19:39,723 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.535e+02 3.569e+02 4.461e+02 5.778e+02 9.355e+02, threshold=8.922e+02, percent-clipped=5.0 2023-06-18 07:19:51,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=152040.0, ans=0.125 2023-06-18 07:20:03,974 INFO [train.py:996] (2/4) Epoch 1, batch 25350, loss[loss=0.3108, simple_loss=0.3508, pruned_loss=0.1354, over 20106.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.3675, pruned_loss=0.1367, over 4228848.58 frames. ], batch size: 703, lr: 2.40e-02, grad_scale: 32.0 2023-06-18 07:20:16,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=152100.0, ans=0.125 2023-06-18 07:20:29,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-18 07:21:07,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=152280.0, ans=0.2 2023-06-18 07:21:20,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=152280.0, ans=0.95 2023-06-18 07:21:31,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=152340.0, ans=0.125 2023-06-18 07:21:39,303 INFO [train.py:996] (2/4) Epoch 1, batch 25400, loss[loss=0.3118, simple_loss=0.3947, pruned_loss=0.1145, over 20768.00 frames. ], tot_loss[loss=0.3171, simple_loss=0.3639, pruned_loss=0.1351, over 4230159.89 frames. 
], batch size: 607, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:22:56,316 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.405e+02 3.535e+02 4.232e+02 5.710e+02 1.225e+03, threshold=8.465e+02, percent-clipped=5.0 2023-06-18 07:23:15,236 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:23:20,774 INFO [train.py:996] (2/4) Epoch 1, batch 25450, loss[loss=0.4009, simple_loss=0.4159, pruned_loss=0.193, over 21747.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3657, pruned_loss=0.1376, over 4229919.98 frames. ], batch size: 508, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:23:29,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=152700.0, ans=0.125 2023-06-18 07:24:06,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=152820.0, ans=0.125 2023-06-18 07:24:08,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=152820.0, ans=0.1 2023-06-18 07:24:08,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=152820.0, ans=0.2 2023-06-18 07:24:31,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=152880.0, ans=0.04949747468305833 2023-06-18 07:24:41,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=152880.0, ans=0.125 2023-06-18 07:24:51,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-18 07:25:04,166 INFO [train.py:996] (2/4) Epoch 1, batch 25500, loss[loss=0.3096, simple_loss=0.3671, pruned_loss=0.126, over 21419.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3669, pruned_loss=0.1346, over 4240033.37 frames. ], batch size: 211, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:26:22,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 3.487e+02 4.551e+02 5.429e+02 1.003e+03, threshold=9.102e+02, percent-clipped=2.0 2023-06-18 07:26:26,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=153180.0, ans=0.1 2023-06-18 07:26:41,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=153240.0, ans=0.0 2023-06-18 07:26:42,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=22.5 2023-06-18 07:26:51,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-18 07:26:52,246 INFO [train.py:996] (2/4) Epoch 1, batch 25550, loss[loss=0.3107, simple_loss=0.3848, pruned_loss=0.1183, over 21806.00 frames. ], tot_loss[loss=0.3224, simple_loss=0.3742, pruned_loss=0.1354, over 4244270.66 frames. 
], batch size: 282, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:28:08,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=153540.0, ans=0.0 2023-06-18 07:28:23,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=153540.0, ans=0.125 2023-06-18 07:28:34,446 INFO [train.py:996] (2/4) Epoch 1, batch 25600, loss[loss=0.3987, simple_loss=0.4334, pruned_loss=0.1821, over 21717.00 frames. ], tot_loss[loss=0.3273, simple_loss=0.3795, pruned_loss=0.1375, over 4256856.25 frames. ], batch size: 351, lr: 2.39e-02, grad_scale: 32.0 2023-06-18 07:28:35,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=153600.0, ans=0.1 2023-06-18 07:29:36,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.502e+02 4.172e+02 4.983e+02 8.051e+02, threshold=8.344e+02, percent-clipped=0.0 2023-06-18 07:29:49,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-18 07:29:50,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.26 vs. limit=10.0 2023-06-18 07:30:10,888 INFO [train.py:996] (2/4) Epoch 1, batch 25650, loss[loss=0.3064, simple_loss=0.3655, pruned_loss=0.1236, over 20863.00 frames. ], tot_loss[loss=0.3309, simple_loss=0.3802, pruned_loss=0.1409, over 4253917.86 frames. ], batch size: 608, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:30:23,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=153900.0, ans=0.125 2023-06-18 07:31:46,154 INFO [train.py:996] (2/4) Epoch 1, batch 25700, loss[loss=0.43, simple_loss=0.4857, pruned_loss=0.1872, over 19842.00 frames. ], tot_loss[loss=0.332, simple_loss=0.3784, pruned_loss=0.1428, over 4250753.06 frames. ], batch size: 702, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:32:31,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=154320.0, ans=10.0 2023-06-18 07:33:00,071 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.737e+02 4.412e+02 5.511e+02 6.649e+02 1.111e+03, threshold=1.102e+03, percent-clipped=12.0 2023-06-18 07:33:30,731 INFO [train.py:996] (2/4) Epoch 1, batch 25750, loss[loss=0.4824, simple_loss=0.5412, pruned_loss=0.2118, over 19915.00 frames. ], tot_loss[loss=0.341, simple_loss=0.3858, pruned_loss=0.1481, over 4260917.51 frames. 
], batch size: 702, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:33:47,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=154500.0, ans=0.125 2023-06-18 07:34:17,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=154620.0, ans=0.125 2023-06-18 07:34:18,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=154620.0, ans=0.125 2023-06-18 07:34:20,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=154620.0, ans=0.2 2023-06-18 07:34:27,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=154620.0, ans=0.125 2023-06-18 07:34:41,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.45 vs. limit=15.0 2023-06-18 07:35:16,168 INFO [train.py:996] (2/4) Epoch 1, batch 25800, loss[loss=0.4604, simple_loss=0.476, pruned_loss=0.2224, over 21361.00 frames. ], tot_loss[loss=0.3545, simple_loss=0.3996, pruned_loss=0.1547, over 4263168.30 frames. ], batch size: 507, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:35:26,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=154800.0, ans=0.125 2023-06-18 07:35:41,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=154860.0, ans=10.0 2023-06-18 07:36:28,469 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.698e+02 3.746e+02 4.301e+02 5.401e+02 1.441e+03, threshold=8.601e+02, percent-clipped=2.0 2023-06-18 07:36:44,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=155040.0, ans=0.125 2023-06-18 07:36:45,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=155040.0, ans=0.1 2023-06-18 07:36:57,513 INFO [train.py:996] (2/4) Epoch 1, batch 25850, loss[loss=0.3195, simple_loss=0.3522, pruned_loss=0.1434, over 21359.00 frames. ], tot_loss[loss=0.3544, simple_loss=0.4013, pruned_loss=0.1537, over 4261820.49 frames. ], batch size: 176, lr: 2.38e-02, grad_scale: 32.0 2023-06-18 07:37:12,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=155100.0, ans=0.0 2023-06-18 07:37:16,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=155100.0, ans=0.04949747468305833 2023-06-18 07:37:17,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=22.5 2023-06-18 07:38:04,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=155280.0, ans=6.0 2023-06-18 07:38:46,027 INFO [train.py:996] (2/4) Epoch 1, batch 25900, loss[loss=0.3381, simple_loss=0.4046, pruned_loss=0.1358, over 21575.00 frames. ], tot_loss[loss=0.3558, simple_loss=0.4027, pruned_loss=0.1545, over 4269701.91 frames. 
], batch size: 230, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:38:59,507 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:39:11,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=155460.0, ans=0.0 2023-06-18 07:39:16,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=155460.0, ans=0.2 2023-06-18 07:39:58,794 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.678e+02 3.705e+02 4.419e+02 5.739e+02 1.257e+03, threshold=8.839e+02, percent-clipped=5.0 2023-06-18 07:40:28,239 INFO [train.py:996] (2/4) Epoch 1, batch 25950, loss[loss=0.3442, simple_loss=0.3952, pruned_loss=0.1465, over 21706.00 frames. ], tot_loss[loss=0.3616, simple_loss=0.4081, pruned_loss=0.1575, over 4279065.00 frames. ], batch size: 298, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:40:28,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=155700.0, ans=0.125 2023-06-18 07:40:57,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=155760.0, ans=0.015 2023-06-18 07:41:20,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=155820.0, ans=0.0 2023-06-18 07:41:22,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=155820.0, ans=0.0 2023-06-18 07:41:38,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=155880.0, ans=0.0 2023-06-18 07:42:10,856 INFO [train.py:996] (2/4) Epoch 1, batch 26000, loss[loss=0.3769, simple_loss=0.426, pruned_loss=0.1639, over 21621.00 frames. ], tot_loss[loss=0.361, simple_loss=0.4095, pruned_loss=0.1563, over 4286333.62 frames. ], batch size: 263, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:42:39,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.61 vs. limit=12.0 2023-06-18 07:43:02,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=156120.0, ans=0.125 2023-06-18 07:43:27,132 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.511e+02 4.125e+02 5.678e+02 8.372e+02, threshold=8.249e+02, percent-clipped=0.0 2023-06-18 07:43:39,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=156240.0, ans=0.125 2023-06-18 07:43:41,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=156240.0, ans=0.125 2023-06-18 07:43:51,441 INFO [train.py:996] (2/4) Epoch 1, batch 26050, loss[loss=0.363, simple_loss=0.3865, pruned_loss=0.1698, over 21379.00 frames. ], tot_loss[loss=0.3618, simple_loss=0.4088, pruned_loss=0.1574, over 4288570.30 frames. ], batch size: 176, lr: 2.37e-02, grad_scale: 32.0 2023-06-18 07:43:54,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.37 vs. 
limit=15.0 2023-06-18 07:44:34,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=156360.0, ans=0.2 2023-06-18 07:44:39,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=156420.0, ans=0.125 2023-06-18 07:44:46,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=156420.0, ans=0.05 2023-06-18 07:44:49,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-06-18 07:44:50,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=156420.0, ans=0.125 2023-06-18 07:45:14,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.82 vs. limit=6.0 2023-06-18 07:45:31,175 INFO [train.py:996] (2/4) Epoch 1, batch 26100, loss[loss=0.356, simple_loss=0.3997, pruned_loss=0.1562, over 21899.00 frames. ], tot_loss[loss=0.3563, simple_loss=0.4011, pruned_loss=0.1558, over 4297013.62 frames. ], batch size: 107, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:46:14,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-18 07:46:18,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156720.0, ans=0.1 2023-06-18 07:46:48,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.762e+02 4.665e+02 5.349e+02 1.153e+03, threshold=9.330e+02, percent-clipped=6.0 2023-06-18 07:46:56,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=156840.0, ans=0.1 2023-06-18 07:47:02,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=156840.0, ans=0.0 2023-06-18 07:47:02,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-18 07:47:12,628 INFO [train.py:996] (2/4) Epoch 1, batch 26150, loss[loss=0.3635, simple_loss=0.4, pruned_loss=0.1635, over 21336.00 frames. ], tot_loss[loss=0.3552, simple_loss=0.3981, pruned_loss=0.1561, over 4302554.21 frames. ], batch size: 548, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:47:51,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=156960.0, ans=0.125 2023-06-18 07:47:55,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.11 vs. limit=15.0 2023-06-18 07:48:53,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=157140.0, ans=0.125 2023-06-18 07:48:56,360 INFO [train.py:996] (2/4) Epoch 1, batch 26200, loss[loss=0.3116, simple_loss=0.396, pruned_loss=0.1136, over 21817.00 frames. ], tot_loss[loss=0.3506, simple_loss=0.3969, pruned_loss=0.1522, over 4295325.95 frames. 
], batch size: 282, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:49:36,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=157260.0, ans=0.125 2023-06-18 07:50:09,835 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.511e+02 3.383e+02 4.279e+02 5.483e+02 1.348e+03, threshold=8.558e+02, percent-clipped=4.0 2023-06-18 07:50:36,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=157440.0, ans=0.125 2023-06-18 07:50:50,287 INFO [train.py:996] (2/4) Epoch 1, batch 26250, loss[loss=0.3551, simple_loss=0.3919, pruned_loss=0.1591, over 21491.00 frames. ], tot_loss[loss=0.355, simple_loss=0.4051, pruned_loss=0.1525, over 4289332.46 frames. ], batch size: 211, lr: 2.36e-02, grad_scale: 32.0 2023-06-18 07:51:07,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157560.0, ans=0.1 2023-06-18 07:51:34,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=157620.0, ans=0.0 2023-06-18 07:51:39,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=157620.0, ans=0.2 2023-06-18 07:51:51,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=157680.0, ans=0.125 2023-06-18 07:51:52,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-18 07:51:53,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=157680.0, ans=0.2 2023-06-18 07:51:56,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=157680.0, ans=0.125 2023-06-18 07:51:56,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=157680.0, ans=0.125 2023-06-18 07:52:03,936 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:52:06,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.36 vs. limit=15.0 2023-06-18 07:52:22,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=157740.0, ans=0.125 2023-06-18 07:52:24,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=157740.0, ans=0.1 2023-06-18 07:52:31,388 INFO [train.py:996] (2/4) Epoch 1, batch 26300, loss[loss=0.3954, simple_loss=0.4137, pruned_loss=0.1886, over 21624.00 frames. ], tot_loss[loss=0.3526, simple_loss=0.3999, pruned_loss=0.1527, over 4300738.93 frames. 
], batch size: 471, lr: 2.36e-02, grad_scale: 64.0 2023-06-18 07:53:04,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=157860.0, ans=0.1 2023-06-18 07:53:38,870 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 3.639e+02 4.284e+02 5.347e+02 9.355e+02, threshold=8.568e+02, percent-clipped=1.0 2023-06-18 07:54:09,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-06-18 07:54:13,135 INFO [train.py:996] (2/4) Epoch 1, batch 26350, loss[loss=0.3802, simple_loss=0.4164, pruned_loss=0.172, over 21456.00 frames. ], tot_loss[loss=0.353, simple_loss=0.3986, pruned_loss=0.1537, over 4302341.68 frames. ], batch size: 131, lr: 2.35e-02, grad_scale: 64.0 2023-06-18 07:54:59,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=15.0 2023-06-18 07:55:54,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=158400.0, ans=0.125 2023-06-18 07:55:55,507 INFO [train.py:996] (2/4) Epoch 1, batch 26400, loss[loss=0.333, simple_loss=0.3585, pruned_loss=0.1538, over 21677.00 frames. ], tot_loss[loss=0.3498, simple_loss=0.3924, pruned_loss=0.1536, over 4292500.64 frames. ], batch size: 333, lr: 2.35e-02, grad_scale: 64.0 2023-06-18 07:56:12,279 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 07:56:54,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=158580.0, ans=0.0 2023-06-18 07:57:05,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=158580.0, ans=0.125 2023-06-18 07:57:16,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.627e+02 4.358e+02 5.298e+02 1.261e+03, threshold=8.716e+02, percent-clipped=4.0 2023-06-18 07:57:21,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=158640.0, ans=0.0 2023-06-18 07:57:44,283 INFO [train.py:996] (2/4) Epoch 1, batch 26450, loss[loss=0.4097, simple_loss=0.4932, pruned_loss=0.1631, over 21185.00 frames. ], tot_loss[loss=0.3491, simple_loss=0.3919, pruned_loss=0.1532, over 4290834.01 frames. ], batch size: 549, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 07:58:05,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=158760.0, ans=0.125 2023-06-18 07:58:10,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5 2023-06-18 07:58:11,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=158760.0, ans=0.0 2023-06-18 07:58:49,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=158880.0, ans=0.125 2023-06-18 07:59:28,096 INFO [train.py:996] (2/4) Epoch 1, batch 26500, loss[loss=0.2465, simple_loss=0.2966, pruned_loss=0.09817, over 21265.00 frames. ], tot_loss[loss=0.3457, simple_loss=0.3915, pruned_loss=0.15, over 4280291.17 frames. 
], batch size: 176, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 07:59:28,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=159000.0, ans=0.2 2023-06-18 07:59:55,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159060.0, ans=0.1 2023-06-18 08:00:21,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=159120.0, ans=0.125 2023-06-18 08:00:31,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=159120.0, ans=0.125 2023-06-18 08:00:49,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.803e+02 4.749e+02 6.034e+02 1.314e+03, threshold=9.498e+02, percent-clipped=6.0 2023-06-18 08:00:50,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=159180.0, ans=0.1 2023-06-18 08:01:06,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.99 vs. limit=15.0 2023-06-18 08:01:13,502 INFO [train.py:996] (2/4) Epoch 1, batch 26550, loss[loss=0.2304, simple_loss=0.286, pruned_loss=0.08746, over 21239.00 frames. ], tot_loss[loss=0.3369, simple_loss=0.3852, pruned_loss=0.1443, over 4266174.96 frames. ], batch size: 176, lr: 2.35e-02, grad_scale: 32.0 2023-06-18 08:01:14,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=159300.0, ans=0.09899494936611666 2023-06-18 08:01:52,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=159360.0, ans=0.1 2023-06-18 08:02:13,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=159420.0, ans=0.0 2023-06-18 08:02:40,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=159540.0, ans=0.125 2023-06-18 08:03:00,179 INFO [train.py:996] (2/4) Epoch 1, batch 26600, loss[loss=0.3342, simple_loss=0.383, pruned_loss=0.1427, over 21693.00 frames. ], tot_loss[loss=0.333, simple_loss=0.3852, pruned_loss=0.1404, over 4268699.34 frames. ], batch size: 282, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:03:18,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-18 08:03:23,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=159660.0, ans=0.125 2023-06-18 08:03:23,533 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-18 08:04:08,042 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.521e+02 4.224e+02 5.242e+02 1.118e+03, threshold=8.449e+02, percent-clipped=1.0 2023-06-18 08:04:15,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.83 vs. 
limit=22.5 2023-06-18 08:04:19,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=159840.0, ans=0.125 2023-06-18 08:04:36,041 INFO [train.py:996] (2/4) Epoch 1, batch 26650, loss[loss=0.306, simple_loss=0.3416, pruned_loss=0.1352, over 21569.00 frames. ], tot_loss[loss=0.3277, simple_loss=0.3779, pruned_loss=0.1388, over 4272647.06 frames. ], batch size: 247, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:04:44,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=159900.0, ans=0.025 2023-06-18 08:04:50,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=159900.0, ans=0.0 2023-06-18 08:04:58,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=159960.0, ans=0.125 2023-06-18 08:05:24,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=160020.0, ans=0.0 2023-06-18 08:05:41,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=160080.0, ans=0.0 2023-06-18 08:06:03,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=160140.0, ans=0.2 2023-06-18 08:06:11,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=160140.0, ans=0.0 2023-06-18 08:06:16,332 INFO [train.py:996] (2/4) Epoch 1, batch 26700, loss[loss=0.2062, simple_loss=0.265, pruned_loss=0.07368, over 21145.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3674, pruned_loss=0.1321, over 4278350.45 frames. ], batch size: 176, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:06:25,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=160200.0, ans=0.125 2023-06-18 08:06:57,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=160320.0, ans=0.125 2023-06-18 08:07:05,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160320.0, ans=0.1 2023-06-18 08:07:18,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=160380.0, ans=0.125 2023-06-18 08:07:23,787 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.832e+02 2.935e+02 3.510e+02 4.681e+02 9.206e+02, threshold=7.020e+02, percent-clipped=3.0 2023-06-18 08:08:03,391 INFO [train.py:996] (2/4) Epoch 1, batch 26750, loss[loss=0.3569, simple_loss=0.4052, pruned_loss=0.1543, over 21392.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3663, pruned_loss=0.1301, over 4286674.72 frames. ], batch size: 131, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:08:20,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160500.0, ans=0.1 2023-06-18 08:08:25,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.99 vs. 
limit=15.0 2023-06-18 08:08:33,267 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:08:40,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=16.42 vs. limit=15.0 2023-06-18 08:08:41,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=160620.0, ans=0.1 2023-06-18 08:09:34,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=19.89 vs. limit=15.0 2023-06-18 08:09:52,198 INFO [train.py:996] (2/4) Epoch 1, batch 26800, loss[loss=0.3397, simple_loss=0.3823, pruned_loss=0.1485, over 21322.00 frames. ], tot_loss[loss=0.3245, simple_loss=0.3757, pruned_loss=0.1366, over 4286325.84 frames. ], batch size: 176, lr: 2.34e-02, grad_scale: 32.0 2023-06-18 08:09:56,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=160800.0, ans=0.09899494936611666 2023-06-18 08:10:05,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=160800.0, ans=0.125 2023-06-18 08:10:50,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=160980.0, ans=0.125 2023-06-18 08:11:01,909 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.593e+02 3.549e+02 4.364e+02 5.200e+02 1.402e+03, threshold=8.728e+02, percent-clipped=9.0 2023-06-18 08:11:26,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=161100.0, ans=0.1 2023-06-18 08:11:27,771 INFO [train.py:996] (2/4) Epoch 1, batch 26850, loss[loss=0.3045, simple_loss=0.3453, pruned_loss=0.1318, over 21810.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3795, pruned_loss=0.1413, over 4283467.17 frames. ], batch size: 118, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:11:37,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-18 08:11:57,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.63 vs. limit=15.0 2023-06-18 08:13:02,130 INFO [train.py:996] (2/4) Epoch 1, batch 26900, loss[loss=0.265, simple_loss=0.3183, pruned_loss=0.1058, over 21692.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3695, pruned_loss=0.1398, over 4273915.00 frames. 
], batch size: 124, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:13:03,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=161400.0, ans=0.2 2023-06-18 08:14:06,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 3.385e+02 4.142e+02 4.911e+02 9.199e+02, threshold=8.284e+02, percent-clipped=1.0 2023-06-18 08:14:14,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=161580.0, ans=0.125 2023-06-18 08:14:16,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=161640.0, ans=0.025 2023-06-18 08:14:37,080 INFO [train.py:996] (2/4) Epoch 1, batch 26950, loss[loss=0.4181, simple_loss=0.441, pruned_loss=0.1976, over 20009.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3683, pruned_loss=0.1397, over 4273203.18 frames. ], batch size: 702, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:14:40,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=161700.0, ans=0.0 2023-06-18 08:14:47,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.77 vs. limit=6.0 2023-06-18 08:15:06,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=161820.0, ans=0.125 2023-06-18 08:15:29,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=161820.0, ans=0.0 2023-06-18 08:16:03,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.88 vs. limit=15.0 2023-06-18 08:16:06,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-18 08:16:13,290 INFO [train.py:996] (2/4) Epoch 1, batch 27000, loss[loss=0.2796, simple_loss=0.3495, pruned_loss=0.1048, over 21674.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3691, pruned_loss=0.137, over 4276413.09 frames. ], batch size: 247, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:16:13,291 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 08:16:29,108 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.2828, simple_loss=0.3784, pruned_loss=0.09358, over 1796401.00 frames. 
2023-06-18 08:16:29,109 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 08:16:31,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=162000.0, ans=0.125 2023-06-18 08:16:40,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=162000.0, ans=0.95 2023-06-18 08:16:54,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=162060.0, ans=0.0 2023-06-18 08:17:18,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=162120.0, ans=0.125 2023-06-18 08:17:26,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=162180.0, ans=0.2 2023-06-18 08:17:39,547 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.208e+02 3.737e+02 4.814e+02 7.556e+02, threshold=7.473e+02, percent-clipped=0.0 2023-06-18 08:17:44,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=162240.0, ans=0.0 2023-06-18 08:18:01,112 INFO [train.py:996] (2/4) Epoch 1, batch 27050, loss[loss=0.3673, simple_loss=0.4142, pruned_loss=0.1602, over 21564.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.3728, pruned_loss=0.134, over 4279263.93 frames. ], batch size: 471, lr: 2.33e-02, grad_scale: 32.0 2023-06-18 08:18:55,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=162420.0, ans=0.125 2023-06-18 08:19:37,666 INFO [train.py:996] (2/4) Epoch 1, batch 27100, loss[loss=0.2883, simple_loss=0.3733, pruned_loss=0.1017, over 21681.00 frames. ], tot_loss[loss=0.3251, simple_loss=0.3761, pruned_loss=0.137, over 4277896.57 frames. ], batch size: 230, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:20:19,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=162660.0, ans=0.125 2023-06-18 08:20:46,389 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:20:52,390 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.734e+02 4.835e+02 6.632e+02 1.398e+03, threshold=9.671e+02, percent-clipped=18.0 2023-06-18 08:21:14,201 INFO [train.py:996] (2/4) Epoch 1, batch 27150, loss[loss=0.3824, simple_loss=0.4438, pruned_loss=0.1605, over 21767.00 frames. ], tot_loss[loss=0.3332, simple_loss=0.3867, pruned_loss=0.1399, over 4277015.46 frames. ], batch size: 351, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:22:55,456 INFO [train.py:996] (2/4) Epoch 1, batch 27200, loss[loss=0.4274, simple_loss=0.4506, pruned_loss=0.2021, over 21797.00 frames. ], tot_loss[loss=0.3433, simple_loss=0.3981, pruned_loss=0.1442, over 4273847.49 frames. 
], batch size: 124, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:23:06,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=163200.0, ans=0.0 2023-06-18 08:23:39,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=163320.0, ans=0.125 2023-06-18 08:23:39,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=163320.0, ans=0.07 2023-06-18 08:23:40,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-18 08:24:05,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.595e+02 4.705e+02 6.129e+02 1.080e+03, threshold=9.409e+02, percent-clipped=7.0 2023-06-18 08:24:41,938 INFO [train.py:996] (2/4) Epoch 1, batch 27250, loss[loss=0.3528, simple_loss=0.3922, pruned_loss=0.1567, over 21871.00 frames. ], tot_loss[loss=0.3527, simple_loss=0.4027, pruned_loss=0.1514, over 4266972.18 frames. ], batch size: 371, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:24:48,931 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:24:51,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=163500.0, ans=0.1 2023-06-18 08:25:19,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=163620.0, ans=0.1 2023-06-18 08:26:13,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=163740.0, ans=0.035 2023-06-18 08:26:20,979 INFO [train.py:996] (2/4) Epoch 1, batch 27300, loss[loss=0.3685, simple_loss=0.4323, pruned_loss=0.1524, over 21313.00 frames. ], tot_loss[loss=0.3559, simple_loss=0.4052, pruned_loss=0.1533, over 4261606.02 frames. ], batch size: 549, lr: 2.32e-02, grad_scale: 32.0 2023-06-18 08:26:42,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=163860.0, ans=0.125 2023-06-18 08:26:46,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=163860.0, ans=0.2 2023-06-18 08:26:53,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=163860.0, ans=0.125 2023-06-18 08:27:30,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=163980.0, ans=0.2 2023-06-18 08:27:36,280 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.630e+02 3.615e+02 4.138e+02 5.244e+02 1.044e+03, threshold=8.277e+02, percent-clipped=1.0 2023-06-18 08:28:02,483 INFO [train.py:996] (2/4) Epoch 1, batch 27350, loss[loss=0.3119, simple_loss=0.3683, pruned_loss=0.1277, over 21260.00 frames. ], tot_loss[loss=0.3571, simple_loss=0.4068, pruned_loss=0.1537, over 4268441.34 frames. ], batch size: 143, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:28:05,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.32 vs. 
limit=15.0 2023-06-18 08:28:48,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=164220.0, ans=0.0 2023-06-18 08:29:11,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=164280.0, ans=0.125 2023-06-18 08:29:12,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=164280.0, ans=0.125 2023-06-18 08:29:26,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=164340.0, ans=0.2 2023-06-18 08:29:37,991 INFO [train.py:996] (2/4) Epoch 1, batch 27400, loss[loss=0.3352, simple_loss=0.3654, pruned_loss=0.1525, over 21338.00 frames. ], tot_loss[loss=0.3535, simple_loss=0.4017, pruned_loss=0.1526, over 4266690.36 frames. ], batch size: 143, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:29:51,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=164400.0, ans=0.1 2023-06-18 08:30:47,962 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.591e+02 4.552e+02 5.428e+02 9.216e+02, threshold=9.104e+02, percent-clipped=2.0 2023-06-18 08:30:57,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=164640.0, ans=0.0 2023-06-18 08:30:57,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. limit=10.0 2023-06-18 08:31:13,687 INFO [train.py:996] (2/4) Epoch 1, batch 27450, loss[loss=0.3302, simple_loss=0.3903, pruned_loss=0.1351, over 21621.00 frames. ], tot_loss[loss=0.3468, simple_loss=0.3945, pruned_loss=0.1495, over 4272755.28 frames. ], batch size: 247, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:32:42,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=164940.0, ans=15.0 2023-06-18 08:32:49,996 INFO [train.py:996] (2/4) Epoch 1, batch 27500, loss[loss=0.3215, simple_loss=0.3607, pruned_loss=0.1412, over 21584.00 frames. ], tot_loss[loss=0.3476, simple_loss=0.3934, pruned_loss=0.1509, over 4282215.83 frames. ], batch size: 212, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:33:02,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=165000.0, ans=0.125 2023-06-18 08:33:30,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=165120.0, ans=0.125 2023-06-18 08:33:50,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=165180.0, ans=0.0 2023-06-18 08:33:56,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=165180.0, ans=0.05 2023-06-18 08:34:03,945 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.309e+02 3.875e+02 5.024e+02 1.518e+03, threshold=7.749e+02, percent-clipped=3.0 2023-06-18 08:34:24,889 INFO [train.py:996] (2/4) Epoch 1, batch 27550, loss[loss=0.2706, simple_loss=0.3273, pruned_loss=0.107, over 21603.00 frames. ], tot_loss[loss=0.3373, simple_loss=0.3844, pruned_loss=0.1451, over 4279545.89 frames. 
], batch size: 263, lr: 2.31e-02, grad_scale: 32.0 2023-06-18 08:34:32,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=165300.0, ans=0.0 2023-06-18 08:34:39,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=165360.0, ans=0.125 2023-06-18 08:34:40,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=165360.0, ans=0.5 2023-06-18 08:35:21,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=165420.0, ans=0.2 2023-06-18 08:35:25,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=165480.0, ans=0.2 2023-06-18 08:35:51,191 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.27 vs. limit=22.5 2023-06-18 08:35:57,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=165600.0, ans=0.125 2023-06-18 08:35:59,086 INFO [train.py:996] (2/4) Epoch 1, batch 27600, loss[loss=0.3406, simple_loss=0.3692, pruned_loss=0.156, over 21265.00 frames. ], tot_loss[loss=0.3319, simple_loss=0.3769, pruned_loss=0.1434, over 4268766.78 frames. ], batch size: 471, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:36:29,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=165660.0, ans=0.0 2023-06-18 08:36:38,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=165720.0, ans=0.125 2023-06-18 08:37:06,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.02 vs. limit=6.0 2023-06-18 08:37:06,690 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.205e+02 4.118e+02 5.523e+02 1.130e+03, threshold=8.236e+02, percent-clipped=6.0 2023-06-18 08:37:16,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=165840.0, ans=0.0 2023-06-18 08:37:30,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.86 vs. limit=22.5 2023-06-18 08:37:31,482 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:37:32,572 INFO [train.py:996] (2/4) Epoch 1, batch 27650, loss[loss=0.3526, simple_loss=0.384, pruned_loss=0.1606, over 21622.00 frames. ], tot_loss[loss=0.3263, simple_loss=0.3696, pruned_loss=0.1415, over 4276758.86 frames. 
], batch size: 441, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:37:59,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=165960.0, ans=0.125 2023-06-18 08:38:19,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=166020.0, ans=0.125 2023-06-18 08:38:31,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=166080.0, ans=0.015 2023-06-18 08:38:41,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166080.0, ans=0.1 2023-06-18 08:39:06,515 INFO [train.py:996] (2/4) Epoch 1, batch 27700, loss[loss=0.4228, simple_loss=0.4583, pruned_loss=0.1937, over 21556.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3696, pruned_loss=0.1384, over 4284564.15 frames. ], batch size: 508, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:39:47,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=166320.0, ans=0.05 2023-06-18 08:40:12,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=166380.0, ans=0.2 2023-06-18 08:40:20,599 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.538e+02 3.610e+02 4.452e+02 5.999e+02 1.124e+03, threshold=8.903e+02, percent-clipped=7.0 2023-06-18 08:40:37,975 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=22.5 2023-06-18 08:40:41,468 INFO [train.py:996] (2/4) Epoch 1, batch 27750, loss[loss=0.3439, simple_loss=0.3936, pruned_loss=0.1471, over 21847.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3724, pruned_loss=0.1374, over 4281038.05 frames. ], batch size: 351, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:40:46,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=166500.0, ans=0.125 2023-06-18 08:41:15,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166620.0, ans=0.1 2023-06-18 08:41:37,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=166620.0, ans=0.0 2023-06-18 08:41:49,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=166680.0, ans=0.07 2023-06-18 08:42:16,149 INFO [train.py:996] (2/4) Epoch 1, batch 27800, loss[loss=0.3201, simple_loss=0.367, pruned_loss=0.1366, over 21497.00 frames. ], tot_loss[loss=0.325, simple_loss=0.3721, pruned_loss=0.1389, over 4290132.31 frames. 
], batch size: 131, lr: 2.30e-02, grad_scale: 32.0 2023-06-18 08:42:35,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=166860.0, ans=0.125 2023-06-18 08:42:42,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=166860.0, ans=0.125 2023-06-18 08:43:14,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166980.0, ans=0.1 2023-06-18 08:43:18,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-18 08:43:20,175 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 3.396e+02 4.185e+02 5.590e+02 8.815e+02, threshold=8.371e+02, percent-clipped=0.0 2023-06-18 08:43:28,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=166980.0, ans=0.1 2023-06-18 08:43:46,601 INFO [train.py:996] (2/4) Epoch 1, batch 27850, loss[loss=0.3573, simple_loss=0.3942, pruned_loss=0.1602, over 21367.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3729, pruned_loss=0.1414, over 4294254.77 frames. ], batch size: 159, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:43:59,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=167100.0, ans=0.125 2023-06-18 08:44:18,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.38 vs. limit=15.0 2023-06-18 08:44:35,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=167220.0, ans=0.0 2023-06-18 08:45:10,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=167340.0, ans=0.125 2023-06-18 08:45:15,242 INFO [train.py:996] (2/4) Epoch 1, batch 27900, loss[loss=0.3404, simple_loss=0.4049, pruned_loss=0.1379, over 21799.00 frames. ], tot_loss[loss=0.3359, simple_loss=0.3851, pruned_loss=0.1433, over 4297653.05 frames. ], batch size: 316, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:45:47,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=167460.0, ans=0.0 2023-06-18 08:46:22,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=167580.0, ans=0.0 2023-06-18 08:46:26,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.916e+02 4.626e+02 5.836e+02 1.013e+03, threshold=9.252e+02, percent-clipped=5.0 2023-06-18 08:46:58,160 INFO [train.py:996] (2/4) Epoch 1, batch 27950, loss[loss=0.327, simple_loss=0.4075, pruned_loss=0.1232, over 20751.00 frames. ], tot_loss[loss=0.3268, simple_loss=0.3809, pruned_loss=0.1364, over 4292403.85 frames. 
], batch size: 607, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:47:04,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=167700.0, ans=0.125 2023-06-18 08:47:06,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=167700.0, ans=0.125 2023-06-18 08:47:11,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=167700.0, ans=0.2 2023-06-18 08:48:01,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=15.0 2023-06-18 08:48:35,920 INFO [train.py:996] (2/4) Epoch 1, batch 28000, loss[loss=0.3415, simple_loss=0.3817, pruned_loss=0.1507, over 21560.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3793, pruned_loss=0.1334, over 4293697.17 frames. ], batch size: 548, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:48:37,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=168000.0, ans=0.125 2023-06-18 08:48:49,874 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:49:35,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 3.441e+02 4.640e+02 5.582e+02 1.043e+03, threshold=9.281e+02, percent-clipped=2.0 2023-06-18 08:50:11,361 INFO [train.py:996] (2/4) Epoch 1, batch 28050, loss[loss=0.3338, simple_loss=0.3885, pruned_loss=0.1395, over 21692.00 frames. ], tot_loss[loss=0.3238, simple_loss=0.3767, pruned_loss=0.1354, over 4296393.08 frames. ], batch size: 389, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:50:25,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=168360.0, ans=0.0 2023-06-18 08:50:41,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168360.0, ans=0.1 2023-06-18 08:50:46,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.43 vs. limit=15.0 2023-06-18 08:50:57,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=168420.0, ans=0.125 2023-06-18 08:51:18,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=168480.0, ans=0.125 2023-06-18 08:51:31,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=168540.0, ans=0.0 2023-06-18 08:51:34,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=168540.0, ans=0.2 2023-06-18 08:51:41,521 INFO [train.py:996] (2/4) Epoch 1, batch 28100, loss[loss=0.308, simple_loss=0.3444, pruned_loss=0.1358, over 21805.00 frames. ], tot_loss[loss=0.3233, simple_loss=0.3746, pruned_loss=0.136, over 4296283.51 frames. 
], batch size: 124, lr: 2.29e-02, grad_scale: 32.0 2023-06-18 08:51:47,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=168600.0, ans=0.2 2023-06-18 08:52:51,341 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.651e+02 4.541e+02 5.753e+02 9.912e+02, threshold=9.083e+02, percent-clipped=1.0 2023-06-18 08:53:16,369 INFO [train.py:996] (2/4) Epoch 1, batch 28150, loss[loss=0.2917, simple_loss=0.3239, pruned_loss=0.1297, over 21602.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3685, pruned_loss=0.1362, over 4279089.41 frames. ], batch size: 247, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:53:17,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=168900.0, ans=0.1 2023-06-18 08:53:17,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=168900.0, ans=0.1 2023-06-18 08:53:46,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=168960.0, ans=0.125 2023-06-18 08:54:36,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=169140.0, ans=0.125 2023-06-18 08:54:51,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=169200.0, ans=0.0 2023-06-18 08:54:52,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-18 08:54:53,288 INFO [train.py:996] (2/4) Epoch 1, batch 28200, loss[loss=0.4567, simple_loss=0.5406, pruned_loss=0.1864, over 19782.00 frames. ], tot_loss[loss=0.3243, simple_loss=0.3699, pruned_loss=0.1394, over 4277779.76 frames. ], batch size: 702, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:54:57,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=169200.0, ans=0.125 2023-06-18 08:54:58,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=169200.0, ans=0.125 2023-06-18 08:54:59,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=169200.0, ans=0.125 2023-06-18 08:55:08,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=169260.0, ans=0.2 2023-06-18 08:55:32,081 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 08:56:00,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=169380.0, ans=0.125 2023-06-18 08:56:04,068 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.823e+02 5.073e+02 6.497e+02 1.031e+03, threshold=1.015e+03, percent-clipped=3.0 2023-06-18 08:56:10,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=169440.0, ans=0.0 2023-06-18 08:56:28,475 INFO [train.py:996] (2/4) Epoch 1, batch 28250, loss[loss=0.2925, simple_loss=0.3328, pruned_loss=0.1261, over 21205.00 frames. 
], tot_loss[loss=0.3316, simple_loss=0.3756, pruned_loss=0.1438, over 4271997.89 frames. ], batch size: 159, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:56:33,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=169500.0, ans=0.1 2023-06-18 08:56:37,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=169500.0, ans=0.2 2023-06-18 08:56:54,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=169560.0, ans=0.0 2023-06-18 08:57:06,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=169620.0, ans=0.0 2023-06-18 08:57:31,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.44 vs. limit=15.0 2023-06-18 08:57:59,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=169800.0, ans=0.2 2023-06-18 08:58:00,558 INFO [train.py:996] (2/4) Epoch 1, batch 28300, loss[loss=0.3328, simple_loss=0.401, pruned_loss=0.1324, over 21672.00 frames. ], tot_loss[loss=0.3273, simple_loss=0.3727, pruned_loss=0.1409, over 4273906.80 frames. ], batch size: 441, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:58:15,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=169860.0, ans=0.1 2023-06-18 08:58:43,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=169920.0, ans=0.125 2023-06-18 08:58:52,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0 2023-06-18 08:58:54,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=169920.0, ans=0.2 2023-06-18 08:58:59,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=169980.0, ans=0.0 2023-06-18 08:59:11,383 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 3.420e+02 4.323e+02 5.538e+02 1.121e+03, threshold=8.647e+02, percent-clipped=1.0 2023-06-18 08:59:28,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=170040.0, ans=0.1 2023-06-18 08:59:31,051 INFO [train.py:996] (2/4) Epoch 1, batch 28350, loss[loss=0.2231, simple_loss=0.3073, pruned_loss=0.06943, over 21350.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3668, pruned_loss=0.1323, over 4277842.62 frames. ], batch size: 211, lr: 2.28e-02, grad_scale: 16.0 2023-06-18 08:59:38,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=170100.0, ans=0.035 2023-06-18 08:59:43,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=170100.0, ans=0.0 2023-06-18 08:59:47,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.23 vs. 
limit=15.0 2023-06-18 08:59:51,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=170160.0, ans=0.125 2023-06-18 08:59:53,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=170160.0, ans=6.0 2023-06-18 09:00:45,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=170280.0, ans=0.125 2023-06-18 09:01:02,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=170340.0, ans=0.125 2023-06-18 09:01:09,031 INFO [train.py:996] (2/4) Epoch 1, batch 28400, loss[loss=0.3917, simple_loss=0.4693, pruned_loss=0.157, over 19726.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.3626, pruned_loss=0.1315, over 4274541.15 frames. ], batch size: 703, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:01:10,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=170400.0, ans=0.04949747468305833 2023-06-18 09:01:14,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-18 09:01:24,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=170460.0, ans=0.5 2023-06-18 09:01:42,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=170460.0, ans=0.125 2023-06-18 09:01:53,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=22.5 2023-06-18 09:02:24,610 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 3.627e+02 4.521e+02 5.478e+02 1.024e+03, threshold=9.042e+02, percent-clipped=4.0 2023-06-18 09:02:33,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=170640.0, ans=0.1 2023-06-18 09:02:43,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=170700.0, ans=0.125 2023-06-18 09:02:44,594 INFO [train.py:996] (2/4) Epoch 1, batch 28450, loss[loss=0.3169, simple_loss=0.3812, pruned_loss=0.1263, over 21753.00 frames. ], tot_loss[loss=0.3239, simple_loss=0.3711, pruned_loss=0.1384, over 4272309.01 frames. ], batch size: 112, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:02:58,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-18 09:04:07,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=170940.0, ans=0.07 2023-06-18 09:04:20,714 INFO [train.py:996] (2/4) Epoch 1, batch 28500, loss[loss=0.3955, simple_loss=0.4279, pruned_loss=0.1816, over 21248.00 frames. ], tot_loss[loss=0.3318, simple_loss=0.3759, pruned_loss=0.1438, over 4280587.34 frames. 
], batch size: 143, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:04:22,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=171000.0, ans=0.125 2023-06-18 09:04:57,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=171060.0, ans=0.0 2023-06-18 09:05:37,218 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.453e+02 3.668e+02 4.799e+02 6.213e+02 1.260e+03, threshold=9.598e+02, percent-clipped=4.0 2023-06-18 09:06:07,521 INFO [train.py:996] (2/4) Epoch 1, batch 28550, loss[loss=0.3668, simple_loss=0.4339, pruned_loss=0.1498, over 21724.00 frames. ], tot_loss[loss=0.3402, simple_loss=0.3845, pruned_loss=0.1479, over 4283452.58 frames. ], batch size: 247, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:06:12,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=171300.0, ans=0.2 2023-06-18 09:06:14,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=171300.0, ans=0.125 2023-06-18 09:06:14,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=171300.0, ans=0.0 2023-06-18 09:07:06,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-18 09:07:09,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=12.0 2023-06-18 09:07:10,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=171480.0, ans=0.0 2023-06-18 09:07:42,100 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=15.0 2023-06-18 09:07:47,165 INFO [train.py:996] (2/4) Epoch 1, batch 28600, loss[loss=0.3335, simple_loss=0.3862, pruned_loss=0.1404, over 21374.00 frames. ], tot_loss[loss=0.3459, simple_loss=0.3919, pruned_loss=0.1499, over 4283172.64 frames. ], batch size: 211, lr: 2.27e-02, grad_scale: 32.0 2023-06-18 09:07:54,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=22.5 2023-06-18 09:08:47,903 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.320e+02 4.208e+02 5.237e+02 8.981e+02, threshold=8.415e+02, percent-clipped=0.0 2023-06-18 09:09:02,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=171840.0, ans=0.125 2023-06-18 09:09:22,065 INFO [train.py:996] (2/4) Epoch 1, batch 28650, loss[loss=0.2875, simple_loss=0.3297, pruned_loss=0.1227, over 21857.00 frames. ], tot_loss[loss=0.3399, simple_loss=0.384, pruned_loss=0.1479, over 4286134.02 frames. 
], batch size: 98, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:09:48,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=171960.0, ans=0.0 2023-06-18 09:09:52,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=171960.0, ans=0.125 2023-06-18 09:09:54,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=171960.0, ans=0.125 2023-06-18 09:10:01,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.17 vs. limit=6.0 2023-06-18 09:10:10,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=172020.0, ans=0.0 2023-06-18 09:10:57,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=172200.0, ans=0.125 2023-06-18 09:10:58,011 INFO [train.py:996] (2/4) Epoch 1, batch 28700, loss[loss=0.3392, simple_loss=0.3837, pruned_loss=0.1473, over 21796.00 frames. ], tot_loss[loss=0.3404, simple_loss=0.383, pruned_loss=0.1489, over 4288305.29 frames. ], batch size: 247, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:11:05,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-18 09:11:18,526 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:11:33,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=22.5 2023-06-18 09:12:05,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=172380.0, ans=0.125 2023-06-18 09:12:09,234 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.359e+02 4.224e+02 5.589e+02 9.530e+02, threshold=8.447e+02, percent-clipped=4.0 2023-06-18 09:12:38,147 INFO [train.py:996] (2/4) Epoch 1, batch 28750, loss[loss=0.3335, simple_loss=0.3738, pruned_loss=0.1466, over 21553.00 frames. ], tot_loss[loss=0.3407, simple_loss=0.3821, pruned_loss=0.1496, over 4289649.74 frames. ], batch size: 144, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:13:00,718 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=7.788e-03 2023-06-18 09:13:08,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=172620.0, ans=0.125 2023-06-18 09:13:11,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.35 vs. limit=15.0 2023-06-18 09:14:11,453 INFO [train.py:996] (2/4) Epoch 1, batch 28800, loss[loss=0.3809, simple_loss=0.4213, pruned_loss=0.1702, over 21571.00 frames. ], tot_loss[loss=0.3428, simple_loss=0.3864, pruned_loss=0.1496, over 4289370.51 frames. 
], batch size: 414, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:14:30,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=172860.0, ans=0.125 2023-06-18 09:15:20,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=172980.0, ans=0.125 2023-06-18 09:15:25,214 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.326e+02 4.027e+02 5.437e+02 1.151e+03, threshold=8.055e+02, percent-clipped=4.0 2023-06-18 09:15:41,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-18 09:15:49,268 INFO [train.py:996] (2/4) Epoch 1, batch 28850, loss[loss=0.347, simple_loss=0.3736, pruned_loss=0.1602, over 20917.00 frames. ], tot_loss[loss=0.3462, simple_loss=0.3887, pruned_loss=0.1519, over 4290028.63 frames. ], batch size: 607, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:15:52,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=173100.0, ans=0.1 2023-06-18 09:16:14,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=173160.0, ans=0.0 2023-06-18 09:16:19,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=173220.0, ans=0.04949747468305833 2023-06-18 09:17:25,763 INFO [train.py:996] (2/4) Epoch 1, batch 28900, loss[loss=0.4115, simple_loss=0.4481, pruned_loss=0.1875, over 20884.00 frames. ], tot_loss[loss=0.3498, simple_loss=0.392, pruned_loss=0.1538, over 4289786.65 frames. ], batch size: 608, lr: 2.26e-02, grad_scale: 32.0 2023-06-18 09:17:26,260 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:17:39,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=173400.0, ans=0.0 2023-06-18 09:17:42,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=173460.0, ans=0.125 2023-06-18 09:17:42,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173460.0, ans=0.1 2023-06-18 09:18:34,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=173580.0, ans=0.1 2023-06-18 09:18:38,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.639e+02 3.786e+02 4.531e+02 6.034e+02 1.219e+03, threshold=9.062e+02, percent-clipped=7.0 2023-06-18 09:18:59,000 INFO [train.py:996] (2/4) Epoch 1, batch 28950, loss[loss=0.2783, simple_loss=0.325, pruned_loss=0.1158, over 21230.00 frames. ], tot_loss[loss=0.3453, simple_loss=0.3896, pruned_loss=0.1505, over 4276100.45 frames. 
], batch size: 176, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:19:06,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=173700.0, ans=0.0 2023-06-18 09:19:23,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=173760.0, ans=0.2 2023-06-18 09:20:04,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=173880.0, ans=0.0 2023-06-18 09:20:21,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=173940.0, ans=0.2 2023-06-18 09:20:30,562 INFO [train.py:996] (2/4) Epoch 1, batch 29000, loss[loss=0.3661, simple_loss=0.4037, pruned_loss=0.1642, over 19951.00 frames. ], tot_loss[loss=0.3459, simple_loss=0.3936, pruned_loss=0.1491, over 4263594.95 frames. ], batch size: 703, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:21:37,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.74 vs. limit=10.0 2023-06-18 09:21:45,738 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 3.309e+02 4.233e+02 5.463e+02 9.741e+02, threshold=8.465e+02, percent-clipped=3.0 2023-06-18 09:22:01,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=174240.0, ans=0.125 2023-06-18 09:22:05,671 INFO [train.py:996] (2/4) Epoch 1, batch 29050, loss[loss=0.316, simple_loss=0.3631, pruned_loss=0.1344, over 21565.00 frames. ], tot_loss[loss=0.347, simple_loss=0.3925, pruned_loss=0.1508, over 4275457.49 frames. ], batch size: 548, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:22:11,399 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:22:34,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=174360.0, ans=0.125 2023-06-18 09:23:31,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=174540.0, ans=0.125 2023-06-18 09:23:40,554 INFO [train.py:996] (2/4) Epoch 1, batch 29100, loss[loss=0.275, simple_loss=0.3207, pruned_loss=0.1147, over 21684.00 frames. ], tot_loss[loss=0.3362, simple_loss=0.3803, pruned_loss=0.146, over 4279918.05 frames. 
], batch size: 316, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:24:12,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=174660.0, ans=0.125 2023-06-18 09:24:21,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=174720.0, ans=0.125 2023-06-18 09:24:23,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=174720.0, ans=0.0 2023-06-18 09:24:40,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=174780.0, ans=0.1 2023-06-18 09:24:48,546 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.440e+02 4.060e+02 5.417e+02 8.880e+02, threshold=8.120e+02, percent-clipped=2.0 2023-06-18 09:25:17,854 INFO [train.py:996] (2/4) Epoch 1, batch 29150, loss[loss=0.2668, simple_loss=0.3399, pruned_loss=0.09687, over 21278.00 frames. ], tot_loss[loss=0.3321, simple_loss=0.3788, pruned_loss=0.1427, over 4275836.94 frames. ], batch size: 176, lr: 2.25e-02, grad_scale: 32.0 2023-06-18 09:25:19,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=174900.0, ans=0.125 2023-06-18 09:25:45,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=174960.0, ans=0.2 2023-06-18 09:25:52,024 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-18 09:25:59,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=175020.0, ans=0.0 2023-06-18 09:26:06,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=175080.0, ans=0.125 2023-06-18 09:26:48,414 INFO [train.py:996] (2/4) Epoch 1, batch 29200, loss[loss=0.3028, simple_loss=0.3365, pruned_loss=0.1346, over 21275.00 frames. ], tot_loss[loss=0.3271, simple_loss=0.3729, pruned_loss=0.1407, over 4275791.95 frames. ], batch size: 144, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:27:15,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. limit=10.0 2023-06-18 09:27:16,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=175260.0, ans=0.0 2023-06-18 09:27:42,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.33 vs. 
limit=15.0 2023-06-18 09:27:50,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.297e+02 4.275e+02 5.517e+02 1.101e+03, threshold=8.550e+02, percent-clipped=8.0 2023-06-18 09:27:54,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=175440.0, ans=0.125 2023-06-18 09:28:11,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=175440.0, ans=0.125 2023-06-18 09:28:11,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=175440.0, ans=0.125 2023-06-18 09:28:25,610 INFO [train.py:996] (2/4) Epoch 1, batch 29250, loss[loss=0.3443, simple_loss=0.4044, pruned_loss=0.1421, over 21829.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3702, pruned_loss=0.1372, over 4266858.07 frames. ], batch size: 317, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:28:41,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=175500.0, ans=0.125 2023-06-18 09:28:48,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.84 vs. limit=6.0 2023-06-18 09:28:51,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-18 09:28:58,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=175560.0, ans=0.125 2023-06-18 09:29:06,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=175620.0, ans=10.0 2023-06-18 09:30:05,184 INFO [train.py:996] (2/4) Epoch 1, batch 29300, loss[loss=0.2992, simple_loss=0.3781, pruned_loss=0.1102, over 21711.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3724, pruned_loss=0.1368, over 4269534.68 frames. ], batch size: 332, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:30:13,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=175800.0, ans=0.2 2023-06-18 09:30:29,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-18 09:30:41,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=175920.0, ans=0.125 2023-06-18 09:31:07,351 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.203e+02 3.858e+02 5.248e+02 6.398e+02 1.119e+03, threshold=1.050e+03, percent-clipped=2.0 2023-06-18 09:31:16,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=176040.0, ans=12.0 2023-06-18 09:31:37,316 INFO [train.py:996] (2/4) Epoch 1, batch 29350, loss[loss=0.3325, simple_loss=0.3931, pruned_loss=0.136, over 21637.00 frames. ], tot_loss[loss=0.3216, simple_loss=0.3688, pruned_loss=0.1372, over 4270610.80 frames. ], batch size: 298, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:33:05,656 INFO [train.py:996] (2/4) Epoch 1, batch 29400, loss[loss=0.3224, simple_loss=0.3901, pruned_loss=0.1274, over 21294.00 frames. ], tot_loss[loss=0.319, simple_loss=0.3693, pruned_loss=0.1344, over 4275285.06 frames. 
], batch size: 551, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:33:09,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=176400.0, ans=0.125 2023-06-18 09:34:04,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.60 vs. limit=15.0 2023-06-18 09:34:22,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 3.521e+02 4.187e+02 5.229e+02 1.148e+03, threshold=8.373e+02, percent-clipped=2.0 2023-06-18 09:34:30,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=176640.0, ans=0.0 2023-06-18 09:34:42,257 INFO [train.py:996] (2/4) Epoch 1, batch 29450, loss[loss=0.3847, simple_loss=0.4253, pruned_loss=0.1721, over 21573.00 frames. ], tot_loss[loss=0.3183, simple_loss=0.3689, pruned_loss=0.1338, over 4271133.39 frames. ], batch size: 414, lr: 2.24e-02, grad_scale: 32.0 2023-06-18 09:34:48,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=176700.0, ans=0.125 2023-06-18 09:35:06,053 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.30 vs. limit=22.5 2023-06-18 09:35:13,855 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-18 09:36:18,609 INFO [train.py:996] (2/4) Epoch 1, batch 29500, loss[loss=0.3371, simple_loss=0.3707, pruned_loss=0.1518, over 21267.00 frames. ], tot_loss[loss=0.3266, simple_loss=0.3749, pruned_loss=0.1391, over 4276069.55 frames. ], batch size: 159, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:36:24,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5 2023-06-18 09:36:47,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-18 09:37:09,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=177120.0, ans=0.1 2023-06-18 09:37:29,653 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.303e+02 3.922e+02 4.990e+02 9.245e+02, threshold=7.844e+02, percent-clipped=1.0 2023-06-18 09:37:39,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=177240.0, ans=0.1 2023-06-18 09:37:54,388 INFO [train.py:996] (2/4) Epoch 1, batch 29550, loss[loss=0.3414, simple_loss=0.3796, pruned_loss=0.1516, over 21886.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.374, pruned_loss=0.1409, over 4281927.57 frames. ], batch size: 414, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:38:43,988 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. 
limit=15.0 2023-06-18 09:38:51,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=177420.0, ans=0.125 2023-06-18 09:38:55,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=177480.0, ans=0.05 2023-06-18 09:39:29,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=177600.0, ans=0.125 2023-06-18 09:39:30,941 INFO [train.py:996] (2/4) Epoch 1, batch 29600, loss[loss=0.3411, simple_loss=0.4083, pruned_loss=0.137, over 21840.00 frames. ], tot_loss[loss=0.3339, simple_loss=0.3805, pruned_loss=0.1437, over 4282218.33 frames. ], batch size: 316, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:39:32,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=177600.0, ans=0.1 2023-06-18 09:39:39,685 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-18 09:39:45,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=177660.0, ans=10.0 2023-06-18 09:40:14,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=177720.0, ans=0.125 2023-06-18 09:40:29,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=177780.0, ans=0.025 2023-06-18 09:40:36,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-18 09:40:41,178 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.582e+02 4.234e+02 5.701e+02 1.045e+03, threshold=8.469e+02, percent-clipped=5.0 2023-06-18 09:40:46,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=177840.0, ans=0.125 2023-06-18 09:40:46,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-18 09:41:00,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=177900.0, ans=0.125 2023-06-18 09:41:01,414 INFO [train.py:996] (2/4) Epoch 1, batch 29650, loss[loss=0.3009, simple_loss=0.3433, pruned_loss=0.1292, over 21853.00 frames. ], tot_loss[loss=0.3286, simple_loss=0.3784, pruned_loss=0.1394, over 4283136.10 frames. ], batch size: 351, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:41:47,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=178020.0, ans=0.1 2023-06-18 09:41:58,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.75 vs. limit=22.5 2023-06-18 09:42:37,429 INFO [train.py:996] (2/4) Epoch 1, batch 29700, loss[loss=0.2202, simple_loss=0.2976, pruned_loss=0.07139, over 21700.00 frames. ], tot_loss[loss=0.3298, simple_loss=0.3794, pruned_loss=0.1401, over 4288991.71 frames. 
], batch size: 298, lr: 2.23e-02, grad_scale: 32.0 2023-06-18 09:43:46,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=22.5 2023-06-18 09:43:51,722 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.547e+02 3.946e+02 4.847e+02 6.771e+02 1.201e+03, threshold=9.693e+02, percent-clipped=9.0 2023-06-18 09:43:52,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.54 vs. limit=22.5 2023-06-18 09:43:57,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=6.96 vs. limit=6.0 2023-06-18 09:44:05,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=178440.0, ans=0.015 2023-06-18 09:44:06,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.52 vs. limit=15.0 2023-06-18 09:44:11,030 INFO [train.py:996] (2/4) Epoch 1, batch 29750, loss[loss=0.2732, simple_loss=0.3311, pruned_loss=0.1076, over 21463.00 frames. ], tot_loss[loss=0.3311, simple_loss=0.3835, pruned_loss=0.1394, over 4285915.07 frames. ], batch size: 131, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:44:13,571 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.61 vs. limit=15.0 2023-06-18 09:44:20,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=178500.0, ans=0.0 2023-06-18 09:44:59,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=178620.0, ans=0.125 2023-06-18 09:45:02,002 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.27 vs. limit=10.0 2023-06-18 09:45:14,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=178680.0, ans=0.125 2023-06-18 09:45:26,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=178740.0, ans=0.09899494936611666 2023-06-18 09:45:29,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=178740.0, ans=0.2 2023-06-18 09:45:32,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-18 09:45:41,175 INFO [train.py:996] (2/4) Epoch 1, batch 29800, loss[loss=0.3519, simple_loss=0.3911, pruned_loss=0.1564, over 21752.00 frames. ], tot_loss[loss=0.3335, simple_loss=0.3856, pruned_loss=0.1408, over 4290150.97 frames. 
], batch size: 112, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:45:43,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=178800.0, ans=0.125 2023-06-18 09:46:23,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=178860.0, ans=0.0 2023-06-18 09:46:32,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=178920.0, ans=0.1 2023-06-18 09:46:56,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=178980.0, ans=0.0 2023-06-18 09:46:57,391 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.287e+02 3.786e+02 4.602e+02 9.209e+02, threshold=7.572e+02, percent-clipped=0.0 2023-06-18 09:47:11,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=179040.0, ans=0.1 2023-06-18 09:47:17,296 INFO [train.py:996] (2/4) Epoch 1, batch 29850, loss[loss=0.2734, simple_loss=0.333, pruned_loss=0.1069, over 21796.00 frames. ], tot_loss[loss=0.3279, simple_loss=0.3806, pruned_loss=0.1376, over 4291362.76 frames. ], batch size: 282, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:47:27,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=179100.0, ans=0.0 2023-06-18 09:47:31,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=179160.0, ans=0.1 2023-06-18 09:47:34,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=179160.0, ans=0.1 2023-06-18 09:48:52,473 INFO [train.py:996] (2/4) Epoch 1, batch 29900, loss[loss=0.3611, simple_loss=0.3942, pruned_loss=0.164, over 21732.00 frames. ], tot_loss[loss=0.3307, simple_loss=0.3803, pruned_loss=0.1405, over 4294584.04 frames. ], batch size: 351, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:49:21,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=179460.0, ans=0.1 2023-06-18 09:49:34,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=179460.0, ans=0.125 2023-06-18 09:49:37,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=179520.0, ans=0.0 2023-06-18 09:50:02,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.51 vs. limit=15.0 2023-06-18 09:50:09,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 3.414e+02 4.153e+02 5.110e+02 1.047e+03, threshold=8.306e+02, percent-clipped=3.0 2023-06-18 09:50:34,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-18 09:50:34,876 INFO [train.py:996] (2/4) Epoch 1, batch 29950, loss[loss=0.3609, simple_loss=0.3989, pruned_loss=0.1614, over 21735.00 frames. ], tot_loss[loss=0.3368, simple_loss=0.384, pruned_loss=0.1448, over 4289789.07 frames. 
], batch size: 298, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:51:10,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-18 09:51:29,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=179820.0, ans=0.0 2023-06-18 09:51:45,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=179880.0, ans=0.125 2023-06-18 09:52:08,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-06-18 09:52:16,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=180000.0, ans=0.125 2023-06-18 09:52:17,271 INFO [train.py:996] (2/4) Epoch 1, batch 30000, loss[loss=0.3033, simple_loss=0.3875, pruned_loss=0.1095, over 21662.00 frames. ], tot_loss[loss=0.3386, simple_loss=0.3865, pruned_loss=0.1454, over 4285313.57 frames. ], batch size: 389, lr: 2.22e-02, grad_scale: 32.0 2023-06-18 09:52:17,272 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 09:52:35,097 INFO [train.py:1028] (2/4) Epoch 1, validation: loss=0.2819, simple_loss=0.3813, pruned_loss=0.09129, over 1796401.00 frames. 2023-06-18 09:52:35,098 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 09:53:16,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=180120.0, ans=0.125 2023-06-18 09:53:22,794 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:53:54,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 3.360e+02 4.099e+02 5.189e+02 8.987e+02, threshold=8.197e+02, percent-clipped=1.0 2023-06-18 09:53:58,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=180240.0, ans=0.0 2023-06-18 09:54:19,761 INFO [train.py:996] (2/4) Epoch 1, batch 30050, loss[loss=0.4282, simple_loss=0.5132, pruned_loss=0.1716, over 21211.00 frames. ], tot_loss[loss=0.3348, simple_loss=0.389, pruned_loss=0.1403, over 4268422.78 frames. 
], batch size: 549, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 09:54:24,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=180300.0, ans=0.125 2023-06-18 09:54:30,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=180300.0, ans=0.0 2023-06-18 09:54:33,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=180360.0, ans=0.125 2023-06-18 09:54:45,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=180360.0, ans=0.125 2023-06-18 09:55:50,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=180540.0, ans=0.2 2023-06-18 09:55:51,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=180540.0, ans=0.2 2023-06-18 09:55:56,183 INFO [train.py:996] (2/4) Epoch 1, batch 30100, loss[loss=0.3505, simple_loss=0.3694, pruned_loss=0.1658, over 21513.00 frames. ], tot_loss[loss=0.3357, simple_loss=0.3896, pruned_loss=0.1409, over 4264822.65 frames. ], batch size: 414, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 09:56:35,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.17 vs. limit=15.0 2023-06-18 09:56:58,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0 2023-06-18 09:57:06,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=22.5 2023-06-18 09:57:11,284 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.434e+02 4.118e+02 5.111e+02 9.252e+02, threshold=8.235e+02, percent-clipped=1.0 2023-06-18 09:57:17,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=180840.0, ans=0.1 2023-06-18 09:57:31,609 INFO [train.py:996] (2/4) Epoch 1, batch 30150, loss[loss=0.3736, simple_loss=0.4007, pruned_loss=0.1732, over 21567.00 frames. ], tot_loss[loss=0.3343, simple_loss=0.3835, pruned_loss=0.1425, over 4263662.98 frames. ], batch size: 230, lr: 2.21e-02, grad_scale: 64.0 2023-06-18 09:58:13,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=181020.0, ans=0.0 2023-06-18 09:58:13,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=181020.0, ans=0.125 2023-06-18 09:58:17,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-18 09:58:43,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=181080.0, ans=0.125 2023-06-18 09:58:56,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=181140.0, ans=0.5 2023-06-18 09:59:05,030 INFO [train.py:996] (2/4) Epoch 1, batch 30200, loss[loss=0.3363, simple_loss=0.3953, pruned_loss=0.1387, over 21184.00 frames. 
], tot_loss[loss=0.3332, simple_loss=0.3854, pruned_loss=0.1405, over 4267010.76 frames. ], batch size: 143, lr: 2.21e-02, grad_scale: 64.0 2023-06-18 09:59:05,352 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 09:59:38,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=181260.0, ans=0.125 2023-06-18 09:59:40,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-06-18 09:59:54,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=181320.0, ans=22.5 2023-06-18 10:00:18,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=181380.0, ans=15.0 2023-06-18 10:00:23,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.642e+02 3.548e+02 4.753e+02 6.643e+02 1.324e+03, threshold=9.506e+02, percent-clipped=12.0 2023-06-18 10:00:32,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=181440.0, ans=0.125 2023-06-18 10:00:51,982 INFO [train.py:996] (2/4) Epoch 1, batch 30250, loss[loss=0.4327, simple_loss=0.5038, pruned_loss=0.1808, over 21648.00 frames. ], tot_loss[loss=0.3419, simple_loss=0.3954, pruned_loss=0.1442, over 4272781.53 frames. ], batch size: 414, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 10:01:21,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=181560.0, ans=0.0 2023-06-18 10:01:37,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-18 10:02:14,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=181740.0, ans=0.125 2023-06-18 10:02:32,634 INFO [train.py:996] (2/4) Epoch 1, batch 30300, loss[loss=0.3035, simple_loss=0.3408, pruned_loss=0.1331, over 21806.00 frames. ], tot_loss[loss=0.3379, simple_loss=0.3899, pruned_loss=0.143, over 4277605.11 frames. ], batch size: 98, lr: 2.21e-02, grad_scale: 32.0 2023-06-18 10:02:37,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=181800.0, ans=0.025 2023-06-18 10:02:52,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=181860.0, ans=0.2 2023-06-18 10:03:21,620 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=12.0 2023-06-18 10:03:24,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.39 vs. 
limit=15.0 2023-06-18 10:03:28,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=181980.0, ans=0.04949747468305833 2023-06-18 10:03:30,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=181980.0, ans=0.1 2023-06-18 10:03:47,903 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.818e+02 4.498e+02 5.757e+02 1.296e+03, threshold=8.996e+02, percent-clipped=5.0 2023-06-18 10:03:48,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=181980.0, ans=0.0 2023-06-18 10:04:11,451 INFO [train.py:996] (2/4) Epoch 1, batch 30350, loss[loss=0.4305, simple_loss=0.4646, pruned_loss=0.1982, over 21514.00 frames. ], tot_loss[loss=0.3418, simple_loss=0.3926, pruned_loss=0.1455, over 4270045.26 frames. ], batch size: 509, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:04:49,083 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:05:08,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=182280.0, ans=0.0 2023-06-18 10:05:09,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=182280.0, ans=0.0 2023-06-18 10:05:11,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=182340.0, ans=0.04949747468305833 2023-06-18 10:05:26,627 INFO [train.py:996] (2/4) Epoch 1, batch 30400, loss[loss=0.3454, simple_loss=0.3555, pruned_loss=0.1676, over 20352.00 frames. ], tot_loss[loss=0.3342, simple_loss=0.3842, pruned_loss=0.1421, over 4261824.23 frames. ], batch size: 703, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:06:07,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=182520.0, ans=0.0 2023-06-18 10:06:31,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 4.537e+02 5.674e+02 8.183e+02 2.727e+03, threshold=1.135e+03, percent-clipped=13.0 2023-06-18 10:06:47,841 INFO [train.py:996] (2/4) Epoch 1, batch 30450, loss[loss=0.4055, simple_loss=0.5055, pruned_loss=0.1527, over 19917.00 frames. ], tot_loss[loss=0.3378, simple_loss=0.388, pruned_loss=0.1438, over 4202760.99 frames. ], batch size: 702, lr: 2.20e-02, grad_scale: 32.0 2023-06-18 10:07:03,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=182760.0, ans=0.0 2023-06-18 10:07:05,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=182760.0, ans=0.5 2023-06-18 10:07:38,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=12.0 2023-06-18 10:09:26,536 INFO [train.py:996] (2/4) Epoch 2, batch 0, loss[loss=0.4268, simple_loss=0.4297, pruned_loss=0.2119, over 21770.00 frames. ], tot_loss[loss=0.4268, simple_loss=0.4297, pruned_loss=0.2119, over 21770.00 frames. ], batch size: 102, lr: 2.01e-02, grad_scale: 32.0 2023-06-18 10:09:26,536 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 10:09:43,673 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.3124, simple_loss=0.4068, pruned_loss=0.109, over 1796401.00 frames. 
2023-06-18 10:09:43,674 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 10:10:02,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=183030.0, ans=0.125 2023-06-18 10:10:17,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=183090.0, ans=0.1 2023-06-18 10:10:32,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=183150.0, ans=0.125 2023-06-18 10:10:33,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=183150.0, ans=0.5 2023-06-18 10:10:50,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=183150.0, ans=0.2 2023-06-18 10:10:54,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=183150.0, ans=0.0 2023-06-18 10:11:01,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=183210.0, ans=0.0 2023-06-18 10:11:04,121 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 4.740e+02 6.519e+02 1.031e+03 2.172e+03, threshold=1.304e+03, percent-clipped=18.0 2023-06-18 10:11:13,455 INFO [train.py:996] (2/4) Epoch 2, batch 50, loss[loss=0.302, simple_loss=0.365, pruned_loss=0.1195, over 21235.00 frames. ], tot_loss[loss=0.3344, simple_loss=0.3825, pruned_loss=0.1432, over 952371.44 frames. ], batch size: 176, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:11:37,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=183330.0, ans=0.0 2023-06-18 10:11:37,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=183330.0, ans=0.2 2023-06-18 10:11:52,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=22.5 2023-06-18 10:11:56,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=183390.0, ans=0.2 2023-06-18 10:11:58,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=183390.0, ans=0.2 2023-06-18 10:12:09,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=183450.0, ans=0.0 2023-06-18 10:12:35,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=183510.0, ans=0.1 2023-06-18 10:12:49,800 INFO [train.py:996] (2/4) Epoch 2, batch 100, loss[loss=0.3513, simple_loss=0.4359, pruned_loss=0.1334, over 21742.00 frames. ], tot_loss[loss=0.3452, simple_loss=0.404, pruned_loss=0.1432, over 1690134.81 frames. 
], batch size: 332, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:12:59,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=183570.0, ans=0.125 2023-06-18 10:13:02,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=183570.0, ans=0.125 2023-06-18 10:13:04,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=183570.0, ans=0.0 2023-06-18 10:14:03,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=183750.0, ans=0.0 2023-06-18 10:14:16,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.204e+02 3.494e+02 4.383e+02 5.480e+02 8.773e+02, threshold=8.766e+02, percent-clipped=0.0 2023-06-18 10:14:19,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=183810.0, ans=0.0 2023-06-18 10:14:19,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=183810.0, ans=0.0 2023-06-18 10:14:29,955 INFO [train.py:996] (2/4) Epoch 2, batch 150, loss[loss=0.3258, simple_loss=0.3865, pruned_loss=0.1325, over 21470.00 frames. ], tot_loss[loss=0.3469, simple_loss=0.4052, pruned_loss=0.1443, over 2249611.81 frames. ], batch size: 131, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:14:34,968 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:15:00,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.21 vs. limit=22.5 2023-06-18 10:15:22,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=184050.0, ans=0.0 2023-06-18 10:15:59,064 INFO [train.py:996] (2/4) Epoch 2, batch 200, loss[loss=0.3471, simple_loss=0.3904, pruned_loss=0.1519, over 21866.00 frames. ], tot_loss[loss=0.3411, simple_loss=0.3998, pruned_loss=0.1412, over 2697163.76 frames. ], batch size: 371, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:16:53,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=184350.0, ans=0.2 2023-06-18 10:17:15,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=184350.0, ans=0.0 2023-06-18 10:17:20,990 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-18 10:17:24,371 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.721e+02 4.278e+02 5.715e+02 9.625e+02, threshold=8.556e+02, percent-clipped=2.0 2023-06-18 10:17:33,214 INFO [train.py:996] (2/4) Epoch 2, batch 250, loss[loss=0.2964, simple_loss=0.3502, pruned_loss=0.1213, over 21748.00 frames. ], tot_loss[loss=0.338, simple_loss=0.3943, pruned_loss=0.1409, over 3053041.77 frames. 
], batch size: 371, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:17:33,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=184470.0, ans=10.0 2023-06-18 10:18:19,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=15.0 2023-06-18 10:18:28,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=184650.0, ans=0.125 2023-06-18 10:18:58,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=184710.0, ans=0.1 2023-06-18 10:19:15,578 INFO [train.py:996] (2/4) Epoch 2, batch 300, loss[loss=0.3091, simple_loss=0.3525, pruned_loss=0.1329, over 21341.00 frames. ], tot_loss[loss=0.3372, simple_loss=0.3905, pruned_loss=0.1419, over 3322958.65 frames. ], batch size: 159, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:19:18,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=184770.0, ans=0.125 2023-06-18 10:20:29,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.38 vs. limit=15.0 2023-06-18 10:20:35,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=185010.0, ans=0.125 2023-06-18 10:20:39,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.502e+02 4.825e+02 6.381e+02 1.072e+03, threshold=9.650e+02, percent-clipped=6.0 2023-06-18 10:20:48,553 INFO [train.py:996] (2/4) Epoch 2, batch 350, loss[loss=0.3289, simple_loss=0.4134, pruned_loss=0.1222, over 21334.00 frames. ], tot_loss[loss=0.3296, simple_loss=0.3833, pruned_loss=0.1379, over 3530180.23 frames. ], batch size: 548, lr: 2.00e-02, grad_scale: 32.0 2023-06-18 10:21:22,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=185130.0, ans=0.1 2023-06-18 10:21:24,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185130.0, ans=0.1 2023-06-18 10:21:47,276 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-18 10:22:18,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=185310.0, ans=0.125 2023-06-18 10:22:26,795 INFO [train.py:996] (2/4) Epoch 2, batch 400, loss[loss=0.364, simple_loss=0.4099, pruned_loss=0.159, over 21355.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3752, pruned_loss=0.136, over 3694127.01 frames. 
], batch size: 471, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:22:37,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=185370.0, ans=0.125 2023-06-18 10:22:56,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185430.0, ans=0.1 2023-06-18 10:23:02,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=185430.0, ans=0.1 2023-06-18 10:23:20,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=22.5 2023-06-18 10:23:48,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=185610.0, ans=0.125 2023-06-18 10:23:49,220 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.316e+02 4.407e+02 6.179e+02 1.311e+03, threshold=8.814e+02, percent-clipped=2.0 2023-06-18 10:23:58,121 INFO [train.py:996] (2/4) Epoch 2, batch 450, loss[loss=0.3479, simple_loss=0.4276, pruned_loss=0.1341, over 21535.00 frames. ], tot_loss[loss=0.3203, simple_loss=0.3714, pruned_loss=0.1346, over 3822890.59 frames. ], batch size: 508, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:24:06,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=185670.0, ans=0.025 2023-06-18 10:24:15,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=185670.0, ans=0.125 2023-06-18 10:24:21,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=185730.0, ans=0.125 2023-06-18 10:24:43,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=185790.0, ans=0.125 2023-06-18 10:25:03,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=185850.0, ans=0.2 2023-06-18 10:25:08,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=185850.0, ans=0.0 2023-06-18 10:25:33,376 INFO [train.py:996] (2/4) Epoch 2, batch 500, loss[loss=0.3296, simple_loss=0.408, pruned_loss=0.1256, over 21756.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3742, pruned_loss=0.1339, over 3920358.05 frames. ], batch size: 298, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:25:43,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=185970.0, ans=0.0 2023-06-18 10:26:17,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=186090.0, ans=0.125 2023-06-18 10:26:55,382 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 3.837e+02 4.927e+02 6.704e+02 1.422e+03, threshold=9.853e+02, percent-clipped=11.0 2023-06-18 10:27:04,395 INFO [train.py:996] (2/4) Epoch 2, batch 550, loss[loss=0.3499, simple_loss=0.3862, pruned_loss=0.1568, over 21881.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3754, pruned_loss=0.1334, over 4002250.79 frames. 
], batch size: 351, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:27:55,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=186390.0, ans=0.125 2023-06-18 10:28:18,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-18 10:28:43,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=186510.0, ans=0.125 2023-06-18 10:28:46,194 INFO [train.py:996] (2/4) Epoch 2, batch 600, loss[loss=0.4186, simple_loss=0.4862, pruned_loss=0.1755, over 21523.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.379, pruned_loss=0.1346, over 4064836.92 frames. ], batch size: 471, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:29:14,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=186630.0, ans=0.125 2023-06-18 10:29:55,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=186750.0, ans=0.2 2023-06-18 10:30:07,493 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.448e+02 3.420e+02 4.275e+02 5.622e+02 1.549e+03, threshold=8.550e+02, percent-clipped=4.0 2023-06-18 10:30:21,318 INFO [train.py:996] (2/4) Epoch 2, batch 650, loss[loss=0.3014, simple_loss=0.3421, pruned_loss=0.1303, over 21223.00 frames. ], tot_loss[loss=0.3256, simple_loss=0.3808, pruned_loss=0.1352, over 4119671.47 frames. ], batch size: 143, lr: 1.99e-02, grad_scale: 32.0 2023-06-18 10:30:42,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=186930.0, ans=0.125 2023-06-18 10:31:32,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=187050.0, ans=0.0 2023-06-18 10:31:41,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-18 10:31:56,864 INFO [train.py:996] (2/4) Epoch 2, batch 700, loss[loss=0.3709, simple_loss=0.4349, pruned_loss=0.1534, over 21689.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3813, pruned_loss=0.1351, over 4152293.03 frames. ], batch size: 389, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:32:01,531 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:32:26,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.70 vs. limit=15.0 2023-06-18 10:33:02,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=22.5 2023-06-18 10:33:18,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 3.924e+02 4.620e+02 5.995e+02 1.020e+03, threshold=9.239e+02, percent-clipped=3.0 2023-06-18 10:33:32,515 INFO [train.py:996] (2/4) Epoch 2, batch 750, loss[loss=0.3522, simple_loss=0.457, pruned_loss=0.1237, over 20786.00 frames. ], tot_loss[loss=0.3289, simple_loss=0.3825, pruned_loss=0.1376, over 4184551.81 frames. 
], batch size: 607, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:34:02,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-18 10:34:46,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=187710.0, ans=0.0 2023-06-18 10:34:51,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=187710.0, ans=0.125 2023-06-18 10:35:07,275 INFO [train.py:996] (2/4) Epoch 2, batch 800, loss[loss=0.2958, simple_loss=0.3457, pruned_loss=0.123, over 21592.00 frames. ], tot_loss[loss=0.3278, simple_loss=0.3804, pruned_loss=0.1376, over 4201336.53 frames. ], batch size: 263, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:35:39,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=187830.0, ans=0.2 2023-06-18 10:35:46,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-18 10:36:06,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=187950.0, ans=0.2 2023-06-18 10:36:19,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=187950.0, ans=0.125 2023-06-18 10:36:28,272 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.568e+02 4.374e+02 5.699e+02 1.207e+03, threshold=8.749e+02, percent-clipped=3.0 2023-06-18 10:36:30,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=188010.0, ans=0.0 2023-06-18 10:36:39,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=188010.0, ans=0.2 2023-06-18 10:36:41,764 INFO [train.py:996] (2/4) Epoch 2, batch 850, loss[loss=0.3153, simple_loss=0.3504, pruned_loss=0.14, over 21572.00 frames. ], tot_loss[loss=0.3251, simple_loss=0.3768, pruned_loss=0.1368, over 4221976.24 frames. ], batch size: 441, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:37:32,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-18 10:38:17,645 INFO [train.py:996] (2/4) Epoch 2, batch 900, loss[loss=0.3216, simple_loss=0.3699, pruned_loss=0.1366, over 21730.00 frames. ], tot_loss[loss=0.3228, simple_loss=0.3733, pruned_loss=0.1361, over 4242110.58 frames. ], batch size: 389, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:38:18,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=188370.0, ans=0.125 2023-06-18 10:38:56,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=188490.0, ans=0.2 2023-06-18 10:39:30,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-18 10:39:33,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.70 vs. 
limit=12.0 2023-06-18 10:39:45,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.152e+02 3.796e+02 5.190e+02 9.493e+02, threshold=7.592e+02, percent-clipped=1.0 2023-06-18 10:39:49,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=188610.0, ans=0.2 2023-06-18 10:39:55,107 INFO [train.py:996] (2/4) Epoch 2, batch 950, loss[loss=0.2331, simple_loss=0.3033, pruned_loss=0.08143, over 21219.00 frames. ], tot_loss[loss=0.3185, simple_loss=0.3689, pruned_loss=0.134, over 4253398.64 frames. ], batch size: 159, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:40:06,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=188670.0, ans=0.1 2023-06-18 10:40:20,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=188730.0, ans=0.0 2023-06-18 10:40:22,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=188730.0, ans=0.07 2023-06-18 10:41:21,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-18 10:41:30,581 INFO [train.py:996] (2/4) Epoch 2, batch 1000, loss[loss=0.3596, simple_loss=0.4025, pruned_loss=0.1584, over 21787.00 frames. ], tot_loss[loss=0.3183, simple_loss=0.3692, pruned_loss=0.1337, over 4269660.46 frames. ], batch size: 124, lr: 1.98e-02, grad_scale: 32.0 2023-06-18 10:42:30,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-18 10:43:00,423 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.236e+02 4.022e+02 4.696e+02 7.726e+02, threshold=8.043e+02, percent-clipped=1.0 2023-06-18 10:43:09,841 INFO [train.py:996] (2/4) Epoch 2, batch 1050, loss[loss=0.3222, simple_loss=0.3806, pruned_loss=0.1319, over 21545.00 frames. ], tot_loss[loss=0.3209, simple_loss=0.3712, pruned_loss=0.1354, over 4274208.24 frames. ], batch size: 471, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:43:22,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. 
limit=15.0 2023-06-18 10:43:39,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=189330.0, ans=0.0 2023-06-18 10:43:48,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=189330.0, ans=0.125 2023-06-18 10:44:27,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=189450.0, ans=0.0 2023-06-18 10:44:29,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=189510.0, ans=0.125 2023-06-18 10:44:32,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=189510.0, ans=0.125 2023-06-18 10:44:35,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=189510.0, ans=0.125 2023-06-18 10:44:45,551 INFO [train.py:996] (2/4) Epoch 2, batch 1100, loss[loss=0.2716, simple_loss=0.3079, pruned_loss=0.1177, over 21213.00 frames. ], tot_loss[loss=0.32, simple_loss=0.3707, pruned_loss=0.1347, over 4274894.86 frames. ], batch size: 548, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:45:26,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=189630.0, ans=0.0 2023-06-18 10:45:59,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=189750.0, ans=0.07 2023-06-18 10:46:13,015 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 4.116e+02 5.178e+02 8.115e+02 1.294e+03, threshold=1.036e+03, percent-clipped=24.0 2023-06-18 10:46:27,153 INFO [train.py:996] (2/4) Epoch 2, batch 1150, loss[loss=0.2502, simple_loss=0.2806, pruned_loss=0.1099, over 16315.00 frames. ], tot_loss[loss=0.3191, simple_loss=0.3709, pruned_loss=0.1337, over 4278029.23 frames. ], batch size: 60, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:46:32,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=189870.0, ans=0.0 2023-06-18 10:46:41,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=189870.0, ans=0.2 2023-06-18 10:46:44,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=189870.0, ans=0.125 2023-06-18 10:47:29,990 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=5.302e-03 2023-06-18 10:47:50,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=190110.0, ans=0.125 2023-06-18 10:47:57,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=190110.0, ans=0.04949747468305833 2023-06-18 10:48:04,118 INFO [train.py:996] (2/4) Epoch 2, batch 1200, loss[loss=0.3407, simple_loss=0.3866, pruned_loss=0.1474, over 21530.00 frames. ], tot_loss[loss=0.3169, simple_loss=0.3694, pruned_loss=0.1322, over 4278027.83 frames. 
], batch size: 230, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:48:32,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=190230.0, ans=0.125 2023-06-18 10:48:37,365 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:48:49,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-18 10:48:52,883 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 10:49:06,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=190350.0, ans=0.125 2023-06-18 10:49:32,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.620e+02 3.616e+02 4.499e+02 6.126e+02 1.054e+03, threshold=8.999e+02, percent-clipped=1.0 2023-06-18 10:49:35,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=190410.0, ans=0.0 2023-06-18 10:49:41,279 INFO [train.py:996] (2/4) Epoch 2, batch 1250, loss[loss=0.3364, simple_loss=0.3826, pruned_loss=0.1451, over 21958.00 frames. ], tot_loss[loss=0.3204, simple_loss=0.3724, pruned_loss=0.1342, over 4284124.00 frames. ], batch size: 316, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:50:08,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=17.07 vs. limit=15.0 2023-06-18 10:51:01,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-18 10:51:07,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=190710.0, ans=0.025 2023-06-18 10:51:22,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=190770.0, ans=0.0 2023-06-18 10:51:23,599 INFO [train.py:996] (2/4) Epoch 2, batch 1300, loss[loss=0.2729, simple_loss=0.3462, pruned_loss=0.09979, over 21634.00 frames. ], tot_loss[loss=0.3223, simple_loss=0.3743, pruned_loss=0.1352, over 4285743.64 frames. ], batch size: 230, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:51:53,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=190830.0, ans=0.125 2023-06-18 10:52:10,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.94 vs. limit=10.0 2023-06-18 10:52:45,327 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 3.558e+02 4.499e+02 5.832e+02 1.027e+03, threshold=8.998e+02, percent-clipped=2.0 2023-06-18 10:53:00,058 INFO [train.py:996] (2/4) Epoch 2, batch 1350, loss[loss=0.339, simple_loss=0.3869, pruned_loss=0.1456, over 21821.00 frames. ], tot_loss[loss=0.3241, simple_loss=0.3756, pruned_loss=0.1363, over 4288453.89 frames. ], batch size: 107, lr: 1.97e-02, grad_scale: 32.0 2023-06-18 10:53:12,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. 
limit=15.0 2023-06-18 10:54:02,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=191250.0, ans=0.125 2023-06-18 10:54:05,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=191250.0, ans=0.125 2023-06-18 10:54:07,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=191250.0, ans=0.04949747468305833 2023-06-18 10:54:41,382 INFO [train.py:996] (2/4) Epoch 2, batch 1400, loss[loss=0.3074, simple_loss=0.3434, pruned_loss=0.1357, over 21733.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3742, pruned_loss=0.136, over 4283104.02 frames. ], batch size: 316, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:54:41,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=191370.0, ans=0.125 2023-06-18 10:55:22,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=191490.0, ans=0.125 2023-06-18 10:55:28,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.14 vs. limit=15.0 2023-06-18 10:55:38,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=191550.0, ans=0.5 2023-06-18 10:55:48,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=191550.0, ans=0.125 2023-06-18 10:56:04,087 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.616e+02 4.167e+02 4.895e+02 9.301e+02, threshold=8.333e+02, percent-clipped=3.0 2023-06-18 10:56:18,045 INFO [train.py:996] (2/4) Epoch 2, batch 1450, loss[loss=0.3556, simple_loss=0.3931, pruned_loss=0.159, over 21292.00 frames. ], tot_loss[loss=0.3255, simple_loss=0.3762, pruned_loss=0.1374, over 4285598.30 frames. ], batch size: 176, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:57:02,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=191790.0, ans=0.1 2023-06-18 10:57:10,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=191850.0, ans=0.05 2023-06-18 10:57:54,136 INFO [train.py:996] (2/4) Epoch 2, batch 1500, loss[loss=0.3387, simple_loss=0.3783, pruned_loss=0.1496, over 21890.00 frames. ], tot_loss[loss=0.3291, simple_loss=0.3783, pruned_loss=0.14, over 4291539.81 frames. ], batch size: 371, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:58:49,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=192150.0, ans=0.125 2023-06-18 10:59:22,600 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.381e+02 3.361e+02 4.007e+02 4.888e+02 8.078e+02, threshold=8.013e+02, percent-clipped=0.0 2023-06-18 10:59:24,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.87 vs. limit=15.0 2023-06-18 10:59:32,407 INFO [train.py:996] (2/4) Epoch 2, batch 1550, loss[loss=0.2636, simple_loss=0.3171, pruned_loss=0.1051, over 21784.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3744, pruned_loss=0.1373, over 4285034.27 frames. 
], batch size: 124, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 10:59:55,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.87 vs. limit=10.0 2023-06-18 11:01:16,373 INFO [train.py:996] (2/4) Epoch 2, batch 1600, loss[loss=0.2673, simple_loss=0.3219, pruned_loss=0.1063, over 21810.00 frames. ], tot_loss[loss=0.3215, simple_loss=0.3718, pruned_loss=0.1357, over 4288881.52 frames. ], batch size: 282, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:01:42,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=192630.0, ans=0.07 2023-06-18 11:02:34,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=192810.0, ans=0.125 2023-06-18 11:02:44,873 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.757e+02 4.591e+02 6.473e+02 1.240e+03, threshold=9.183e+02, percent-clipped=13.0 2023-06-18 11:02:54,039 INFO [train.py:996] (2/4) Epoch 2, batch 1650, loss[loss=0.3169, simple_loss=0.3658, pruned_loss=0.134, over 21442.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.3705, pruned_loss=0.1336, over 4288369.81 frames. ], batch size: 194, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:03:14,837 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:04:25,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=193110.0, ans=0.2 2023-06-18 11:04:31,332 INFO [train.py:996] (2/4) Epoch 2, batch 1700, loss[loss=0.4385, simple_loss=0.4659, pruned_loss=0.2055, over 21418.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3759, pruned_loss=0.1366, over 4286853.38 frames. ], batch size: 507, lr: 1.96e-02, grad_scale: 32.0 2023-06-18 11:04:36,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=193170.0, ans=0.1 2023-06-18 11:04:41,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=193170.0, ans=0.0 2023-06-18 11:05:56,007 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.866e+02 3.867e+02 4.593e+02 5.670e+02 8.844e+02, threshold=9.185e+02, percent-clipped=0.0 2023-06-18 11:05:56,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=193410.0, ans=0.125 2023-06-18 11:06:05,808 INFO [train.py:996] (2/4) Epoch 2, batch 1750, loss[loss=0.2638, simple_loss=0.345, pruned_loss=0.09133, over 21815.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3769, pruned_loss=0.1346, over 4286586.24 frames. ], batch size: 316, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:06:36,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=193530.0, ans=0.125 2023-06-18 11:07:09,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2023-06-18 11:07:37,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. 
limit=6.0 2023-06-18 11:07:39,030 INFO [train.py:996] (2/4) Epoch 2, batch 1800, loss[loss=0.2604, simple_loss=0.333, pruned_loss=0.09389, over 21641.00 frames. ], tot_loss[loss=0.3192, simple_loss=0.3739, pruned_loss=0.1322, over 4280393.58 frames. ], batch size: 263, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:08:35,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=193890.0, ans=0.125 2023-06-18 11:08:50,929 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:09:05,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=194010.0, ans=0.0 2023-06-18 11:09:07,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=21.32 vs. limit=22.5 2023-06-18 11:09:07,966 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.185e+02 3.782e+02 4.585e+02 7.556e+02, threshold=7.564e+02, percent-clipped=0.0 2023-06-18 11:09:17,326 INFO [train.py:996] (2/4) Epoch 2, batch 1850, loss[loss=0.2889, simple_loss=0.3451, pruned_loss=0.1164, over 21232.00 frames. ], tot_loss[loss=0.3126, simple_loss=0.3705, pruned_loss=0.1274, over 4278535.50 frames. ], batch size: 143, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:09:47,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=194130.0, ans=0.05 2023-06-18 11:10:07,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-18 11:10:18,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=194250.0, ans=0.0 2023-06-18 11:10:38,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=194310.0, ans=0.125 2023-06-18 11:10:48,410 INFO [train.py:996] (2/4) Epoch 2, batch 1900, loss[loss=0.3142, simple_loss=0.3487, pruned_loss=0.1399, over 21875.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3714, pruned_loss=0.1284, over 4280177.69 frames. ], batch size: 118, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:10:56,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=194370.0, ans=0.0 2023-06-18 11:11:48,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-18 11:12:15,718 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.697e+02 4.739e+02 6.641e+02 1.232e+03, threshold=9.479e+02, percent-clipped=18.0 2023-06-18 11:12:23,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=194670.0, ans=0.125 2023-06-18 11:12:24,860 INFO [train.py:996] (2/4) Epoch 2, batch 1950, loss[loss=0.2506, simple_loss=0.3242, pruned_loss=0.08847, over 21640.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.366, pruned_loss=0.1281, over 4274335.08 frames. 
], batch size: 263, lr: 1.95e-02, grad_scale: 32.0 2023-06-18 11:12:43,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=194670.0, ans=0.0 2023-06-18 11:12:49,561 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.14 vs. limit=12.0 2023-06-18 11:12:53,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=194730.0, ans=0.125 2023-06-18 11:13:15,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=194790.0, ans=0.025 2023-06-18 11:13:20,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=194790.0, ans=0.2 2023-06-18 11:13:33,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=194850.0, ans=0.125 2023-06-18 11:13:44,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=194850.0, ans=0.0 2023-06-18 11:14:14,861 INFO [train.py:996] (2/4) Epoch 2, batch 2000, loss[loss=0.443, simple_loss=0.4959, pruned_loss=0.1951, over 21508.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3604, pruned_loss=0.1263, over 4270503.29 frames. ], batch size: 471, lr: 1.95e-02, grad_scale: 64.0 2023-06-18 11:14:23,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=194970.0, ans=0.0 2023-06-18 11:14:31,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-18 11:14:39,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-18 11:14:56,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.84 vs. limit=15.0 2023-06-18 11:15:00,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-18 11:15:02,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.60 vs. limit=10.0 2023-06-18 11:15:22,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=195150.0, ans=10.0 2023-06-18 11:15:31,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.372e+02 4.347e+02 5.379e+02 1.010e+03, threshold=8.694e+02, percent-clipped=3.0 2023-06-18 11:15:36,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=195210.0, ans=0.125 2023-06-18 11:15:38,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=195210.0, ans=0.125 2023-06-18 11:15:45,823 INFO [train.py:996] (2/4) Epoch 2, batch 2050, loss[loss=0.3978, simple_loss=0.4283, pruned_loss=0.1837, over 21614.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3613, pruned_loss=0.1258, over 4262992.86 frames. 
], batch size: 471, lr: 1.95e-02, grad_scale: 64.0 2023-06-18 11:16:08,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.84 vs. limit=22.5 2023-06-18 11:16:26,026 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.67 vs. limit=15.0 2023-06-18 11:16:32,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195390.0, ans=0.1 2023-06-18 11:16:36,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=195390.0, ans=0.07 2023-06-18 11:16:51,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=195450.0, ans=0.2 2023-06-18 11:16:55,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.31 vs. limit=12.0 2023-06-18 11:17:22,776 INFO [train.py:996] (2/4) Epoch 2, batch 2100, loss[loss=0.302, simple_loss=0.39, pruned_loss=0.107, over 21567.00 frames. ], tot_loss[loss=0.3112, simple_loss=0.3663, pruned_loss=0.128, over 4266509.90 frames. ], batch size: 441, lr: 1.94e-02, grad_scale: 64.0 2023-06-18 11:17:26,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=195570.0, ans=0.1 2023-06-18 11:17:33,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=195570.0, ans=0.05 2023-06-18 11:18:21,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=195750.0, ans=0.125 2023-06-18 11:18:27,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-18 11:18:52,351 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.854e+02 4.642e+02 6.317e+02 1.235e+03, threshold=9.284e+02, percent-clipped=5.0 2023-06-18 11:18:59,865 INFO [train.py:996] (2/4) Epoch 2, batch 2150, loss[loss=0.2835, simple_loss=0.3358, pruned_loss=0.1156, over 21811.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.37, pruned_loss=0.1318, over 4261827.54 frames. ], batch size: 317, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:19:00,357 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:19:05,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=15.0 2023-06-18 11:20:12,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=196110.0, ans=0.125 2023-06-18 11:20:13,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=196110.0, ans=0.0 2023-06-18 11:20:32,188 INFO [train.py:996] (2/4) Epoch 2, batch 2200, loss[loss=0.3469, simple_loss=0.4102, pruned_loss=0.1418, over 21710.00 frames. ], tot_loss[loss=0.3186, simple_loss=0.3723, pruned_loss=0.1325, over 4266546.67 frames. 
], batch size: 414, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:20:43,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=196170.0, ans=0.2 2023-06-18 11:21:00,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=12.0 2023-06-18 11:21:16,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=196290.0, ans=0.125 2023-06-18 11:21:51,290 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.406e+02 4.174e+02 5.326e+02 1.037e+03, threshold=8.349e+02, percent-clipped=3.0 2023-06-18 11:21:59,047 INFO [train.py:996] (2/4) Epoch 2, batch 2250, loss[loss=0.2645, simple_loss=0.3412, pruned_loss=0.09388, over 21728.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3688, pruned_loss=0.1287, over 4268574.15 frames. ], batch size: 332, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:23:24,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=196710.0, ans=0.2 2023-06-18 11:23:34,868 INFO [train.py:996] (2/4) Epoch 2, batch 2300, loss[loss=0.3368, simple_loss=0.3592, pruned_loss=0.1572, over 21864.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3646, pruned_loss=0.1278, over 4269262.62 frames. ], batch size: 373, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:24:00,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.78 vs. limit=15.0 2023-06-18 11:24:58,612 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 3.491e+02 4.298e+02 5.253e+02 1.181e+03, threshold=8.597e+02, percent-clipped=4.0 2023-06-18 11:25:02,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197010.0, ans=0.1 2023-06-18 11:25:06,159 INFO [train.py:996] (2/4) Epoch 2, batch 2350, loss[loss=0.2929, simple_loss=0.3349, pruned_loss=0.1254, over 21506.00 frames. ], tot_loss[loss=0.3086, simple_loss=0.3598, pruned_loss=0.1287, over 4272265.35 frames. ], batch size: 391, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:25:10,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=197070.0, ans=0.0 2023-06-18 11:25:32,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-18 11:26:03,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197250.0, ans=0.1 2023-06-18 11:26:17,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=197250.0, ans=0.125 2023-06-18 11:26:37,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. limit=6.0 2023-06-18 11:26:43,928 INFO [train.py:996] (2/4) Epoch 2, batch 2400, loss[loss=0.3038, simple_loss=0.336, pruned_loss=0.1358, over 21245.00 frames. ], tot_loss[loss=0.3137, simple_loss=0.363, pruned_loss=0.1322, over 4276100.79 frames. 
], batch size: 548, lr: 1.94e-02, grad_scale: 32.0 2023-06-18 11:26:58,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=197370.0, ans=0.0 2023-06-18 11:27:08,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.34 vs. limit=15.0 2023-06-18 11:27:37,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=197490.0, ans=0.95 2023-06-18 11:28:18,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.754e+02 3.750e+02 4.331e+02 6.076e+02 1.202e+03, threshold=8.663e+02, percent-clipped=8.0 2023-06-18 11:28:31,375 INFO [train.py:996] (2/4) Epoch 2, batch 2450, loss[loss=0.3252, simple_loss=0.3902, pruned_loss=0.1301, over 21277.00 frames. ], tot_loss[loss=0.3201, simple_loss=0.3704, pruned_loss=0.1349, over 4269793.43 frames. ], batch size: 143, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:28:36,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=197670.0, ans=0.1 2023-06-18 11:28:37,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=197670.0, ans=10.0 2023-06-18 11:28:50,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=197730.0, ans=0.05 2023-06-18 11:29:36,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=197850.0, ans=0.125 2023-06-18 11:29:46,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-18 11:29:51,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=197910.0, ans=0.125 2023-06-18 11:30:04,084 INFO [train.py:996] (2/4) Epoch 2, batch 2500, loss[loss=0.287, simple_loss=0.3726, pruned_loss=0.1007, over 21404.00 frames. ], tot_loss[loss=0.3203, simple_loss=0.3697, pruned_loss=0.1355, over 4272403.46 frames. ], batch size: 194, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:30:33,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=198030.0, ans=0.1 2023-06-18 11:30:43,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=198090.0, ans=0.125 2023-06-18 11:31:16,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=198150.0, ans=0.2 2023-06-18 11:31:34,336 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.317e+02 4.381e+02 5.204e+02 7.754e+02, threshold=8.763e+02, percent-clipped=1.0 2023-06-18 11:31:46,956 INFO [train.py:996] (2/4) Epoch 2, batch 2550, loss[loss=0.2953, simple_loss=0.342, pruned_loss=0.1243, over 21415.00 frames. ], tot_loss[loss=0.3185, simple_loss=0.3686, pruned_loss=0.1342, over 4267518.35 frames. 
], batch size: 194, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:31:53,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=198270.0, ans=0.0 2023-06-18 11:31:58,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=198270.0, ans=0.0 2023-06-18 11:32:12,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=198330.0, ans=0.125 2023-06-18 11:32:20,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=198390.0, ans=0.125 2023-06-18 11:32:51,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=22.5 2023-06-18 11:33:09,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=198510.0, ans=0.125 2023-06-18 11:33:18,198 INFO [train.py:996] (2/4) Epoch 2, batch 2600, loss[loss=0.3181, simple_loss=0.3586, pruned_loss=0.1388, over 21571.00 frames. ], tot_loss[loss=0.3208, simple_loss=0.3692, pruned_loss=0.1362, over 4263713.18 frames. ], batch size: 230, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:33:18,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=198570.0, ans=0.05 2023-06-18 11:34:36,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=198810.0, ans=15.0 2023-06-18 11:34:47,551 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.517e+02 3.436e+02 4.244e+02 5.240e+02 1.197e+03, threshold=8.488e+02, percent-clipped=2.0 2023-06-18 11:34:55,308 INFO [train.py:996] (2/4) Epoch 2, batch 2650, loss[loss=0.2925, simple_loss=0.3646, pruned_loss=0.1102, over 21610.00 frames. ], tot_loss[loss=0.3234, simple_loss=0.3713, pruned_loss=0.1377, over 4273646.53 frames. ], batch size: 230, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:35:21,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=198930.0, ans=0.125 2023-06-18 11:35:37,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=198990.0, ans=0.125 2023-06-18 11:36:07,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=199050.0, ans=0.035 2023-06-18 11:36:25,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=199110.0, ans=0.0 2023-06-18 11:36:26,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=199110.0, ans=0.125 2023-06-18 11:36:39,088 INFO [train.py:996] (2/4) Epoch 2, batch 2700, loss[loss=0.3303, simple_loss=0.4347, pruned_loss=0.113, over 19765.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3694, pruned_loss=0.1358, over 4264411.36 frames. 
], batch size: 703, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:36:39,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=199170.0, ans=0.2 2023-06-18 11:36:42,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199170.0, ans=0.1 2023-06-18 11:36:48,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=199170.0, ans=0.125 2023-06-18 11:37:00,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=199230.0, ans=0.125 2023-06-18 11:37:42,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=199350.0, ans=10.0 2023-06-18 11:38:02,903 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.843e+02 4.256e+02 5.020e+02 6.245e+02 1.096e+03, threshold=1.004e+03, percent-clipped=9.0 2023-06-18 11:38:12,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=199410.0, ans=0.2 2023-06-18 11:38:14,794 INFO [train.py:996] (2/4) Epoch 2, batch 2750, loss[loss=0.349, simple_loss=0.3907, pruned_loss=0.1536, over 21882.00 frames. ], tot_loss[loss=0.3174, simple_loss=0.3663, pruned_loss=0.1342, over 4271962.56 frames. ], batch size: 107, lr: 1.93e-02, grad_scale: 32.0 2023-06-18 11:38:26,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=199470.0, ans=0.0 2023-06-18 11:39:12,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=199650.0, ans=0.125 2023-06-18 11:39:20,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=199650.0, ans=0.07 2023-06-18 11:39:55,073 INFO [train.py:996] (2/4) Epoch 2, batch 2800, loss[loss=0.3127, simple_loss=0.3419, pruned_loss=0.1417, over 21729.00 frames. ], tot_loss[loss=0.3209, simple_loss=0.3704, pruned_loss=0.1357, over 4272729.39 frames. ], batch size: 124, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:40:01,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-18 11:40:09,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=199830.0, ans=0.0 2023-06-18 11:40:47,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=199890.0, ans=0.0 2023-06-18 11:40:56,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=199950.0, ans=0.1 2023-06-18 11:41:03,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=199950.0, ans=10.0 2023-06-18 11:41:25,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.594e+02 3.620e+02 4.325e+02 5.387e+02 9.118e+02, threshold=8.651e+02, percent-clipped=0.0 2023-06-18 11:41:33,777 INFO [train.py:996] (2/4) Epoch 2, batch 2850, loss[loss=0.2514, simple_loss=0.3058, pruned_loss=0.09851, over 21627.00 frames. ], tot_loss[loss=0.3252, simple_loss=0.3753, pruned_loss=0.1375, over 4276782.66 frames. 
], batch size: 230, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:42:13,651 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:42:18,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-18 11:42:33,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=200250.0, ans=0.0 2023-06-18 11:42:57,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=15.0 2023-06-18 11:43:10,722 INFO [train.py:996] (2/4) Epoch 2, batch 2900, loss[loss=0.2895, simple_loss=0.3404, pruned_loss=0.1193, over 21368.00 frames. ], tot_loss[loss=0.3201, simple_loss=0.37, pruned_loss=0.1351, over 4271188.31 frames. ], batch size: 176, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:43:25,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-18 11:43:57,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=200490.0, ans=0.125 2023-06-18 11:44:13,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=200550.0, ans=0.025 2023-06-18 11:44:38,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=200610.0, ans=0.125 2023-06-18 11:44:39,472 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.560e+02 4.023e+02 4.917e+02 6.862e+02 1.107e+03, threshold=9.834e+02, percent-clipped=8.0 2023-06-18 11:44:47,211 INFO [train.py:996] (2/4) Epoch 2, batch 2950, loss[loss=0.2911, simple_loss=0.3453, pruned_loss=0.1185, over 21637.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3735, pruned_loss=0.1364, over 4279148.29 frames. ], batch size: 263, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:44:52,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=200670.0, ans=0.125 2023-06-18 11:45:03,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=200730.0, ans=0.125 2023-06-18 11:45:54,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=200850.0, ans=0.1 2023-06-18 11:46:20,654 INFO [train.py:996] (2/4) Epoch 2, batch 3000, loss[loss=0.2564, simple_loss=0.3057, pruned_loss=0.1035, over 20008.00 frames. ], tot_loss[loss=0.3229, simple_loss=0.3748, pruned_loss=0.1355, over 4275372.47 frames. ], batch size: 702, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:46:20,655 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 11:46:36,283 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2851, simple_loss=0.377, pruned_loss=0.09657, over 1796401.00 frames. 
2023-06-18 11:46:36,284 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 11:47:09,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=201030.0, ans=0.125 2023-06-18 11:47:25,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=201090.0, ans=0.125 2023-06-18 11:48:07,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.149e+02 4.031e+02 5.261e+02 8.201e+02, threshold=8.061e+02, percent-clipped=0.0 2023-06-18 11:48:15,254 INFO [train.py:996] (2/4) Epoch 2, batch 3050, loss[loss=0.3744, simple_loss=0.4131, pruned_loss=0.1679, over 21765.00 frames. ], tot_loss[loss=0.322, simple_loss=0.3759, pruned_loss=0.1341, over 4273923.44 frames. ], batch size: 441, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:48:25,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=201270.0, ans=0.125 2023-06-18 11:48:37,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=201270.0, ans=0.125 2023-06-18 11:48:40,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=201330.0, ans=0.025 2023-06-18 11:48:55,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=201330.0, ans=0.125 2023-06-18 11:49:13,900 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:49:21,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=201450.0, ans=0.125 2023-06-18 11:49:42,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=201510.0, ans=0.0 2023-06-18 11:50:02,729 INFO [train.py:996] (2/4) Epoch 2, batch 3100, loss[loss=0.411, simple_loss=0.4533, pruned_loss=0.1843, over 21555.00 frames. ], tot_loss[loss=0.3203, simple_loss=0.3748, pruned_loss=0.1329, over 4273370.23 frames. ], batch size: 508, lr: 1.92e-02, grad_scale: 32.0 2023-06-18 11:50:09,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=201570.0, ans=0.125 2023-06-18 11:51:14,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=201810.0, ans=0.125 2023-06-18 11:51:16,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.07 vs. 
limit=15.0 2023-06-18 11:51:21,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=201810.0, ans=0.0 2023-06-18 11:51:31,174 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.312e+02 4.167e+02 4.991e+02 9.720e+02, threshold=8.334e+02, percent-clipped=2.0 2023-06-18 11:51:33,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=201810.0, ans=0.1 2023-06-18 11:51:35,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=201810.0, ans=0.1 2023-06-18 11:51:39,074 INFO [train.py:996] (2/4) Epoch 2, batch 3150, loss[loss=0.3853, simple_loss=0.4226, pruned_loss=0.174, over 21600.00 frames. ], tot_loss[loss=0.3195, simple_loss=0.3739, pruned_loss=0.1325, over 4273796.91 frames. ], batch size: 415, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:51:45,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=201870.0, ans=0.1 2023-06-18 11:52:18,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-18 11:53:02,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=202110.0, ans=0.0 2023-06-18 11:53:15,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=202110.0, ans=0.0 2023-06-18 11:53:22,384 INFO [train.py:996] (2/4) Epoch 2, batch 3200, loss[loss=0.2717, simple_loss=0.3479, pruned_loss=0.09777, over 21660.00 frames. ], tot_loss[loss=0.3201, simple_loss=0.3755, pruned_loss=0.1324, over 4280140.86 frames. ], batch size: 298, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:53:41,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=202230.0, ans=0.0 2023-06-18 11:53:53,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=202230.0, ans=8.0 2023-06-18 11:54:21,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=202350.0, ans=0.0 2023-06-18 11:54:34,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202350.0, ans=0.1 2023-06-18 11:54:51,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.651e+02 4.650e+02 5.913e+02 1.032e+03, threshold=9.300e+02, percent-clipped=10.0 2023-06-18 11:55:04,327 INFO [train.py:996] (2/4) Epoch 2, batch 3250, loss[loss=0.3606, simple_loss=0.4316, pruned_loss=0.1449, over 20914.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3782, pruned_loss=0.1355, over 4283075.74 frames. 
], batch size: 607, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:55:33,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=202530.0, ans=0.04949747468305833 2023-06-18 11:55:48,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=202590.0, ans=0.1 2023-06-18 11:55:48,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=15.0 2023-06-18 11:56:06,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=202650.0, ans=0.2 2023-06-18 11:56:38,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=202710.0, ans=0.1 2023-06-18 11:56:39,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-18 11:56:41,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=22.5 2023-06-18 11:56:43,245 INFO [train.py:996] (2/4) Epoch 2, batch 3300, loss[loss=0.2602, simple_loss=0.3227, pruned_loss=0.09886, over 21332.00 frames. ], tot_loss[loss=0.3249, simple_loss=0.3762, pruned_loss=0.1368, over 4283778.12 frames. ], batch size: 131, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:56:43,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=202770.0, ans=0.0 2023-06-18 11:56:51,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=202770.0, ans=0.125 2023-06-18 11:56:58,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=202830.0, ans=0.125 2023-06-18 11:57:09,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=202830.0, ans=0.2 2023-06-18 11:58:00,058 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:58:06,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=203010.0, ans=0.0 2023-06-18 11:58:08,739 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 3.845e+02 4.643e+02 5.721e+02 1.092e+03, threshold=9.285e+02, percent-clipped=5.0 2023-06-18 11:58:15,431 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 11:58:16,403 INFO [train.py:996] (2/4) Epoch 2, batch 3350, loss[loss=0.3158, simple_loss=0.3546, pruned_loss=0.1385, over 21240.00 frames. ], tot_loss[loss=0.3265, simple_loss=0.3794, pruned_loss=0.1368, over 4276652.53 frames. 
], batch size: 608, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 11:59:23,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=203250.0, ans=0.125 2023-06-18 11:59:50,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=203310.0, ans=0.0 2023-06-18 11:59:53,113 INFO [train.py:996] (2/4) Epoch 2, batch 3400, loss[loss=0.2674, simple_loss=0.343, pruned_loss=0.09593, over 16629.00 frames. ], tot_loss[loss=0.327, simple_loss=0.379, pruned_loss=0.1375, over 4278542.00 frames. ], batch size: 60, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 12:00:05,344 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.75 vs. limit=22.5 2023-06-18 12:00:06,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-18 12:00:13,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=203430.0, ans=0.125 2023-06-18 12:00:29,024 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-18 12:01:06,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=203550.0, ans=0.2 2023-06-18 12:01:09,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=12.0 2023-06-18 12:01:18,181 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.307e+02 4.139e+02 5.241e+02 1.031e+03, threshold=8.278e+02, percent-clipped=1.0 2023-06-18 12:01:18,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=203610.0, ans=0.0 2023-06-18 12:01:26,575 INFO [train.py:996] (2/4) Epoch 2, batch 3450, loss[loss=0.3897, simple_loss=0.4315, pruned_loss=0.174, over 21698.00 frames. ], tot_loss[loss=0.323, simple_loss=0.3735, pruned_loss=0.1363, over 4282782.17 frames. ], batch size: 332, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 12:02:07,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=203790.0, ans=0.04949747468305833 2023-06-18 12:02:31,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=203790.0, ans=0.95 2023-06-18 12:03:06,562 INFO [train.py:996] (2/4) Epoch 2, batch 3500, loss[loss=0.4344, simple_loss=0.4684, pruned_loss=0.2001, over 21469.00 frames. ], tot_loss[loss=0.3313, simple_loss=0.3818, pruned_loss=0.1404, over 4284988.54 frames. 
], batch size: 471, lr: 1.91e-02, grad_scale: 32.0 2023-06-18 12:03:52,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=204090.0, ans=0.0 2023-06-18 12:04:25,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=204150.0, ans=0.2 2023-06-18 12:04:35,404 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.505e+02 3.628e+02 4.522e+02 5.964e+02 1.068e+03, threshold=9.044e+02, percent-clipped=7.0 2023-06-18 12:04:47,806 INFO [train.py:996] (2/4) Epoch 2, batch 3550, loss[loss=0.2965, simple_loss=0.3505, pruned_loss=0.1212, over 21639.00 frames. ], tot_loss[loss=0.3337, simple_loss=0.3846, pruned_loss=0.1413, over 4283969.32 frames. ], batch size: 247, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:05:42,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=204390.0, ans=0.125 2023-06-18 12:05:46,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=204390.0, ans=0.2 2023-06-18 12:05:46,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=204390.0, ans=0.1 2023-06-18 12:05:49,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=204450.0, ans=0.0 2023-06-18 12:06:05,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=204510.0, ans=0.125 2023-06-18 12:06:07,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=204510.0, ans=0.04949747468305833 2023-06-18 12:06:26,630 INFO [train.py:996] (2/4) Epoch 2, batch 3600, loss[loss=0.3526, simple_loss=0.4063, pruned_loss=0.1495, over 21350.00 frames. ], tot_loss[loss=0.3288, simple_loss=0.3774, pruned_loss=0.1401, over 4280529.35 frames. ], batch size: 549, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:07:57,520 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 3.619e+02 4.242e+02 5.545e+02 1.042e+03, threshold=8.484e+02, percent-clipped=1.0 2023-06-18 12:08:05,238 INFO [train.py:996] (2/4) Epoch 2, batch 3650, loss[loss=0.2776, simple_loss=0.3413, pruned_loss=0.107, over 21600.00 frames. ], tot_loss[loss=0.3293, simple_loss=0.3783, pruned_loss=0.1402, over 4271759.23 frames. ], batch size: 230, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:08:54,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=204990.0, ans=0.125 2023-06-18 12:09:41,439 INFO [train.py:996] (2/4) Epoch 2, batch 3700, loss[loss=0.3506, simple_loss=0.3941, pruned_loss=0.1535, over 21400.00 frames. ], tot_loss[loss=0.3262, simple_loss=0.3758, pruned_loss=0.1383, over 4281496.52 frames. 
], batch size: 549, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:09:51,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=205170.0, ans=0.05 2023-06-18 12:09:57,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=205170.0, ans=0.125 2023-06-18 12:10:06,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=205230.0, ans=0.035 2023-06-18 12:10:06,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205230.0, ans=0.1 2023-06-18 12:10:29,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=205290.0, ans=0.0 2023-06-18 12:10:49,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=205350.0, ans=0.0 2023-06-18 12:11:09,714 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.219e+02 3.762e+02 4.567e+02 1.013e+03, threshold=7.524e+02, percent-clipped=2.0 2023-06-18 12:11:22,732 INFO [train.py:996] (2/4) Epoch 2, batch 3750, loss[loss=0.2479, simple_loss=0.3158, pruned_loss=0.08998, over 21785.00 frames. ], tot_loss[loss=0.3232, simple_loss=0.3727, pruned_loss=0.1369, over 4287824.50 frames. ], batch size: 282, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:11:24,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=205470.0, ans=0.125 2023-06-18 12:11:32,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=22.5 2023-06-18 12:12:23,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=205650.0, ans=0.125 2023-06-18 12:12:59,790 INFO [train.py:996] (2/4) Epoch 2, batch 3800, loss[loss=0.3779, simple_loss=0.4142, pruned_loss=0.1708, over 21684.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.3695, pruned_loss=0.1342, over 4277976.63 frames. ], batch size: 351, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:13:49,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=205890.0, ans=0.1 2023-06-18 12:13:59,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=205950.0, ans=0.0 2023-06-18 12:14:22,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=206010.0, ans=0.1 2023-06-18 12:14:28,923 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.435e+02 4.330e+02 5.504e+02 8.212e+02, threshold=8.659e+02, percent-clipped=4.0 2023-06-18 12:14:37,062 INFO [train.py:996] (2/4) Epoch 2, batch 3850, loss[loss=0.3998, simple_loss=0.4968, pruned_loss=0.1514, over 19777.00 frames. ], tot_loss[loss=0.3197, simple_loss=0.3693, pruned_loss=0.1351, over 4279216.26 frames. 
], batch size: 702, lr: 1.90e-02, grad_scale: 32.0 2023-06-18 12:15:00,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=206130.0, ans=0.125 2023-06-18 12:15:31,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=206250.0, ans=0.125 2023-06-18 12:16:13,536 INFO [train.py:996] (2/4) Epoch 2, batch 3900, loss[loss=0.3153, simple_loss=0.374, pruned_loss=0.1283, over 16744.00 frames. ], tot_loss[loss=0.3187, simple_loss=0.3667, pruned_loss=0.1353, over 4281108.26 frames. ], batch size: 60, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:16:32,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=206430.0, ans=0.125 2023-06-18 12:16:59,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=206490.0, ans=0.125 2023-06-18 12:17:40,197 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:17:42,970 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.558e+02 3.735e+02 4.662e+02 6.230e+02 1.205e+03, threshold=9.323e+02, percent-clipped=9.0 2023-06-18 12:17:50,625 INFO [train.py:996] (2/4) Epoch 2, batch 3950, loss[loss=0.299, simple_loss=0.3604, pruned_loss=0.1188, over 19899.00 frames. ], tot_loss[loss=0.3217, simple_loss=0.372, pruned_loss=0.1357, over 4279379.66 frames. ], batch size: 703, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:18:14,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=206730.0, ans=0.2 2023-06-18 12:18:26,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=206730.0, ans=0.0 2023-06-18 12:18:31,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=206790.0, ans=0.125 2023-06-18 12:18:53,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=206850.0, ans=0.0 2023-06-18 12:19:20,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=206910.0, ans=0.2 2023-06-18 12:19:27,502 INFO [train.py:996] (2/4) Epoch 2, batch 4000, loss[loss=0.2694, simple_loss=0.3161, pruned_loss=0.1114, over 21876.00 frames. ], tot_loss[loss=0.3101, simple_loss=0.3606, pruned_loss=0.1298, over 4272498.14 frames. ], batch size: 373, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:19:34,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. 
limit=15.0 2023-06-18 12:20:28,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=207150.0, ans=0.1 2023-06-18 12:20:44,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=207210.0, ans=0.125 2023-06-18 12:20:45,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=207210.0, ans=0.1 2023-06-18 12:20:50,974 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.512e+02 4.093e+02 5.285e+02 8.562e+02, threshold=8.187e+02, percent-clipped=0.0 2023-06-18 12:21:03,165 INFO [train.py:996] (2/4) Epoch 2, batch 4050, loss[loss=0.3896, simple_loss=0.4274, pruned_loss=0.1759, over 21527.00 frames. ], tot_loss[loss=0.309, simple_loss=0.3606, pruned_loss=0.1287, over 4263562.75 frames. ], batch size: 507, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:21:43,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-18 12:22:14,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.72 vs. limit=6.0 2023-06-18 12:22:15,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=207450.0, ans=0.04949747468305833 2023-06-18 12:22:36,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=207510.0, ans=0.2 2023-06-18 12:22:44,816 INFO [train.py:996] (2/4) Epoch 2, batch 4100, loss[loss=0.3294, simple_loss=0.356, pruned_loss=0.1514, over 21269.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3627, pruned_loss=0.1301, over 4265794.90 frames. ], batch size: 159, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:23:26,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=207690.0, ans=0.0 2023-06-18 12:23:41,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=207750.0, ans=0.0 2023-06-18 12:23:43,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=207750.0, ans=0.0 2023-06-18 12:24:04,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=207810.0, ans=0.0 2023-06-18 12:24:08,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 2.979e+02 3.498e+02 4.033e+02 7.129e+02, threshold=6.997e+02, percent-clipped=0.0 2023-06-18 12:24:20,659 INFO [train.py:996] (2/4) Epoch 2, batch 4150, loss[loss=0.3546, simple_loss=0.392, pruned_loss=0.1586, over 21421.00 frames. ], tot_loss[loss=0.3057, simple_loss=0.3615, pruned_loss=0.125, over 4274909.80 frames. ], batch size: 508, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:24:55,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=207990.0, ans=0.2 2023-06-18 12:25:23,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=8.0 2023-06-18 12:25:59,009 INFO [train.py:996] (2/4) Epoch 2, batch 4200, loss[loss=0.2855, simple_loss=0.3318, pruned_loss=0.1196, over 21222.00 frames. 
], tot_loss[loss=0.303, simple_loss=0.359, pruned_loss=0.1235, over 4273404.31 frames. ], batch size: 159, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:26:03,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=208170.0, ans=0.125 2023-06-18 12:26:14,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=208230.0, ans=0.1 2023-06-18 12:26:54,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=208290.0, ans=0.0 2023-06-18 12:27:22,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=208410.0, ans=0.125 2023-06-18 12:27:26,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=208410.0, ans=0.1 2023-06-18 12:27:31,462 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.591e+02 4.541e+02 5.629e+02 1.049e+03, threshold=9.081e+02, percent-clipped=10.0 2023-06-18 12:27:32,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. limit=15.0 2023-06-18 12:27:38,048 INFO [train.py:996] (2/4) Epoch 2, batch 4250, loss[loss=0.3856, simple_loss=0.4167, pruned_loss=0.1773, over 21340.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3702, pruned_loss=0.1279, over 4270795.01 frames. ], batch size: 548, lr: 1.89e-02, grad_scale: 32.0 2023-06-18 12:28:26,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=208590.0, ans=15.0 2023-06-18 12:29:12,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=208710.0, ans=0.125 2023-06-18 12:29:25,644 INFO [train.py:996] (2/4) Epoch 2, batch 4300, loss[loss=0.2904, simple_loss=0.3553, pruned_loss=0.1128, over 21442.00 frames. ], tot_loss[loss=0.3195, simple_loss=0.3764, pruned_loss=0.1314, over 4271404.99 frames. ], batch size: 211, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:29:34,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-18 12:30:17,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=208890.0, ans=0.2 2023-06-18 12:30:35,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=208950.0, ans=0.125 2023-06-18 12:31:04,002 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.525e+02 3.305e+02 3.923e+02 4.922e+02 1.064e+03, threshold=7.846e+02, percent-clipped=2.0 2023-06-18 12:31:10,115 INFO [train.py:996] (2/4) Epoch 2, batch 4350, loss[loss=0.3214, simple_loss=0.348, pruned_loss=0.1474, over 21362.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3733, pruned_loss=0.1299, over 4263511.06 frames. 
], batch size: 177, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:31:29,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=209130.0, ans=0.125 2023-06-18 12:31:54,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=209190.0, ans=0.125 2023-06-18 12:31:54,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=209190.0, ans=0.125 2023-06-18 12:32:23,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=209250.0, ans=0.1 2023-06-18 12:32:47,836 INFO [train.py:996] (2/4) Epoch 2, batch 4400, loss[loss=0.2872, simple_loss=0.3601, pruned_loss=0.1071, over 21705.00 frames. ], tot_loss[loss=0.3155, simple_loss=0.3702, pruned_loss=0.1303, over 4265889.71 frames. ], batch size: 298, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:32:51,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=209370.0, ans=0.125 2023-06-18 12:32:56,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=209370.0, ans=0.1 2023-06-18 12:33:26,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.44 vs. limit=15.0 2023-06-18 12:33:30,219 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.15 vs. limit=8.0 2023-06-18 12:33:54,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=209550.0, ans=0.0 2023-06-18 12:34:18,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=209610.0, ans=0.0 2023-06-18 12:34:19,649 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.653e+02 4.609e+02 5.842e+02 1.096e+03, threshold=9.217e+02, percent-clipped=5.0 2023-06-18 12:34:21,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=209610.0, ans=0.125 2023-06-18 12:34:21,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=209610.0, ans=0.0 2023-06-18 12:34:30,777 INFO [train.py:996] (2/4) Epoch 2, batch 4450, loss[loss=0.364, simple_loss=0.433, pruned_loss=0.1475, over 21839.00 frames. ], tot_loss[loss=0.3175, simple_loss=0.3741, pruned_loss=0.1304, over 4267387.18 frames. 
], batch size: 316, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:34:37,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=209670.0, ans=0.125 2023-06-18 12:34:40,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=209670.0, ans=0.125 2023-06-18 12:34:50,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=209730.0, ans=0.2 2023-06-18 12:34:53,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=209730.0, ans=0.1 2023-06-18 12:35:37,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=209850.0, ans=0.125 2023-06-18 12:35:40,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=209850.0, ans=0.2 2023-06-18 12:35:57,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=209910.0, ans=0.125 2023-06-18 12:36:06,140 INFO [train.py:996] (2/4) Epoch 2, batch 4500, loss[loss=0.3077, simple_loss=0.3603, pruned_loss=0.1275, over 21890.00 frames. ], tot_loss[loss=0.3214, simple_loss=0.377, pruned_loss=0.1329, over 4270645.63 frames. ], batch size: 118, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:36:30,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=210030.0, ans=0.125 2023-06-18 12:36:34,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=210030.0, ans=0.0 2023-06-18 12:36:45,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=210090.0, ans=0.2 2023-06-18 12:37:37,916 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.555e+02 4.065e+02 4.925e+02 8.814e+02, threshold=8.131e+02, percent-clipped=0.0 2023-06-18 12:37:48,821 INFO [train.py:996] (2/4) Epoch 2, batch 4550, loss[loss=0.3444, simple_loss=0.3976, pruned_loss=0.1457, over 21740.00 frames. ], tot_loss[loss=0.3236, simple_loss=0.3801, pruned_loss=0.1335, over 4278900.95 frames. ], batch size: 298, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:39:25,118 INFO [train.py:996] (2/4) Epoch 2, batch 4600, loss[loss=0.2873, simple_loss=0.3457, pruned_loss=0.1145, over 21511.00 frames. ], tot_loss[loss=0.3275, simple_loss=0.3836, pruned_loss=0.1357, over 4281226.80 frames. ], batch size: 195, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:39:27,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. 
limit=22.5 2023-06-18 12:39:32,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=210570.0, ans=0.09899494936611666 2023-06-18 12:39:40,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=210630.0, ans=0.1 2023-06-18 12:39:44,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=210630.0, ans=0.1 2023-06-18 12:40:04,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=210630.0, ans=0.0 2023-06-18 12:40:12,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=210690.0, ans=0.0 2023-06-18 12:40:27,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=210750.0, ans=0.125 2023-06-18 12:40:29,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=210750.0, ans=0.2 2023-06-18 12:40:35,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=210750.0, ans=0.125 2023-06-18 12:40:55,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=210810.0, ans=0.125 2023-06-18 12:40:56,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 3.566e+02 4.300e+02 5.335e+02 1.700e+03, threshold=8.600e+02, percent-clipped=8.0 2023-06-18 12:41:02,507 INFO [train.py:996] (2/4) Epoch 2, batch 4650, loss[loss=0.1619, simple_loss=0.2201, pruned_loss=0.0519, over 16181.00 frames. ], tot_loss[loss=0.3231, simple_loss=0.3798, pruned_loss=0.1332, over 4274546.59 frames. ], batch size: 60, lr: 1.88e-02, grad_scale: 32.0 2023-06-18 12:41:05,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.09 vs. 
limit=15.0 2023-06-18 12:41:06,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=210870.0, ans=0.0 2023-06-18 12:41:15,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=210870.0, ans=0.125 2023-06-18 12:41:29,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=210930.0, ans=0.125 2023-06-18 12:41:34,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=210930.0, ans=0.125 2023-06-18 12:41:42,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=210990.0, ans=0.0 2023-06-18 12:41:43,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=210990.0, ans=0.125 2023-06-18 12:41:54,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=211050.0, ans=0.0 2023-06-18 12:42:07,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=211050.0, ans=0.125 2023-06-18 12:42:21,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-18 12:42:32,563 INFO [train.py:996] (2/4) Epoch 2, batch 4700, loss[loss=0.2552, simple_loss=0.3093, pruned_loss=0.1005, over 21406.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3667, pruned_loss=0.1285, over 4274121.22 frames. ], batch size: 131, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:42:48,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-18 12:42:51,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=15.0 2023-06-18 12:43:18,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=211290.0, ans=0.125 2023-06-18 12:44:00,735 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.242e+02 4.193e+02 5.636e+02 1.011e+03, threshold=8.385e+02, percent-clipped=2.0 2023-06-18 12:44:06,713 INFO [train.py:996] (2/4) Epoch 2, batch 4750, loss[loss=0.2959, simple_loss=0.3375, pruned_loss=0.1271, over 21655.00 frames. ], tot_loss[loss=0.3097, simple_loss=0.3617, pruned_loss=0.1288, over 4281329.62 frames. ], batch size: 230, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:44:10,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=211470.0, ans=0.1 2023-06-18 12:45:12,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.57 vs. 
limit=22.5 2023-06-18 12:45:15,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=211650.0, ans=0.1 2023-06-18 12:45:30,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=211710.0, ans=0.125 2023-06-18 12:45:34,295 INFO [train.py:996] (2/4) Epoch 2, batch 4800, loss[loss=0.2831, simple_loss=0.3441, pruned_loss=0.1111, over 21533.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3639, pruned_loss=0.13, over 4280526.06 frames. ], batch size: 230, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:45:39,333 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:46:29,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=211950.0, ans=0.2 2023-06-18 12:47:01,229 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.590e+02 4.523e+02 5.544e+02 1.095e+03, threshold=9.046e+02, percent-clipped=1.0 2023-06-18 12:47:07,536 INFO [train.py:996] (2/4) Epoch 2, batch 4850, loss[loss=0.375, simple_loss=0.4137, pruned_loss=0.1682, over 21372.00 frames. ], tot_loss[loss=0.3114, simple_loss=0.3634, pruned_loss=0.1297, over 4276029.20 frames. ], batch size: 507, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:47:35,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=212130.0, ans=0.125 2023-06-18 12:47:37,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=212130.0, ans=0.125 2023-06-18 12:48:29,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=212310.0, ans=0.0 2023-06-18 12:48:35,065 INFO [train.py:996] (2/4) Epoch 2, batch 4900, loss[loss=0.2975, simple_loss=0.3758, pruned_loss=0.1096, over 21582.00 frames. ], tot_loss[loss=0.3152, simple_loss=0.3662, pruned_loss=0.1321, over 4281358.02 frames. ], batch size: 230, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:48:48,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=212370.0, ans=0.125 2023-06-18 12:49:18,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=212490.0, ans=0.1 2023-06-18 12:50:06,443 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 3.496e+02 4.489e+02 5.539e+02 1.137e+03, threshold=8.978e+02, percent-clipped=3.0 2023-06-18 12:50:13,010 INFO [train.py:996] (2/4) Epoch 2, batch 4950, loss[loss=0.3316, simple_loss=0.4038, pruned_loss=0.1297, over 21450.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.3696, pruned_loss=0.1295, over 4280221.74 frames. ], batch size: 507, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:50:18,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=212670.0, ans=0.125 2023-06-18 12:50:36,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.48 vs. 
limit=6.0 2023-06-18 12:50:39,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212730.0, ans=0.1 2023-06-18 12:50:57,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=212790.0, ans=0.1 2023-06-18 12:51:11,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=212850.0, ans=0.125 2023-06-18 12:51:13,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=212850.0, ans=0.125 2023-06-18 12:51:46,442 INFO [train.py:996] (2/4) Epoch 2, batch 5000, loss[loss=0.2861, simple_loss=0.3501, pruned_loss=0.111, over 21403.00 frames. ], tot_loss[loss=0.3093, simple_loss=0.3678, pruned_loss=0.1254, over 4283376.77 frames. ], batch size: 131, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:52:10,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.07 vs. limit=12.0 2023-06-18 12:52:38,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=213150.0, ans=0.0 2023-06-18 12:53:06,330 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.969e+02 3.666e+02 4.897e+02 8.510e+02, threshold=7.332e+02, percent-clipped=0.0 2023-06-18 12:53:12,751 INFO [train.py:996] (2/4) Epoch 2, batch 5050, loss[loss=0.3222, simple_loss=0.374, pruned_loss=0.1352, over 21851.00 frames. ], tot_loss[loss=0.311, simple_loss=0.3669, pruned_loss=0.1276, over 4286951.26 frames. ], batch size: 118, lr: 1.87e-02, grad_scale: 32.0 2023-06-18 12:54:43,421 INFO [train.py:996] (2/4) Epoch 2, batch 5100, loss[loss=0.2651, simple_loss=0.3322, pruned_loss=0.09902, over 21803.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3649, pruned_loss=0.1287, over 4287120.54 frames. ], batch size: 298, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:54:53,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.58 vs. limit=22.5 2023-06-18 12:55:32,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=213690.0, ans=0.0 2023-06-18 12:55:53,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=213750.0, ans=0.0 2023-06-18 12:55:57,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.03 vs. limit=15.0 2023-06-18 12:56:12,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=213810.0, ans=0.125 2023-06-18 12:56:13,630 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.391e+02 3.389e+02 4.046e+02 5.054e+02 9.083e+02, threshold=8.093e+02, percent-clipped=6.0 2023-06-18 12:56:19,877 INFO [train.py:996] (2/4) Epoch 2, batch 5150, loss[loss=0.3866, simple_loss=0.4204, pruned_loss=0.1764, over 21562.00 frames. ], tot_loss[loss=0.3116, simple_loss=0.3643, pruned_loss=0.1294, over 4284351.68 frames. 
], batch size: 471, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:56:38,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=213930.0, ans=0.2 2023-06-18 12:57:23,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=214050.0, ans=0.1 2023-06-18 12:57:36,606 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 12:57:55,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=12.0 2023-06-18 12:57:56,090 INFO [train.py:996] (2/4) Epoch 2, batch 5200, loss[loss=0.3525, simple_loss=0.4326, pruned_loss=0.1362, over 21223.00 frames. ], tot_loss[loss=0.3134, simple_loss=0.3667, pruned_loss=0.13, over 4283473.54 frames. ], batch size: 548, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:57:57,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-18 12:59:23,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=214410.0, ans=0.2 2023-06-18 12:59:25,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=15.0 2023-06-18 12:59:25,785 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.610e+02 4.791e+02 6.505e+02 1.223e+03, threshold=9.582e+02, percent-clipped=11.0 2023-06-18 12:59:32,066 INFO [train.py:996] (2/4) Epoch 2, batch 5250, loss[loss=0.3742, simple_loss=0.4186, pruned_loss=0.165, over 21718.00 frames. ], tot_loss[loss=0.3111, simple_loss=0.3689, pruned_loss=0.1266, over 4281680.39 frames. ], batch size: 441, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 12:59:53,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=214470.0, ans=0.125 2023-06-18 13:00:08,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=214530.0, ans=0.125 2023-06-18 13:00:42,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=214650.0, ans=0.0 2023-06-18 13:01:12,303 INFO [train.py:996] (2/4) Epoch 2, batch 5300, loss[loss=0.3609, simple_loss=0.3919, pruned_loss=0.165, over 21766.00 frames. ], tot_loss[loss=0.3129, simple_loss=0.3688, pruned_loss=0.1285, over 4283637.34 frames. ], batch size: 441, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:02:19,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=214950.0, ans=15.0 2023-06-18 13:02:26,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=215010.0, ans=0.1 2023-06-18 13:02:35,407 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 3.046e+02 3.546e+02 4.539e+02 8.571e+02, threshold=7.092e+02, percent-clipped=0.0 2023-06-18 13:02:41,316 INFO [train.py:996] (2/4) Epoch 2, batch 5350, loss[loss=0.2901, simple_loss=0.3353, pruned_loss=0.1225, over 21859.00 frames. 
], tot_loss[loss=0.3151, simple_loss=0.3682, pruned_loss=0.131, over 4285377.32 frames. ], batch size: 124, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:03:16,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=215130.0, ans=0.1 2023-06-18 13:03:19,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=215130.0, ans=0.125 2023-06-18 13:03:48,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215250.0, ans=0.1 2023-06-18 13:04:04,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=215310.0, ans=0.0 2023-06-18 13:04:16,996 INFO [train.py:996] (2/4) Epoch 2, batch 5400, loss[loss=0.3045, simple_loss=0.3518, pruned_loss=0.1285, over 21480.00 frames. ], tot_loss[loss=0.3191, simple_loss=0.3702, pruned_loss=0.134, over 4285113.25 frames. ], batch size: 131, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:04:23,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-18 13:04:39,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=215370.0, ans=0.125 2023-06-18 13:05:00,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=15.0 2023-06-18 13:05:09,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215490.0, ans=0.1 2023-06-18 13:05:16,376 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-18 13:05:58,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.179e+02 4.117e+02 5.254e+02 8.433e+02, threshold=8.234e+02, percent-clipped=2.0 2023-06-18 13:06:03,211 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:06:04,240 INFO [train.py:996] (2/4) Epoch 2, batch 5450, loss[loss=0.414, simple_loss=0.5249, pruned_loss=0.1515, over 19687.00 frames. ], tot_loss[loss=0.3172, simple_loss=0.3718, pruned_loss=0.1313, over 4288355.61 frames. ], batch size: 702, lr: 1.86e-02, grad_scale: 32.0 2023-06-18 13:06:58,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=215850.0, ans=0.125 2023-06-18 13:07:00,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=215850.0, ans=0.1 2023-06-18 13:07:31,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=215910.0, ans=0.5 2023-06-18 13:07:36,810 INFO [train.py:996] (2/4) Epoch 2, batch 5500, loss[loss=0.3266, simple_loss=0.4072, pruned_loss=0.1231, over 21660.00 frames. ], tot_loss[loss=0.3157, simple_loss=0.3766, pruned_loss=0.1274, over 4287666.16 frames. 
], batch size: 389, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:08:07,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=216030.0, ans=0.0 2023-06-18 13:09:13,844 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 3.111e+02 3.765e+02 4.593e+02 1.085e+03, threshold=7.530e+02, percent-clipped=3.0 2023-06-18 13:09:20,402 INFO [train.py:996] (2/4) Epoch 2, batch 5550, loss[loss=0.314, simple_loss=0.3836, pruned_loss=0.1222, over 21670.00 frames. ], tot_loss[loss=0.3097, simple_loss=0.3725, pruned_loss=0.1235, over 4284146.54 frames. ], batch size: 298, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:09:21,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-18 13:09:49,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=216330.0, ans=0.07 2023-06-18 13:10:45,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-18 13:11:01,305 INFO [train.py:996] (2/4) Epoch 2, batch 5600, loss[loss=0.3736, simple_loss=0.4472, pruned_loss=0.15, over 21645.00 frames. ], tot_loss[loss=0.3039, simple_loss=0.3694, pruned_loss=0.1192, over 4280940.40 frames. ], batch size: 441, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:11:12,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=216570.0, ans=0.0 2023-06-18 13:11:45,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=216690.0, ans=0.125 2023-06-18 13:11:48,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=216690.0, ans=0.0 2023-06-18 13:11:57,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=216750.0, ans=0.2 2023-06-18 13:12:17,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=15.0 2023-06-18 13:12:25,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 3.142e+02 3.780e+02 5.340e+02 1.337e+03, threshold=7.560e+02, percent-clipped=11.0 2023-06-18 13:12:36,238 INFO [train.py:996] (2/4) Epoch 2, batch 5650, loss[loss=0.357, simple_loss=0.3925, pruned_loss=0.1607, over 21851.00 frames. ], tot_loss[loss=0.309, simple_loss=0.3728, pruned_loss=0.1226, over 4288499.40 frames. ], batch size: 371, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:12:38,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=216870.0, ans=0.2 2023-06-18 13:12:58,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=216930.0, ans=0.125 2023-06-18 13:13:00,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-18 13:13:00,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. 
limit=15.0 2023-06-18 13:13:13,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=216990.0, ans=0.1 2023-06-18 13:13:53,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=217110.0, ans=0.0 2023-06-18 13:14:11,987 INFO [train.py:996] (2/4) Epoch 2, batch 5700, loss[loss=0.3796, simple_loss=0.4542, pruned_loss=0.1525, over 20796.00 frames. ], tot_loss[loss=0.3117, simple_loss=0.3728, pruned_loss=0.1253, over 4286688.42 frames. ], batch size: 608, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:14:17,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=217170.0, ans=0.125 2023-06-18 13:14:19,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=217170.0, ans=0.125 2023-06-18 13:14:24,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=217170.0, ans=0.2 2023-06-18 13:14:25,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=217170.0, ans=0.125 2023-06-18 13:14:44,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=217230.0, ans=0.125 2023-06-18 13:15:43,849 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.062e+02 3.148e+02 3.823e+02 4.974e+02 1.006e+03, threshold=7.646e+02, percent-clipped=5.0 2023-06-18 13:15:49,887 INFO [train.py:996] (2/4) Epoch 2, batch 5750, loss[loss=0.3242, simple_loss=0.3914, pruned_loss=0.1285, over 21482.00 frames. ], tot_loss[loss=0.3063, simple_loss=0.3694, pruned_loss=0.1216, over 4288038.31 frames. ], batch size: 508, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:16:16,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=217530.0, ans=0.1 2023-06-18 13:16:39,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=217590.0, ans=0.125 2023-06-18 13:17:07,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=217650.0, ans=0.1 2023-06-18 13:17:07,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=15.0 2023-06-18 13:17:23,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=217710.0, ans=0.125 2023-06-18 13:17:41,292 INFO [train.py:996] (2/4) Epoch 2, batch 5800, loss[loss=0.3211, simple_loss=0.3975, pruned_loss=0.1224, over 21607.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3656, pruned_loss=0.1176, over 4281792.14 frames. ], batch size: 441, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:19:13,432 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.714e+02 3.013e+02 4.071e+02 4.851e+02 8.760e+02, threshold=8.142e+02, percent-clipped=2.0 2023-06-18 13:19:19,688 INFO [train.py:996] (2/4) Epoch 2, batch 5850, loss[loss=0.2472, simple_loss=0.3462, pruned_loss=0.07406, over 21770.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3627, pruned_loss=0.1128, over 4285933.06 frames. 
], batch size: 282, lr: 1.85e-02, grad_scale: 32.0 2023-06-18 13:19:42,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=218130.0, ans=0.125 2023-06-18 13:20:26,052 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 13:20:46,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=218310.0, ans=10.0 2023-06-18 13:20:49,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=218370.0, ans=0.1 2023-06-18 13:20:50,779 INFO [train.py:996] (2/4) Epoch 2, batch 5900, loss[loss=0.2817, simple_loss=0.3313, pruned_loss=0.1161, over 21206.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3541, pruned_loss=0.1052, over 4286251.00 frames. ], batch size: 143, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:21:55,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=218550.0, ans=0.0 2023-06-18 13:22:19,402 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 3.163e+02 4.084e+02 5.462e+02 1.507e+03, threshold=8.168e+02, percent-clipped=5.0 2023-06-18 13:22:25,397 INFO [train.py:996] (2/4) Epoch 2, batch 5950, loss[loss=0.2686, simple_loss=0.32, pruned_loss=0.1086, over 21565.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3555, pruned_loss=0.1111, over 4288168.49 frames. ], batch size: 195, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:23:00,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=218730.0, ans=0.0 2023-06-18 13:23:22,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=218850.0, ans=0.1 2023-06-18 13:23:44,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=218910.0, ans=0.125 2023-06-18 13:23:44,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=218910.0, ans=0.125 2023-06-18 13:23:59,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2023-06-18 13:24:04,255 INFO [train.py:996] (2/4) Epoch 2, batch 6000, loss[loss=0.2992, simple_loss=0.3348, pruned_loss=0.1318, over 21763.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3525, pruned_loss=0.1154, over 4291994.54 frames. ], batch size: 371, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:24:04,255 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 13:24:20,122 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2916, simple_loss=0.3878, pruned_loss=0.09771, over 1796401.00 frames. 
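The "Computing validation loss" entries above correspond to a periodic pass over the held-out dev set; in this run it happens every few thousand training batches (for example at Epoch 2, batch 6000 here and again at batch 9000 further down). The lines that follow are only a minimal, illustrative sketch of such a pass, not the project's actual implementation; the helper names (run_validation, compute_loss, valid_dl) are assumptions introduced purely for this example.

import torch

def run_validation(model, valid_dl, compute_loss, device="cuda"):
    # Illustrative sketch only: average the loss over the whole dev set
    # without updating parameters, then switch back to training mode so the
    # next "train" batches in the log can proceed as before.
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in valid_dl:
            loss, num_frames = compute_loss(model, batch, device=device)  # assumed helper
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    return tot_loss / max(tot_frames, 1.0)

The figures reported in the log ("loss=..., over N frames") are of this per-frame form: an accumulated loss normalized by the number of acoustic frames it was computed over, which is why the validation entry quotes both the averaged losses and the total frame count.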
2023-06-18 13:24:20,123 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 13:24:43,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=219030.0, ans=0.125 2023-06-18 13:24:54,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=219090.0, ans=0.1 2023-06-18 13:25:24,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=219150.0, ans=0.0 2023-06-18 13:25:35,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.17 vs. limit=12.0 2023-06-18 13:25:44,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=219210.0, ans=0.125 2023-06-18 13:25:51,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.782e+02 3.962e+02 4.700e+02 6.169e+02 1.115e+03, threshold=9.400e+02, percent-clipped=12.0 2023-06-18 13:25:57,793 INFO [train.py:996] (2/4) Epoch 2, batch 6050, loss[loss=0.2518, simple_loss=0.3155, pruned_loss=0.09408, over 21602.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3483, pruned_loss=0.1175, over 4282070.40 frames. ], batch size: 391, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:26:32,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=219390.0, ans=0.125 2023-06-18 13:27:33,005 INFO [train.py:996] (2/4) Epoch 2, batch 6100, loss[loss=0.2538, simple_loss=0.2863, pruned_loss=0.1106, over 20030.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3456, pruned_loss=0.1151, over 4281133.17 frames. ], batch size: 703, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:28:06,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=219690.0, ans=0.0 2023-06-18 13:29:03,476 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.959e+02 3.646e+02 4.733e+02 1.048e+03, threshold=7.291e+02, percent-clipped=1.0 2023-06-18 13:29:09,576 INFO [train.py:996] (2/4) Epoch 2, batch 6150, loss[loss=0.2552, simple_loss=0.3185, pruned_loss=0.09599, over 21614.00 frames. ], tot_loss[loss=0.2945, simple_loss=0.3503, pruned_loss=0.1193, over 4285608.97 frames. ], batch size: 230, lr: 1.84e-02, grad_scale: 64.0 2023-06-18 13:29:19,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=219870.0, ans=0.2 2023-06-18 13:29:28,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=219930.0, ans=0.125 2023-06-18 13:30:48,843 INFO [train.py:996] (2/4) Epoch 2, batch 6200, loss[loss=0.3113, simple_loss=0.3972, pruned_loss=0.1127, over 19888.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.354, pruned_loss=0.1192, over 4277445.74 frames. ], batch size: 702, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:32:20,922 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.171e+02 4.018e+02 5.849e+02 1.001e+03, threshold=8.035e+02, percent-clipped=11.0 2023-06-18 13:32:25,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. 
limit=6.0 2023-06-18 13:32:25,563 INFO [train.py:996] (2/4) Epoch 2, batch 6250, loss[loss=0.298, simple_loss=0.3817, pruned_loss=0.1072, over 21623.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3578, pruned_loss=0.1177, over 4278035.50 frames. ], batch size: 263, lr: 1.84e-02, grad_scale: 32.0 2023-06-18 13:32:49,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=220530.0, ans=0.125 2023-06-18 13:32:50,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=220530.0, ans=0.125 2023-06-18 13:32:57,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=220530.0, ans=0.125 2023-06-18 13:32:59,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=220530.0, ans=0.2 2023-06-18 13:33:29,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=220650.0, ans=0.0 2023-06-18 13:33:59,283 INFO [train.py:996] (2/4) Epoch 2, batch 6300, loss[loss=0.3199, simple_loss=0.3595, pruned_loss=0.1401, over 21582.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3604, pruned_loss=0.1159, over 4282553.17 frames. ], batch size: 548, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:34:28,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=220830.0, ans=0.2 2023-06-18 13:34:30,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-18 13:35:23,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=221010.0, ans=0.125 2023-06-18 13:35:30,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.155e+02 3.817e+02 5.474e+02 1.365e+03, threshold=7.634e+02, percent-clipped=9.0 2023-06-18 13:35:35,472 INFO [train.py:996] (2/4) Epoch 2, batch 6350, loss[loss=0.3149, simple_loss=0.3801, pruned_loss=0.1249, over 21462.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3668, pruned_loss=0.1217, over 4281814.84 frames. ], batch size: 194, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:36:04,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=221130.0, ans=0.0 2023-06-18 13:37:02,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=221310.0, ans=0.5 2023-06-18 13:37:17,743 INFO [train.py:996] (2/4) Epoch 2, batch 6400, loss[loss=0.3731, simple_loss=0.4155, pruned_loss=0.1654, over 21841.00 frames. ], tot_loss[loss=0.3139, simple_loss=0.3726, pruned_loss=0.1276, over 4285806.02 frames. 
], batch size: 124, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:37:41,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=221430.0, ans=0.0 2023-06-18 13:37:51,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=221430.0, ans=0.125 2023-06-18 13:38:53,765 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.475e+02 3.333e+02 3.952e+02 5.090e+02 9.873e+02, threshold=7.903e+02, percent-clipped=3.0 2023-06-18 13:38:58,469 INFO [train.py:996] (2/4) Epoch 2, batch 6450, loss[loss=0.2807, simple_loss=0.3498, pruned_loss=0.1058, over 21724.00 frames. ], tot_loss[loss=0.3151, simple_loss=0.3755, pruned_loss=0.1273, over 4290144.06 frames. ], batch size: 332, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:39:04,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=221670.0, ans=0.125 2023-06-18 13:39:26,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=221730.0, ans=0.125 2023-06-18 13:39:34,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=221790.0, ans=0.0 2023-06-18 13:39:57,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=221850.0, ans=0.125 2023-06-18 13:40:35,150 INFO [train.py:996] (2/4) Epoch 2, batch 6500, loss[loss=0.2211, simple_loss=0.2789, pruned_loss=0.08162, over 14950.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3663, pruned_loss=0.1252, over 4285994.43 frames. ], batch size: 61, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:40:47,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=221970.0, ans=0.0 2023-06-18 13:40:55,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=222030.0, ans=0.0 2023-06-18 13:41:16,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=222090.0, ans=0.125 2023-06-18 13:41:33,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222150.0, ans=0.1 2023-06-18 13:41:40,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.82 vs. limit=5.0 2023-06-18 13:42:06,231 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.480e+02 3.085e+02 3.485e+02 4.361e+02 6.672e+02, threshold=6.971e+02, percent-clipped=0.0 2023-06-18 13:42:09,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=222270.0, ans=0.125 2023-06-18 13:42:10,643 INFO [train.py:996] (2/4) Epoch 2, batch 6550, loss[loss=0.2606, simple_loss=0.3361, pruned_loss=0.09253, over 21761.00 frames. ], tot_loss[loss=0.3059, simple_loss=0.3638, pruned_loss=0.124, over 4273611.71 frames. ], batch size: 298, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:42:50,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. 
limit=15.0 2023-06-18 13:42:52,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=222390.0, ans=0.125 2023-06-18 13:43:02,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=222390.0, ans=0.125 2023-06-18 13:43:08,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=222450.0, ans=0.07 2023-06-18 13:43:48,135 INFO [train.py:996] (2/4) Epoch 2, batch 6600, loss[loss=0.2486, simple_loss=0.2988, pruned_loss=0.09916, over 21150.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3585, pruned_loss=0.124, over 4278891.80 frames. ], batch size: 159, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:44:07,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=222630.0, ans=0.125 2023-06-18 13:44:37,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=222690.0, ans=0.5 2023-06-18 13:45:19,126 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.099e+02 3.990e+02 5.465e+02 1.147e+03, threshold=7.980e+02, percent-clipped=13.0 2023-06-18 13:45:27,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-18 13:45:28,383 INFO [train.py:996] (2/4) Epoch 2, batch 6650, loss[loss=0.2847, simple_loss=0.3285, pruned_loss=0.1205, over 21686.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3502, pruned_loss=0.1198, over 4272590.40 frames. ], batch size: 299, lr: 1.83e-02, grad_scale: 32.0 2023-06-18 13:45:39,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=222870.0, ans=0.1 2023-06-18 13:45:41,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=222870.0, ans=22.5 2023-06-18 13:45:44,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=222930.0, ans=0.125 2023-06-18 13:46:47,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.35 vs. limit=12.0 2023-06-18 13:47:06,262 INFO [train.py:996] (2/4) Epoch 2, batch 6700, loss[loss=0.3044, simple_loss=0.3371, pruned_loss=0.1358, over 21771.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3471, pruned_loss=0.1207, over 4262940.52 frames. ], batch size: 317, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:47:23,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=223230.0, ans=0.1 2023-06-18 13:48:22,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=223410.0, ans=0.125 2023-06-18 13:48:32,193 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.773e+02 4.498e+02 5.331e+02 9.291e+02, threshold=8.996e+02, percent-clipped=2.0 2023-06-18 13:48:41,535 INFO [train.py:996] (2/4) Epoch 2, batch 6750, loss[loss=0.2817, simple_loss=0.3213, pruned_loss=0.1211, over 21442.00 frames. ], tot_loss[loss=0.2944, simple_loss=0.3464, pruned_loss=0.1212, over 4252484.77 frames. 
], batch size: 212, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:48:56,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-18 13:49:00,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=5.16 vs. limit=6.0 2023-06-18 13:49:15,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=223590.0, ans=0.2 2023-06-18 13:49:27,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=223590.0, ans=0.125 2023-06-18 13:49:57,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=223710.0, ans=0.0 2023-06-18 13:50:05,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=223710.0, ans=0.125 2023-06-18 13:50:17,682 INFO [train.py:996] (2/4) Epoch 2, batch 6800, loss[loss=0.2967, simple_loss=0.3389, pruned_loss=0.1272, over 21245.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.35, pruned_loss=0.1259, over 4256200.17 frames. ], batch size: 176, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:51:39,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=224010.0, ans=0.2 2023-06-18 13:51:43,145 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 3.005e+02 3.766e+02 4.478e+02 7.220e+02, threshold=7.533e+02, percent-clipped=0.0 2023-06-18 13:51:52,463 INFO [train.py:996] (2/4) Epoch 2, batch 6850, loss[loss=0.2879, simple_loss=0.3274, pruned_loss=0.1242, over 21508.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.349, pruned_loss=0.1279, over 4266084.83 frames. ], batch size: 548, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:52:01,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.59 vs. limit=6.0 2023-06-18 13:52:21,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.08 vs. limit=22.5 2023-06-18 13:53:23,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=224370.0, ans=0.1 2023-06-18 13:53:28,620 INFO [train.py:996] (2/4) Epoch 2, batch 6900, loss[loss=0.2223, simple_loss=0.2969, pruned_loss=0.07386, over 21298.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3475, pruned_loss=0.126, over 4276116.83 frames. ], batch size: 176, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:54:17,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=224490.0, ans=0.125 2023-06-18 13:54:17,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=224490.0, ans=0.0 2023-06-18 13:54:35,609 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.34 vs. 
limit=6.0 2023-06-18 13:54:58,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=224610.0, ans=0.0 2023-06-18 13:55:00,775 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 3.118e+02 3.769e+02 5.122e+02 8.656e+02, threshold=7.539e+02, percent-clipped=2.0 2023-06-18 13:55:05,541 INFO [train.py:996] (2/4) Epoch 2, batch 6950, loss[loss=0.2313, simple_loss=0.3198, pruned_loss=0.07144, over 21639.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3483, pruned_loss=0.1218, over 4265077.49 frames. ], batch size: 263, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:55:20,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-18 13:55:24,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=224730.0, ans=0.125 2023-06-18 13:56:03,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=224850.0, ans=0.125 2023-06-18 13:56:07,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=224850.0, ans=0.125 2023-06-18 13:56:10,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=224850.0, ans=0.2 2023-06-18 13:56:13,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=224850.0, ans=0.125 2023-06-18 13:56:32,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=224910.0, ans=0.125 2023-06-18 13:56:39,909 INFO [train.py:996] (2/4) Epoch 2, batch 7000, loss[loss=0.3076, simple_loss=0.3455, pruned_loss=0.1349, over 21752.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3546, pruned_loss=0.1257, over 4268015.30 frames. ], batch size: 112, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:56:42,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=224970.0, ans=15.0 2023-06-18 13:58:06,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-18 13:58:12,222 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.411e+02 3.463e+02 4.256e+02 5.508e+02 8.252e+02, threshold=8.512e+02, percent-clipped=6.0 2023-06-18 13:58:17,011 INFO [train.py:996] (2/4) Epoch 2, batch 7050, loss[loss=0.2501, simple_loss=0.3513, pruned_loss=0.07445, over 21225.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3533, pruned_loss=0.1232, over 4267324.70 frames. ], batch size: 548, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 13:59:33,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=225450.0, ans=0.125 2023-06-18 13:59:33,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=225450.0, ans=0.125 2023-06-18 13:59:53,622 INFO [train.py:996] (2/4) Epoch 2, batch 7100, loss[loss=0.2805, simple_loss=0.3418, pruned_loss=0.1097, over 21254.00 frames. ], tot_loss[loss=0.3034, simple_loss=0.3582, pruned_loss=0.1243, over 4267420.46 frames. 
], batch size: 159, lr: 1.82e-02, grad_scale: 32.0 2023-06-18 14:00:54,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=225690.0, ans=0.125 2023-06-18 14:01:28,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 3.245e+02 4.248e+02 6.112e+02 1.073e+03, threshold=8.497e+02, percent-clipped=3.0 2023-06-18 14:01:31,269 INFO [train.py:996] (2/4) Epoch 2, batch 7150, loss[loss=0.3423, simple_loss=0.3943, pruned_loss=0.1451, over 21591.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3558, pruned_loss=0.1209, over 4261489.49 frames. ], batch size: 389, lr: 1.81e-02, grad_scale: 16.0 2023-06-18 14:02:39,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.57 vs. limit=12.0 2023-06-18 14:02:44,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=226050.0, ans=0.1 2023-06-18 14:03:02,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=226110.0, ans=0.2 2023-06-18 14:03:08,091 INFO [train.py:996] (2/4) Epoch 2, batch 7200, loss[loss=0.3064, simple_loss=0.3463, pruned_loss=0.1332, over 21806.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3586, pruned_loss=0.1245, over 4263796.12 frames. ], batch size: 352, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:03:22,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=226170.0, ans=0.125 2023-06-18 14:03:52,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-18 14:04:23,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.09 vs. limit=15.0 2023-06-18 14:04:40,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.552e+02 3.455e+02 4.315e+02 5.205e+02 7.912e+02, threshold=8.629e+02, percent-clipped=0.0 2023-06-18 14:04:47,738 INFO [train.py:996] (2/4) Epoch 2, batch 7250, loss[loss=0.3427, simple_loss=0.3521, pruned_loss=0.1666, over 21426.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3524, pruned_loss=0.1247, over 4265527.44 frames. ], batch size: 509, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:05:10,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=226470.0, ans=0.0 2023-06-18 14:05:15,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-18 14:06:28,513 INFO [train.py:996] (2/4) Epoch 2, batch 7300, loss[loss=0.3142, simple_loss=0.3441, pruned_loss=0.1421, over 21685.00 frames. ], tot_loss[loss=0.2966, simple_loss=0.3464, pruned_loss=0.1235, over 4260741.66 frames. ], batch size: 417, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:06:45,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=226770.0, ans=0.125 2023-06-18 14:07:08,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.02 vs. 
limit=15.0 2023-06-18 14:07:41,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=227010.0, ans=0.125 2023-06-18 14:08:03,960 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.035e+02 3.518e+02 4.361e+02 7.798e+02, threshold=7.035e+02, percent-clipped=0.0 2023-06-18 14:08:07,015 INFO [train.py:996] (2/4) Epoch 2, batch 7350, loss[loss=0.3302, simple_loss=0.3529, pruned_loss=0.1537, over 21200.00 frames. ], tot_loss[loss=0.2942, simple_loss=0.3427, pruned_loss=0.1228, over 4259004.55 frames. ], batch size: 608, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:09:00,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=227190.0, ans=0.025 2023-06-18 14:09:07,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=227250.0, ans=0.2 2023-06-18 14:09:43,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=227310.0, ans=0.125 2023-06-18 14:09:45,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=227310.0, ans=0.125 2023-06-18 14:09:50,803 INFO [train.py:996] (2/4) Epoch 2, batch 7400, loss[loss=0.2719, simple_loss=0.3141, pruned_loss=0.1148, over 21797.00 frames. ], tot_loss[loss=0.3031, simple_loss=0.3522, pruned_loss=0.127, over 4261525.28 frames. ], batch size: 107, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:09:51,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=227370.0, ans=0.0 2023-06-18 14:10:11,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=227430.0, ans=0.04949747468305833 2023-06-18 14:10:19,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=227430.0, ans=0.125 2023-06-18 14:10:48,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-18 14:11:25,764 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.736e+02 3.593e+02 4.529e+02 5.644e+02 1.003e+03, threshold=9.058e+02, percent-clipped=10.0 2023-06-18 14:11:29,122 INFO [train.py:996] (2/4) Epoch 2, batch 7450, loss[loss=0.3026, simple_loss=0.3353, pruned_loss=0.135, over 21246.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3523, pruned_loss=0.1251, over 4253607.67 frames. ], batch size: 144, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:11:38,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=227670.0, ans=0.0 2023-06-18 14:11:42,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2023-06-18 14:11:51,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227730.0, ans=0.1 2023-06-18 14:11:58,216 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. 
limit=15.0 2023-06-18 14:12:09,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=227790.0, ans=0.1 2023-06-18 14:12:10,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=227790.0, ans=0.2 2023-06-18 14:12:15,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=227790.0, ans=0.125 2023-06-18 14:12:44,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=227850.0, ans=0.125 2023-06-18 14:13:03,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=227910.0, ans=0.1 2023-06-18 14:13:07,167 INFO [train.py:996] (2/4) Epoch 2, batch 7500, loss[loss=0.3288, simple_loss=0.3868, pruned_loss=0.1354, over 21346.00 frames. ], tot_loss[loss=0.3055, simple_loss=0.3573, pruned_loss=0.1268, over 4256617.59 frames. ], batch size: 211, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:13:09,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=227970.0, ans=0.125 2023-06-18 14:14:42,787 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.252e+02 3.894e+02 4.815e+02 8.018e+02, threshold=7.787e+02, percent-clipped=0.0 2023-06-18 14:14:45,739 INFO [train.py:996] (2/4) Epoch 2, batch 7550, loss[loss=0.2506, simple_loss=0.3376, pruned_loss=0.08177, over 21816.00 frames. ], tot_loss[loss=0.3048, simple_loss=0.3619, pruned_loss=0.1239, over 4254931.00 frames. ], batch size: 282, lr: 1.81e-02, grad_scale: 32.0 2023-06-18 14:14:50,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=228270.0, ans=0.2 2023-06-18 14:15:20,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=228390.0, ans=0.125 2023-06-18 14:16:22,912 INFO [train.py:996] (2/4) Epoch 2, batch 7600, loss[loss=0.2396, simple_loss=0.3088, pruned_loss=0.08526, over 21206.00 frames. ], tot_loss[loss=0.3017, simple_loss=0.3597, pruned_loss=0.1218, over 4259271.63 frames. ], batch size: 176, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:16:36,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-18 14:17:36,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.76 vs. limit=6.0 2023-06-18 14:17:41,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=228810.0, ans=0.125 2023-06-18 14:17:51,321 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.725e+02 4.604e+02 5.632e+02 9.928e+02, threshold=9.208e+02, percent-clipped=8.0 2023-06-18 14:17:51,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=228810.0, ans=0.1 2023-06-18 14:17:54,593 INFO [train.py:996] (2/4) Epoch 2, batch 7650, loss[loss=0.3502, simple_loss=0.3798, pruned_loss=0.1603, over 21883.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3606, pruned_loss=0.125, over 4272146.82 frames. 
], batch size: 332, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:17:55,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=228870.0, ans=0.125 2023-06-18 14:18:08,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=12.0 2023-06-18 14:18:25,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=228990.0, ans=0.125 2023-06-18 14:18:25,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=228990.0, ans=0.125 2023-06-18 14:18:25,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=228990.0, ans=0.1 2023-06-18 14:18:54,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=229050.0, ans=0.0 2023-06-18 14:19:04,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=229050.0, ans=0.125 2023-06-18 14:19:27,532 INFO [train.py:996] (2/4) Epoch 2, batch 7700, loss[loss=0.336, simple_loss=0.379, pruned_loss=0.1465, over 21368.00 frames. ], tot_loss[loss=0.3126, simple_loss=0.365, pruned_loss=0.1301, over 4282296.98 frames. ], batch size: 159, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:19:43,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=229230.0, ans=0.125 2023-06-18 14:19:49,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-18 14:20:14,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=229290.0, ans=0.125 2023-06-18 14:20:38,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=229350.0, ans=0.125 2023-06-18 14:20:54,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=229410.0, ans=0.125 2023-06-18 14:20:59,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.34 vs. limit=22.5 2023-06-18 14:20:59,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.706e+02 4.565e+02 6.512e+02 1.080e+03, threshold=9.129e+02, percent-clipped=5.0 2023-06-18 14:21:02,954 INFO [train.py:996] (2/4) Epoch 2, batch 7750, loss[loss=0.417, simple_loss=0.5011, pruned_loss=0.1664, over 21205.00 frames. ], tot_loss[loss=0.3162, simple_loss=0.3703, pruned_loss=0.1311, over 4279291.82 frames. ], batch size: 549, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:21:26,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. limit=6.0 2023-06-18 14:21:57,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.29 vs. limit=22.5 2023-06-18 14:22:25,333 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.18 vs. 
limit=15.0 2023-06-18 14:22:40,252 INFO [train.py:996] (2/4) Epoch 2, batch 7800, loss[loss=0.3473, simple_loss=0.4042, pruned_loss=0.1452, over 21579.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3743, pruned_loss=0.1334, over 4272558.86 frames. ], batch size: 441, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:22:51,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=229770.0, ans=0.125 2023-06-18 14:22:58,565 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.23 vs. limit=22.5 2023-06-18 14:23:07,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=229830.0, ans=0.1 2023-06-18 14:23:07,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=229830.0, ans=0.125 2023-06-18 14:23:38,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=229890.0, ans=0.5 2023-06-18 14:23:48,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=229950.0, ans=0.125 2023-06-18 14:24:13,084 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.566e+02 4.138e+02 5.286e+02 1.209e+03, threshold=8.275e+02, percent-clipped=5.0 2023-06-18 14:24:16,391 INFO [train.py:996] (2/4) Epoch 2, batch 7850, loss[loss=0.2956, simple_loss=0.3598, pruned_loss=0.1157, over 21794.00 frames. ], tot_loss[loss=0.314, simple_loss=0.3658, pruned_loss=0.1311, over 4278470.98 frames. ], batch size: 352, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:24:24,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=230070.0, ans=0.125 2023-06-18 14:24:40,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=230130.0, ans=0.125 2023-06-18 14:25:55,477 INFO [train.py:996] (2/4) Epoch 2, batch 7900, loss[loss=0.281, simple_loss=0.3168, pruned_loss=0.1226, over 21878.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3608, pruned_loss=0.1298, over 4277511.99 frames. ], batch size: 98, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:25:57,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=230370.0, ans=0.1 2023-06-18 14:27:02,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0 2023-06-18 14:27:29,828 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 3.545e+02 4.486e+02 5.981e+02 1.155e+03, threshold=8.972e+02, percent-clipped=9.0 2023-06-18 14:27:32,845 INFO [train.py:996] (2/4) Epoch 2, batch 7950, loss[loss=0.2863, simple_loss=0.3502, pruned_loss=0.1112, over 21431.00 frames. ], tot_loss[loss=0.3126, simple_loss=0.3664, pruned_loss=0.1294, over 4278513.07 frames. 
], batch size: 176, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:28:29,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=230790.0, ans=0.125 2023-06-18 14:28:51,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.17 vs. limit=15.0 2023-06-18 14:29:26,740 INFO [train.py:996] (2/4) Epoch 2, batch 8000, loss[loss=0.373, simple_loss=0.418, pruned_loss=0.1639, over 21776.00 frames. ], tot_loss[loss=0.3183, simple_loss=0.3723, pruned_loss=0.1322, over 4266647.90 frames. ], batch size: 441, lr: 1.80e-02, grad_scale: 32.0 2023-06-18 14:29:29,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=230970.0, ans=0.125 2023-06-18 14:30:45,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=231210.0, ans=0.0 2023-06-18 14:30:58,750 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.385e+02 3.226e+02 3.981e+02 5.095e+02 8.184e+02, threshold=7.963e+02, percent-clipped=0.0 2023-06-18 14:31:02,184 INFO [train.py:996] (2/4) Epoch 2, batch 8050, loss[loss=0.4216, simple_loss=0.4728, pruned_loss=0.1852, over 21461.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3738, pruned_loss=0.1307, over 4264586.94 frames. ], batch size: 507, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:31:04,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=231270.0, ans=0.2 2023-06-18 14:31:07,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-18 14:32:43,069 INFO [train.py:996] (2/4) Epoch 2, batch 8100, loss[loss=0.3051, simple_loss=0.362, pruned_loss=0.1241, over 21793.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3726, pruned_loss=0.1317, over 4270372.46 frames. ], batch size: 124, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:33:01,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=231630.0, ans=0.125 2023-06-18 14:33:44,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=231690.0, ans=0.125 2023-06-18 14:34:21,410 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.933e+02 5.153e+02 6.580e+02 1.761e+03, threshold=1.031e+03, percent-clipped=12.0 2023-06-18 14:34:24,599 INFO [train.py:996] (2/4) Epoch 2, batch 8150, loss[loss=0.1886, simple_loss=0.2398, pruned_loss=0.06871, over 16694.00 frames. ], tot_loss[loss=0.3219, simple_loss=0.3777, pruned_loss=0.133, over 4255304.78 frames. ], batch size: 60, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:34:34,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=231870.0, ans=0.1 2023-06-18 14:34:39,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. 
limit=15.0 2023-06-18 14:35:12,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=231990.0, ans=0.1 2023-06-18 14:35:16,241 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.54 vs. limit=15.0 2023-06-18 14:35:27,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232050.0, ans=0.1 2023-06-18 14:35:30,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232050.0, ans=0.1 2023-06-18 14:35:35,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=232050.0, ans=0.1 2023-06-18 14:35:56,577 INFO [train.py:996] (2/4) Epoch 2, batch 8200, loss[loss=0.3318, simple_loss=0.3648, pruned_loss=0.1495, over 21612.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3684, pruned_loss=0.1291, over 4258076.63 frames. ], batch size: 415, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:36:08,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=232170.0, ans=0.1 2023-06-18 14:36:58,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.94 vs. limit=15.0 2023-06-18 14:37:22,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=232410.0, ans=0.0 2023-06-18 14:37:26,544 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.508e+02 4.438e+02 6.300e+02 1.246e+03, threshold=8.875e+02, percent-clipped=2.0 2023-06-18 14:37:29,933 INFO [train.py:996] (2/4) Epoch 2, batch 8250, loss[loss=0.3033, simple_loss=0.3825, pruned_loss=0.112, over 21699.00 frames. ], tot_loss[loss=0.3126, simple_loss=0.3684, pruned_loss=0.1284, over 4255578.02 frames. ], batch size: 247, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:38:12,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=232530.0, ans=0.125 2023-06-18 14:38:28,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=232590.0, ans=0.0 2023-06-18 14:39:04,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=232710.0, ans=0.125 2023-06-18 14:39:08,486 INFO [train.py:996] (2/4) Epoch 2, batch 8300, loss[loss=0.3002, simple_loss=0.3828, pruned_loss=0.1088, over 21201.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3652, pruned_loss=0.125, over 4257357.74 frames. 
], batch size: 548, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:39:09,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=232770.0, ans=15.0 2023-06-18 14:39:10,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=232770.0, ans=0.2 2023-06-18 14:40:38,171 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 3.024e+02 4.156e+02 5.477e+02 9.498e+02, threshold=8.312e+02, percent-clipped=2.0 2023-06-18 14:40:46,286 INFO [train.py:996] (2/4) Epoch 2, batch 8350, loss[loss=0.258, simple_loss=0.3254, pruned_loss=0.09531, over 21544.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3635, pruned_loss=0.122, over 4253515.39 frames. ], batch size: 230, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:41:15,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=233130.0, ans=0.1 2023-06-18 14:42:00,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=233310.0, ans=0.1 2023-06-18 14:42:13,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=233310.0, ans=0.125 2023-06-18 14:42:18,639 INFO [train.py:996] (2/4) Epoch 2, batch 8400, loss[loss=0.2533, simple_loss=0.327, pruned_loss=0.08976, over 21375.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3632, pruned_loss=0.1197, over 4257194.83 frames. ], batch size: 194, lr: 1.79e-02, grad_scale: 32.0 2023-06-18 14:43:47,336 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 3.200e+02 3.844e+02 5.205e+02 8.692e+02, threshold=7.689e+02, percent-clipped=1.0 2023-06-18 14:43:55,609 INFO [train.py:996] (2/4) Epoch 2, batch 8450, loss[loss=0.3217, simple_loss=0.3752, pruned_loss=0.1341, over 20771.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3611, pruned_loss=0.1196, over 4271192.18 frames. ], batch size: 608, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:44:34,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=233790.0, ans=0.0 2023-06-18 14:44:48,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=233850.0, ans=0.0 2023-06-18 14:44:51,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=233850.0, ans=0.0 2023-06-18 14:45:12,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=233910.0, ans=0.125 2023-06-18 14:45:21,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=233970.0, ans=0.1 2023-06-18 14:45:22,575 INFO [train.py:996] (2/4) Epoch 2, batch 8500, loss[loss=0.2738, simple_loss=0.3203, pruned_loss=0.1137, over 21767.00 frames. ], tot_loss[loss=0.3009, simple_loss=0.3572, pruned_loss=0.1223, over 4267493.34 frames. 
], batch size: 351, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:45:29,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=233970.0, ans=0.125 2023-06-18 14:45:42,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-18 14:45:52,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=234030.0, ans=0.125 2023-06-18 14:46:56,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.316e+02 3.972e+02 4.532e+02 9.950e+02, threshold=7.945e+02, percent-clipped=2.0 2023-06-18 14:47:05,343 INFO [train.py:996] (2/4) Epoch 2, batch 8550, loss[loss=0.3778, simple_loss=0.4444, pruned_loss=0.1555, over 21684.00 frames. ], tot_loss[loss=0.309, simple_loss=0.3638, pruned_loss=0.1271, over 4269097.03 frames. ], batch size: 414, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:47:05,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=234270.0, ans=0.125 2023-06-18 14:47:10,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=234270.0, ans=0.1 2023-06-18 14:47:27,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=234330.0, ans=0.2 2023-06-18 14:48:15,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=234450.0, ans=0.0 2023-06-18 14:48:16,414 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-06-18 14:48:42,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=234570.0, ans=0.1 2023-06-18 14:48:43,739 INFO [train.py:996] (2/4) Epoch 2, batch 8600, loss[loss=0.2991, simple_loss=0.3432, pruned_loss=0.1275, over 21083.00 frames. ], tot_loss[loss=0.3168, simple_loss=0.3713, pruned_loss=0.1311, over 4272135.75 frames. ], batch size: 607, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:49:11,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=234630.0, ans=0.125 2023-06-18 14:49:59,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=234810.0, ans=0.2 2023-06-18 14:50:10,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=234810.0, ans=0.125 2023-06-18 14:50:17,551 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.454e+02 4.150e+02 5.051e+02 9.343e+02, threshold=8.300e+02, percent-clipped=1.0 2023-06-18 14:50:20,580 INFO [train.py:996] (2/4) Epoch 2, batch 8650, loss[loss=0.2494, simple_loss=0.3166, pruned_loss=0.09105, over 21796.00 frames. ], tot_loss[loss=0.3191, simple_loss=0.3766, pruned_loss=0.1307, over 4271210.50 frames. 
], batch size: 124, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:50:28,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=234870.0, ans=0.0 2023-06-18 14:51:21,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=235050.0, ans=0.0 2023-06-18 14:51:34,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=235110.0, ans=0.125 2023-06-18 14:51:55,573 INFO [train.py:996] (2/4) Epoch 2, batch 8700, loss[loss=0.2733, simple_loss=0.315, pruned_loss=0.1158, over 21461.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.368, pruned_loss=0.1267, over 4272787.87 frames. ], batch size: 131, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:52:03,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=235170.0, ans=0.125 2023-06-18 14:52:39,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=235290.0, ans=0.125 2023-06-18 14:53:29,263 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.871e+02 3.324e+02 3.894e+02 5.284e+02 1.235e+03, threshold=7.788e+02, percent-clipped=5.0 2023-06-18 14:53:29,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=235410.0, ans=0.125 2023-06-18 14:53:32,122 INFO [train.py:996] (2/4) Epoch 2, batch 8750, loss[loss=0.3062, simple_loss=0.3469, pruned_loss=0.1327, over 21242.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3664, pruned_loss=0.1276, over 4266210.79 frames. ], batch size: 159, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:53:45,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=235470.0, ans=0.2 2023-06-18 14:54:10,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-18 14:54:18,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=235590.0, ans=0.125 2023-06-18 14:54:27,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=235650.0, ans=0.0 2023-06-18 14:54:53,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=235710.0, ans=0.125 2023-06-18 14:55:09,953 INFO [train.py:996] (2/4) Epoch 2, batch 8800, loss[loss=0.3821, simple_loss=0.4545, pruned_loss=0.1549, over 19855.00 frames. ], tot_loss[loss=0.3181, simple_loss=0.3739, pruned_loss=0.1312, over 4270407.54 frames. 
], batch size: 702, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:55:42,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=235890.0, ans=0.1 2023-06-18 14:55:59,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=235890.0, ans=0.1 2023-06-18 14:56:45,182 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.886e+02 4.865e+02 6.860e+02 1.473e+03, threshold=9.729e+02, percent-clipped=14.0 2023-06-18 14:56:48,273 INFO [train.py:996] (2/4) Epoch 2, batch 8850, loss[loss=0.3342, simple_loss=0.3921, pruned_loss=0.1382, over 21850.00 frames. ], tot_loss[loss=0.3246, simple_loss=0.3818, pruned_loss=0.1337, over 4271216.40 frames. ], batch size: 107, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:56:57,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=236070.0, ans=0.95 2023-06-18 14:57:10,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=236130.0, ans=0.015 2023-06-18 14:57:28,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.55 vs. limit=10.0 2023-06-18 14:57:59,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=236250.0, ans=0.2 2023-06-18 14:58:10,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=236310.0, ans=0.0 2023-06-18 14:58:15,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=236310.0, ans=0.125 2023-06-18 14:58:23,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=236310.0, ans=0.125 2023-06-18 14:58:26,546 INFO [train.py:996] (2/4) Epoch 2, batch 8900, loss[loss=0.3136, simple_loss=0.3788, pruned_loss=0.1242, over 21247.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.3751, pruned_loss=0.1321, over 4272710.25 frames. ], batch size: 548, lr: 1.78e-02, grad_scale: 32.0 2023-06-18 14:58:39,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=236370.0, ans=15.0 2023-06-18 14:58:45,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=236430.0, ans=0.125 2023-06-18 14:59:58,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=236610.0, ans=0.125 2023-06-18 15:00:03,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 3.292e+02 4.166e+02 5.426e+02 1.146e+03, threshold=8.333e+02, percent-clipped=5.0 2023-06-18 15:00:05,990 INFO [train.py:996] (2/4) Epoch 2, batch 8950, loss[loss=0.3335, simple_loss=0.4277, pruned_loss=0.1197, over 21214.00 frames. ], tot_loss[loss=0.3165, simple_loss=0.3738, pruned_loss=0.1296, over 4273213.24 frames. 
], batch size: 549, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:00:06,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=236670.0, ans=0.1 2023-06-18 15:00:11,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=236670.0, ans=0.0 2023-06-18 15:00:20,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=236670.0, ans=0.5 2023-06-18 15:00:21,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=236670.0, ans=0.0 2023-06-18 15:00:25,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=236730.0, ans=0.125 2023-06-18 15:00:45,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-18 15:01:07,304 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:01:42,179 INFO [train.py:996] (2/4) Epoch 2, batch 9000, loss[loss=0.402, simple_loss=0.5087, pruned_loss=0.1477, over 19801.00 frames. ], tot_loss[loss=0.3127, simple_loss=0.3672, pruned_loss=0.1291, over 4276446.86 frames. ], batch size: 702, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:01:42,179 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 15:02:02,161 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2979, simple_loss=0.3967, pruned_loss=0.09958, over 1796401.00 frames. 2023-06-18 15:02:02,162 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 15:02:42,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-18 15:03:33,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-18 15:03:36,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.853e+02 3.310e+02 4.099e+02 5.036e+02 9.465e+02, threshold=8.198e+02, percent-clipped=3.0 2023-06-18 15:03:39,422 INFO [train.py:996] (2/4) Epoch 2, batch 9050, loss[loss=0.2508, simple_loss=0.2883, pruned_loss=0.1067, over 20776.00 frames. ], tot_loss[loss=0.3067, simple_loss=0.362, pruned_loss=0.1257, over 4269862.43 frames. ], batch size: 608, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:03:53,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=237270.0, ans=0.1 2023-06-18 15:04:34,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=237390.0, ans=0.125 2023-06-18 15:05:00,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=237450.0, ans=10.0 2023-06-18 15:05:23,370 INFO [train.py:996] (2/4) Epoch 2, batch 9100, loss[loss=0.315, simple_loss=0.3844, pruned_loss=0.1228, over 19869.00 frames. ], tot_loss[loss=0.316, simple_loss=0.371, pruned_loss=0.1305, over 4272310.83 frames. 
], batch size: 703, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:06:58,941 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.924e+02 3.002e+02 3.899e+02 5.912e+02 1.285e+03, threshold=7.799e+02, percent-clipped=7.0 2023-06-18 15:06:59,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=237870.0, ans=0.1 2023-06-18 15:07:05,480 INFO [train.py:996] (2/4) Epoch 2, batch 9150, loss[loss=0.3146, simple_loss=0.3899, pruned_loss=0.1197, over 21807.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.375, pruned_loss=0.1268, over 4274056.77 frames. ], batch size: 282, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:07:09,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=237870.0, ans=0.125 2023-06-18 15:07:29,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.65 vs. limit=22.5 2023-06-18 15:08:16,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=238110.0, ans=0.125 2023-06-18 15:08:40,493 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:08:43,031 INFO [train.py:996] (2/4) Epoch 2, batch 9200, loss[loss=0.3579, simple_loss=0.4182, pruned_loss=0.1488, over 21741.00 frames. ], tot_loss[loss=0.3139, simple_loss=0.3776, pruned_loss=0.1251, over 4271349.50 frames. ], batch size: 351, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:09:58,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=238410.0, ans=0.0 2023-06-18 15:10:11,737 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:10:17,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 3.265e+02 3.893e+02 4.706e+02 1.094e+03, threshold=7.786e+02, percent-clipped=2.0 2023-06-18 15:10:18,932 INFO [train.py:996] (2/4) Epoch 2, batch 9250, loss[loss=0.3257, simple_loss=0.375, pruned_loss=0.1382, over 21502.00 frames. ], tot_loss[loss=0.317, simple_loss=0.3776, pruned_loss=0.1282, over 4263203.73 frames. ], batch size: 131, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:10:41,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=238530.0, ans=0.125 2023-06-18 15:10:50,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=238590.0, ans=0.125 2023-06-18 15:10:59,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=238590.0, ans=0.0 2023-06-18 15:11:45,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=238710.0, ans=0.1 2023-06-18 15:11:53,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=238710.0, ans=0.0 2023-06-18 15:11:59,738 INFO [train.py:996] (2/4) Epoch 2, batch 9300, loss[loss=0.3518, simple_loss=0.3915, pruned_loss=0.1561, over 21812.00 frames. ], tot_loss[loss=0.315, simple_loss=0.3725, pruned_loss=0.1287, over 4267576.19 frames. 
], batch size: 372, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:12:17,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=238830.0, ans=0.125 2023-06-18 15:12:58,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=238890.0, ans=0.2 2023-06-18 15:13:03,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=238950.0, ans=0.125 2023-06-18 15:13:37,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.699e+02 3.726e+02 4.567e+02 5.377e+02 1.117e+03, threshold=9.135e+02, percent-clipped=5.0 2023-06-18 15:13:38,732 INFO [train.py:996] (2/4) Epoch 2, batch 9350, loss[loss=0.3527, simple_loss=0.4049, pruned_loss=0.1502, over 21873.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.3802, pruned_loss=0.1295, over 4268346.44 frames. ], batch size: 118, lr: 1.77e-02, grad_scale: 32.0 2023-06-18 15:13:40,049 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.45 vs. limit=6.0 2023-06-18 15:14:16,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=239130.0, ans=0.125 2023-06-18 15:14:49,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=239250.0, ans=0.125 2023-06-18 15:15:13,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=239310.0, ans=0.0 2023-06-18 15:15:13,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=239310.0, ans=0.125 2023-06-18 15:15:17,461 INFO [train.py:996] (2/4) Epoch 2, batch 9400, loss[loss=0.3339, simple_loss=0.3705, pruned_loss=0.1486, over 20110.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.3814, pruned_loss=0.13, over 4267679.12 frames. ], batch size: 702, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:15:19,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=239370.0, ans=0.125 2023-06-18 15:16:00,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=239430.0, ans=0.125 2023-06-18 15:16:30,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=239550.0, ans=0.1 2023-06-18 15:16:30,642 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.297e-03 2023-06-18 15:16:53,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.296e+02 4.208e+02 5.207e+02 1.060e+03, threshold=8.416e+02, percent-clipped=2.0 2023-06-18 15:16:53,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=239670.0, ans=0.2 2023-06-18 15:16:54,642 INFO [train.py:996] (2/4) Epoch 2, batch 9450, loss[loss=0.2686, simple_loss=0.3146, pruned_loss=0.1113, over 21149.00 frames. ], tot_loss[loss=0.3134, simple_loss=0.3708, pruned_loss=0.128, over 4256198.60 frames. 
], batch size: 176, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:18:17,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=239910.0, ans=0.125 2023-06-18 15:18:31,553 INFO [train.py:996] (2/4) Epoch 2, batch 9500, loss[loss=0.3046, simple_loss=0.3595, pruned_loss=0.1248, over 21162.00 frames. ], tot_loss[loss=0.3075, simple_loss=0.3631, pruned_loss=0.1259, over 4260326.85 frames. ], batch size: 143, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:18:54,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=240030.0, ans=0.125 2023-06-18 15:19:00,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=240030.0, ans=0.125 2023-06-18 15:19:02,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.31 vs. limit=10.0 2023-06-18 15:19:14,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=240030.0, ans=0.07 2023-06-18 15:19:53,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=240210.0, ans=0.0 2023-06-18 15:20:02,358 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.526e+02 3.477e+02 4.438e+02 5.411e+02 9.373e+02, threshold=8.876e+02, percent-clipped=3.0 2023-06-18 15:20:04,081 INFO [train.py:996] (2/4) Epoch 2, batch 9550, loss[loss=0.3343, simple_loss=0.3981, pruned_loss=0.1352, over 21782.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3682, pruned_loss=0.1282, over 4267576.11 frames. ], batch size: 124, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:20:45,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=240330.0, ans=0.95 2023-06-18 15:20:50,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=240330.0, ans=0.125 2023-06-18 15:20:51,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=240330.0, ans=0.0 2023-06-18 15:21:17,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=240450.0, ans=0.125 2023-06-18 15:21:40,140 INFO [train.py:996] (2/4) Epoch 2, batch 9600, loss[loss=0.2554, simple_loss=0.3206, pruned_loss=0.09511, over 21658.00 frames. ], tot_loss[loss=0.3155, simple_loss=0.3708, pruned_loss=0.1301, over 4265403.22 frames. ], batch size: 230, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:21:54,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=240570.0, ans=0.125 2023-06-18 15:22:04,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-18 15:22:46,024 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.79 vs. limit=15.0 2023-06-18 15:22:49,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=12.0 2023-06-18 15:22:50,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=240750.0, ans=0.0 2023-06-18 15:23:06,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=240810.0, ans=0.125 2023-06-18 15:23:16,552 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.171e+02 3.689e+02 4.506e+02 8.293e+02, threshold=7.377e+02, percent-clipped=0.0 2023-06-18 15:23:18,130 INFO [train.py:996] (2/4) Epoch 2, batch 9650, loss[loss=0.3167, simple_loss=0.3712, pruned_loss=0.1312, over 21941.00 frames. ], tot_loss[loss=0.3158, simple_loss=0.3708, pruned_loss=0.1304, over 4268433.27 frames. ], batch size: 316, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:23:22,536 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. limit=8.0 2023-06-18 15:23:59,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.88 vs. limit=15.0 2023-06-18 15:25:00,614 INFO [train.py:996] (2/4) Epoch 2, batch 9700, loss[loss=0.387, simple_loss=0.4659, pruned_loss=0.154, over 20851.00 frames. ], tot_loss[loss=0.3189, simple_loss=0.3751, pruned_loss=0.1314, over 4275698.04 frames. ], batch size: 608, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:25:07,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=241170.0, ans=0.1 2023-06-18 15:25:48,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=241290.0, ans=0.125 2023-06-18 15:25:51,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=241290.0, ans=0.2 2023-06-18 15:25:57,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=241350.0, ans=0.0 2023-06-18 15:26:16,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=241410.0, ans=0.125 2023-06-18 15:26:35,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.207e+02 3.701e+02 4.556e+02 8.027e+02, threshold=7.401e+02, percent-clipped=3.0 2023-06-18 15:26:37,173 INFO [train.py:996] (2/4) Epoch 2, batch 9750, loss[loss=0.2948, simple_loss=0.3269, pruned_loss=0.1313, over 21333.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3676, pruned_loss=0.1292, over 4273128.04 frames. ], batch size: 473, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:26:43,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=241470.0, ans=0.125 2023-06-18 15:27:19,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=241590.0, ans=0.125 2023-06-18 15:28:08,543 INFO [train.py:996] (2/4) Epoch 2, batch 9800, loss[loss=0.326, simple_loss=0.3652, pruned_loss=0.1433, over 21427.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3671, pruned_loss=0.1294, over 4260600.16 frames. 
], batch size: 194, lr: 1.76e-02, grad_scale: 32.0 2023-06-18 15:28:36,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=241830.0, ans=0.1 2023-06-18 15:29:17,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=241950.0, ans=0.09899494936611666 2023-06-18 15:29:19,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=241950.0, ans=0.1 2023-06-18 15:29:22,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-18 15:29:43,750 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 3.313e+02 4.009e+02 5.228e+02 9.511e+02, threshold=8.018e+02, percent-clipped=4.0 2023-06-18 15:29:45,212 INFO [train.py:996] (2/4) Epoch 2, batch 9850, loss[loss=0.3029, simple_loss=0.3466, pruned_loss=0.1296, over 21834.00 frames. ], tot_loss[loss=0.3102, simple_loss=0.3624, pruned_loss=0.129, over 4264439.60 frames. ], batch size: 414, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:29:50,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=242070.0, ans=0.0 2023-06-18 15:31:22,242 INFO [train.py:996] (2/4) Epoch 2, batch 9900, loss[loss=0.2692, simple_loss=0.3066, pruned_loss=0.1158, over 21322.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3573, pruned_loss=0.1269, over 4260216.65 frames. ], batch size: 144, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:32:23,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-18 15:33:02,432 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.487e+02 4.462e+02 5.702e+02 1.060e+03, threshold=8.923e+02, percent-clipped=2.0 2023-06-18 15:33:03,928 INFO [train.py:996] (2/4) Epoch 2, batch 9950, loss[loss=0.3366, simple_loss=0.3782, pruned_loss=0.1475, over 19926.00 frames. ], tot_loss[loss=0.3118, simple_loss=0.3614, pruned_loss=0.1311, over 4257393.57 frames. ], batch size: 702, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:33:33,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=242730.0, ans=0.125 2023-06-18 15:33:44,330 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-18 15:33:50,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=242790.0, ans=0.1 2023-06-18 15:34:28,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-18 15:34:41,507 INFO [train.py:996] (2/4) Epoch 2, batch 10000, loss[loss=0.3498, simple_loss=0.3994, pruned_loss=0.1501, over 21673.00 frames. ], tot_loss[loss=0.3113, simple_loss=0.3601, pruned_loss=0.1312, over 4261906.86 frames. 
], batch size: 441, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:35:13,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=243030.0, ans=0.125 2023-06-18 15:35:36,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=243090.0, ans=0.2 2023-06-18 15:36:11,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=243210.0, ans=0.125 2023-06-18 15:36:14,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.362e+02 4.103e+02 5.165e+02 9.257e+02, threshold=8.205e+02, percent-clipped=2.0 2023-06-18 15:36:16,265 INFO [train.py:996] (2/4) Epoch 2, batch 10050, loss[loss=0.2304, simple_loss=0.2925, pruned_loss=0.08416, over 21416.00 frames. ], tot_loss[loss=0.3109, simple_loss=0.3606, pruned_loss=0.1306, over 4262540.63 frames. ], batch size: 211, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:37:00,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=243390.0, ans=0.125 2023-06-18 15:37:15,423 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:37:39,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243510.0, ans=0.1 2023-06-18 15:38:03,649 INFO [train.py:996] (2/4) Epoch 2, batch 10100, loss[loss=0.2897, simple_loss=0.3205, pruned_loss=0.1295, over 20881.00 frames. ], tot_loss[loss=0.3061, simple_loss=0.3568, pruned_loss=0.1277, over 4264520.63 frames. ], batch size: 613, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:38:15,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-18 15:38:19,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=243630.0, ans=0.1 2023-06-18 15:38:20,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-18 15:39:39,573 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.265e+02 3.952e+02 5.116e+02 8.346e+02, threshold=7.904e+02, percent-clipped=1.0 2023-06-18 15:39:41,261 INFO [train.py:996] (2/4) Epoch 2, batch 10150, loss[loss=0.33, simple_loss=0.3975, pruned_loss=0.1313, over 19973.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3638, pruned_loss=0.1304, over 4267007.62 frames. ], batch size: 702, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:40:13,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=243930.0, ans=0.1 2023-06-18 15:40:43,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=244050.0, ans=0.2 2023-06-18 15:41:09,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=244110.0, ans=0.0 2023-06-18 15:41:19,222 INFO [train.py:996] (2/4) Epoch 2, batch 10200, loss[loss=0.2717, simple_loss=0.3497, pruned_loss=0.09687, over 21619.00 frames. ], tot_loss[loss=0.3097, simple_loss=0.3636, pruned_loss=0.1279, over 4271185.67 frames. 
], batch size: 389, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:41:22,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=244170.0, ans=0.0 2023-06-18 15:41:47,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=22.5 2023-06-18 15:41:49,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244230.0, ans=0.1 2023-06-18 15:41:51,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=244230.0, ans=0.125 2023-06-18 15:42:03,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=244290.0, ans=0.1 2023-06-18 15:42:37,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=244350.0, ans=0.0 2023-06-18 15:42:45,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=244410.0, ans=10.0 2023-06-18 15:42:49,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.05 vs. limit=22.5 2023-06-18 15:42:54,653 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.912e+02 2.941e+02 3.489e+02 4.418e+02 6.706e+02, threshold=6.977e+02, percent-clipped=0.0 2023-06-18 15:42:56,139 INFO [train.py:996] (2/4) Epoch 2, batch 10250, loss[loss=0.338, simple_loss=0.3935, pruned_loss=0.1413, over 21315.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.357, pruned_loss=0.1208, over 4273044.54 frames. ], batch size: 143, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:44:15,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=244650.0, ans=0.0 2023-06-18 15:44:34,317 INFO [train.py:996] (2/4) Epoch 2, batch 10300, loss[loss=0.274, simple_loss=0.3643, pruned_loss=0.09179, over 21735.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3587, pruned_loss=0.1199, over 4277746.16 frames. ], batch size: 247, lr: 1.75e-02, grad_scale: 32.0 2023-06-18 15:46:17,223 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.346e+02 4.331e+02 5.577e+02 1.197e+03, threshold=8.662e+02, percent-clipped=10.0 2023-06-18 15:46:18,824 INFO [train.py:996] (2/4) Epoch 2, batch 10350, loss[loss=0.3509, simple_loss=0.458, pruned_loss=0.1219, over 19732.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3611, pruned_loss=0.1189, over 4271268.77 frames. 
], batch size: 702, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:46:19,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=245070.0, ans=0.09899494936611666 2023-06-18 15:46:37,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=245130.0, ans=0.0 2023-06-18 15:47:03,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=245190.0, ans=0.125 2023-06-18 15:47:08,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=245190.0, ans=0.95 2023-06-18 15:47:24,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=245250.0, ans=0.2 2023-06-18 15:48:00,715 INFO [train.py:996] (2/4) Epoch 2, batch 10400, loss[loss=0.2029, simple_loss=0.248, pruned_loss=0.07896, over 21200.00 frames. ], tot_loss[loss=0.2924, simple_loss=0.3525, pruned_loss=0.1162, over 4266421.88 frames. ], batch size: 176, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:49:07,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=245550.0, ans=0.125 2023-06-18 15:49:21,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=245550.0, ans=0.125 2023-06-18 15:49:26,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=245610.0, ans=0.125 2023-06-18 15:49:28,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=245610.0, ans=0.125 2023-06-18 15:49:40,528 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 3.447e+02 4.106e+02 4.896e+02 8.870e+02, threshold=8.213e+02, percent-clipped=2.0 2023-06-18 15:49:42,077 INFO [train.py:996] (2/4) Epoch 2, batch 10450, loss[loss=0.346, simple_loss=0.4028, pruned_loss=0.1446, over 21845.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3576, pruned_loss=0.1216, over 4265478.25 frames. ], batch size: 316, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:50:50,521 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 15:51:02,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=245910.0, ans=0.2 2023-06-18 15:51:19,849 INFO [train.py:996] (2/4) Epoch 2, batch 10500, loss[loss=0.2914, simple_loss=0.3352, pruned_loss=0.1238, over 21820.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3559, pruned_loss=0.1205, over 4264375.66 frames. ], batch size: 102, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:51:47,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. 
limit=15.0 2023-06-18 15:52:08,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=246090.0, ans=10.0 2023-06-18 15:52:09,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=246090.0, ans=0.2 2023-06-18 15:52:15,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.44 vs. limit=6.0 2023-06-18 15:52:27,826 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-18 15:52:44,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=246210.0, ans=0.125 2023-06-18 15:52:48,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=246210.0, ans=0.125 2023-06-18 15:52:54,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.201e+02 3.705e+02 4.440e+02 6.098e+02, threshold=7.409e+02, percent-clipped=0.0 2023-06-18 15:52:55,955 INFO [train.py:996] (2/4) Epoch 2, batch 10550, loss[loss=0.3146, simple_loss=0.3459, pruned_loss=0.1417, over 21913.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3505, pruned_loss=0.1201, over 4254628.49 frames. ], batch size: 373, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:53:32,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=246330.0, ans=0.07 2023-06-18 15:54:04,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=22.5 2023-06-18 15:54:14,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=246450.0, ans=0.125 2023-06-18 15:54:21,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=246510.0, ans=0.125 2023-06-18 15:54:32,895 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-06-18 15:54:33,562 INFO [train.py:996] (2/4) Epoch 2, batch 10600, loss[loss=0.2533, simple_loss=0.3308, pruned_loss=0.08785, over 21723.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.346, pruned_loss=0.1179, over 4261072.26 frames. ], batch size: 247, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:54:35,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=246570.0, ans=0.125 2023-06-18 15:54:37,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=246570.0, ans=0.125 2023-06-18 15:55:42,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.48 vs. limit=15.0 2023-06-18 15:55:54,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=246750.0, ans=0.025 2023-06-18 15:56:20,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.80 vs. 
limit=12.0 2023-06-18 15:56:21,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=246810.0, ans=0.0 2023-06-18 15:56:22,346 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 3.234e+02 3.580e+02 4.539e+02 8.323e+02, threshold=7.159e+02, percent-clipped=4.0 2023-06-18 15:56:23,912 INFO [train.py:996] (2/4) Epoch 2, batch 10650, loss[loss=0.3568, simple_loss=0.4071, pruned_loss=0.1532, over 21451.00 frames. ], tot_loss[loss=0.2922, simple_loss=0.3505, pruned_loss=0.1169, over 4255918.45 frames. ], batch size: 507, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:57:10,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-18 15:57:25,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=247050.0, ans=0.1 2023-06-18 15:57:26,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-18 15:57:50,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=247110.0, ans=0.125 2023-06-18 15:58:01,639 INFO [train.py:996] (2/4) Epoch 2, batch 10700, loss[loss=0.3324, simple_loss=0.3722, pruned_loss=0.1463, over 19864.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3491, pruned_loss=0.1181, over 4262968.64 frames. ], batch size: 702, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 15:58:14,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=247170.0, ans=0.125 2023-06-18 15:58:17,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.67 vs. limit=15.0 2023-06-18 15:59:17,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=247350.0, ans=6.0 2023-06-18 15:59:43,243 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.410e+02 4.130e+02 4.973e+02 8.640e+02, threshold=8.260e+02, percent-clipped=4.0 2023-06-18 15:59:44,830 INFO [train.py:996] (2/4) Epoch 2, batch 10750, loss[loss=0.3391, simple_loss=0.4077, pruned_loss=0.1352, over 21766.00 frames. ], tot_loss[loss=0.3055, simple_loss=0.3617, pruned_loss=0.1247, over 4262553.44 frames. ], batch size: 332, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 16:00:30,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=247590.0, ans=0.05 2023-06-18 16:01:18,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=247710.0, ans=0.125 2023-06-18 16:01:30,549 INFO [train.py:996] (2/4) Epoch 2, batch 10800, loss[loss=0.3873, simple_loss=0.4248, pruned_loss=0.1749, over 21416.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.3689, pruned_loss=0.1263, over 4271622.94 frames. 
], batch size: 471, lr: 1.74e-02, grad_scale: 32.0 2023-06-18 16:01:51,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=247830.0, ans=0.125 2023-06-18 16:03:08,334 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.160e+02 3.815e+02 4.913e+02 8.496e+02, threshold=7.629e+02, percent-clipped=1.0 2023-06-18 16:03:08,367 INFO [train.py:996] (2/4) Epoch 2, batch 10850, loss[loss=0.2814, simple_loss=0.3306, pruned_loss=0.1161, over 22032.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3683, pruned_loss=0.1261, over 4277115.64 frames. ], batch size: 103, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:03:24,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=248130.0, ans=0.125 2023-06-18 16:04:05,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=248250.0, ans=0.0 2023-06-18 16:04:17,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=248250.0, ans=0.0 2023-06-18 16:04:34,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=248310.0, ans=0.0 2023-06-18 16:04:38,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.93 vs. limit=15.0 2023-06-18 16:04:46,956 INFO [train.py:996] (2/4) Epoch 2, batch 10900, loss[loss=0.2878, simple_loss=0.3602, pruned_loss=0.1077, over 21568.00 frames. ], tot_loss[loss=0.3043, simple_loss=0.3616, pruned_loss=0.1235, over 4270482.84 frames. ], batch size: 263, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:04:47,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=248370.0, ans=0.125 2023-06-18 16:04:54,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-18 16:05:15,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.50 vs. limit=6.0 2023-06-18 16:05:43,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248550.0, ans=0.1 2023-06-18 16:06:18,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 2.990e+02 3.670e+02 4.688e+02 1.000e+03, threshold=7.341e+02, percent-clipped=2.0 2023-06-18 16:06:18,659 INFO [train.py:996] (2/4) Epoch 2, batch 10950, loss[loss=0.2975, simple_loss=0.3495, pruned_loss=0.1227, over 20051.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3555, pruned_loss=0.1209, over 4262171.32 frames. ], batch size: 703, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:07:06,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=248790.0, ans=0.2 2023-06-18 16:07:52,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=248910.0, ans=0.1 2023-06-18 16:07:55,484 INFO [train.py:996] (2/4) Epoch 2, batch 11000, loss[loss=0.3094, simple_loss=0.3442, pruned_loss=0.1373, over 21502.00 frames. 
], tot_loss[loss=0.3006, simple_loss=0.3552, pruned_loss=0.123, over 4262197.12 frames. ], batch size: 442, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:08:06,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=248970.0, ans=0.025 2023-06-18 16:08:42,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.71 vs. limit=10.0 2023-06-18 16:08:57,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=249150.0, ans=0.0 2023-06-18 16:09:32,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.446e+02 4.232e+02 5.447e+02 9.802e+02, threshold=8.463e+02, percent-clipped=9.0 2023-06-18 16:09:32,818 INFO [train.py:996] (2/4) Epoch 2, batch 11050, loss[loss=0.3163, simple_loss=0.3816, pruned_loss=0.1254, over 20755.00 frames. ], tot_loss[loss=0.3011, simple_loss=0.3532, pruned_loss=0.1245, over 4267367.01 frames. ], batch size: 607, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:09:42,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=249270.0, ans=0.125 2023-06-18 16:09:57,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.24 vs. limit=15.0 2023-06-18 16:10:36,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=249450.0, ans=0.0 2023-06-18 16:10:58,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=249510.0, ans=0.0 2023-06-18 16:11:09,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.02 vs. limit=22.5 2023-06-18 16:11:10,297 INFO [train.py:996] (2/4) Epoch 2, batch 11100, loss[loss=0.2865, simple_loss=0.3255, pruned_loss=0.1238, over 21253.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3513, pruned_loss=0.125, over 4268968.30 frames. ], batch size: 144, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:11:24,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-18 16:11:28,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=249630.0, ans=0.125 2023-06-18 16:11:48,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=249690.0, ans=0.0 2023-06-18 16:12:24,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. 
limit=15.0 2023-06-18 16:12:27,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=249750.0, ans=0.2 2023-06-18 16:12:29,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=249750.0, ans=0.0 2023-06-18 16:12:47,656 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.019e+02 3.669e+02 4.475e+02 9.197e+02, threshold=7.338e+02, percent-clipped=1.0 2023-06-18 16:12:47,686 INFO [train.py:996] (2/4) Epoch 2, batch 11150, loss[loss=0.2677, simple_loss=0.3386, pruned_loss=0.09846, over 21718.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3487, pruned_loss=0.1241, over 4269945.15 frames. ], batch size: 282, lr: 1.73e-02, grad_scale: 16.0 2023-06-18 16:14:23,568 INFO [train.py:996] (2/4) Epoch 2, batch 11200, loss[loss=0.2994, simple_loss=0.3345, pruned_loss=0.1322, over 21794.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3469, pruned_loss=0.123, over 4263476.21 frames. ], batch size: 118, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:14:25,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=250170.0, ans=0.0 2023-06-18 16:14:30,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=250170.0, ans=0.125 2023-06-18 16:14:43,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=250230.0, ans=0.0 2023-06-18 16:14:46,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=250230.0, ans=0.0 2023-06-18 16:15:23,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=22.5 2023-06-18 16:15:26,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=250350.0, ans=0.125 2023-06-18 16:15:59,450 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.566e+02 4.194e+02 5.517e+02 1.156e+03, threshold=8.389e+02, percent-clipped=11.0 2023-06-18 16:15:59,480 INFO [train.py:996] (2/4) Epoch 2, batch 11250, loss[loss=0.305, simple_loss=0.3485, pruned_loss=0.1308, over 21206.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3459, pruned_loss=0.123, over 4267837.58 frames. ], batch size: 176, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:16:47,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=250590.0, ans=0.1 2023-06-18 16:17:36,329 INFO [train.py:996] (2/4) Epoch 2, batch 11300, loss[loss=0.2932, simple_loss=0.3473, pruned_loss=0.1195, over 21865.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3471, pruned_loss=0.1224, over 4271586.32 frames. 
], batch size: 124, lr: 1.73e-02, grad_scale: 32.0 2023-06-18 16:17:41,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=250770.0, ans=0.5 2023-06-18 16:18:41,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=250950.0, ans=0.125 2023-06-18 16:19:13,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.183e+02 3.739e+02 4.623e+02 9.049e+02, threshold=7.478e+02, percent-clipped=1.0 2023-06-18 16:19:13,175 INFO [train.py:996] (2/4) Epoch 2, batch 11350, loss[loss=0.3895, simple_loss=0.4332, pruned_loss=0.1729, over 21569.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.348, pruned_loss=0.1221, over 4274365.70 frames. ], batch size: 389, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:19:21,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=251070.0, ans=0.2 2023-06-18 16:19:25,081 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:20:04,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251190.0, ans=0.1 2023-06-18 16:20:11,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.03 vs. limit=5.0 2023-06-18 16:20:17,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=251190.0, ans=0.125 2023-06-18 16:20:26,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=251250.0, ans=0.125 2023-06-18 16:20:53,114 INFO [train.py:996] (2/4) Epoch 2, batch 11400, loss[loss=0.2918, simple_loss=0.3426, pruned_loss=0.1205, over 21108.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.3566, pruned_loss=0.1261, over 4273090.03 frames. ], batch size: 143, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:21:02,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.08 vs. limit=10.0 2023-06-18 16:21:05,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=251370.0, ans=0.0 2023-06-18 16:22:08,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=251550.0, ans=0.125 2023-06-18 16:22:18,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=251610.0, ans=0.2 2023-06-18 16:22:34,136 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.522e+02 4.244e+02 5.675e+02 1.170e+03, threshold=8.488e+02, percent-clipped=5.0 2023-06-18 16:22:34,156 INFO [train.py:996] (2/4) Epoch 2, batch 11450, loss[loss=0.3389, simple_loss=0.386, pruned_loss=0.1459, over 21777.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3601, pruned_loss=0.1264, over 4271838.17 frames. 
], batch size: 298, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:22:58,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=251730.0, ans=0.0 2023-06-18 16:23:22,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=251730.0, ans=0.125 2023-06-18 16:23:45,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=251850.0, ans=0.5 2023-06-18 16:24:04,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=251910.0, ans=0.04949747468305833 2023-06-18 16:24:12,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251970.0, ans=0.1 2023-06-18 16:24:13,575 INFO [train.py:996] (2/4) Epoch 2, batch 11500, loss[loss=0.2682, simple_loss=0.3518, pruned_loss=0.09233, over 21869.00 frames. ], tot_loss[loss=0.3091, simple_loss=0.3635, pruned_loss=0.1274, over 4273098.17 frames. ], batch size: 316, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:24:20,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=251970.0, ans=0.1 2023-06-18 16:24:21,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=251970.0, ans=15.0 2023-06-18 16:24:28,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=251970.0, ans=0.2 2023-06-18 16:24:59,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.29 vs. limit=15.0 2023-06-18 16:25:52,845 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.222e+02 3.283e+02 4.041e+02 4.776e+02 1.091e+03, threshold=8.082e+02, percent-clipped=3.0 2023-06-18 16:25:52,875 INFO [train.py:996] (2/4) Epoch 2, batch 11550, loss[loss=0.2773, simple_loss=0.352, pruned_loss=0.1013, over 21646.00 frames. ], tot_loss[loss=0.312, simple_loss=0.3695, pruned_loss=0.1273, over 4268327.74 frames. ], batch size: 263, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:25:53,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=252270.0, ans=0.125 2023-06-18 16:27:48,291 INFO [train.py:996] (2/4) Epoch 2, batch 11600, loss[loss=0.4431, simple_loss=0.4993, pruned_loss=0.1935, over 21454.00 frames. ], tot_loss[loss=0.3196, simple_loss=0.3822, pruned_loss=0.1285, over 4268299.64 frames. ], batch size: 507, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:27:57,941 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:28:28,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=22.5 2023-06-18 16:28:32,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=252690.0, ans=0.125 2023-06-18 16:29:00,619 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.10 vs. 
limit=15.0 2023-06-18 16:29:25,174 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 3.666e+02 4.957e+02 6.331e+02 1.126e+03, threshold=9.914e+02, percent-clipped=8.0 2023-06-18 16:29:25,198 INFO [train.py:996] (2/4) Epoch 2, batch 11650, loss[loss=0.3795, simple_loss=0.4349, pruned_loss=0.162, over 21587.00 frames. ], tot_loss[loss=0.3199, simple_loss=0.3859, pruned_loss=0.127, over 4266381.94 frames. ], batch size: 414, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:30:01,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=252990.0, ans=0.1 2023-06-18 16:30:06,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-18 16:30:16,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=253050.0, ans=0.125 2023-06-18 16:30:48,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=253110.0, ans=0.125 2023-06-18 16:31:02,390 INFO [train.py:996] (2/4) Epoch 2, batch 11700, loss[loss=0.3456, simple_loss=0.3755, pruned_loss=0.1578, over 20064.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3779, pruned_loss=0.1263, over 4264474.40 frames. ], batch size: 702, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:31:10,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=253170.0, ans=0.125 2023-06-18 16:31:15,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=253170.0, ans=0.2 2023-06-18 16:31:21,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=253230.0, ans=0.125 2023-06-18 16:31:47,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.66 vs. limit=10.0 2023-06-18 16:31:57,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=253350.0, ans=0.0 2023-06-18 16:32:38,119 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.237e+02 3.960e+02 5.340e+02 1.578e+03, threshold=7.920e+02, percent-clipped=3.0 2023-06-18 16:32:38,140 INFO [train.py:996] (2/4) Epoch 2, batch 11750, loss[loss=0.313, simple_loss=0.3351, pruned_loss=0.1455, over 21644.00 frames. ], tot_loss[loss=0.3119, simple_loss=0.3691, pruned_loss=0.1273, over 4258294.04 frames. 
], batch size: 445, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:32:43,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=253470.0, ans=0.0 2023-06-18 16:33:13,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=253590.0, ans=0.2 2023-06-18 16:33:24,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=253650.0, ans=0.125 2023-06-18 16:33:45,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=253650.0, ans=0.125 2023-06-18 16:34:17,528 INFO [train.py:996] (2/4) Epoch 2, batch 11800, loss[loss=0.4039, simple_loss=0.4665, pruned_loss=0.1706, over 19742.00 frames. ], tot_loss[loss=0.3159, simple_loss=0.3715, pruned_loss=0.1301, over 4263165.11 frames. ], batch size: 702, lr: 1.72e-02, grad_scale: 32.0 2023-06-18 16:34:25,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=253770.0, ans=0.125 2023-06-18 16:34:28,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=253770.0, ans=0.1 2023-06-18 16:34:35,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=253830.0, ans=0.125 2023-06-18 16:35:08,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=253950.0, ans=0.0 2023-06-18 16:35:41,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=254010.0, ans=22.5 2023-06-18 16:35:57,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.227e+02 4.040e+02 5.078e+02 7.033e+02, threshold=8.080e+02, percent-clipped=0.0 2023-06-18 16:35:57,275 INFO [train.py:996] (2/4) Epoch 2, batch 11850, loss[loss=0.2548, simple_loss=0.3393, pruned_loss=0.08519, over 21799.00 frames. ], tot_loss[loss=0.3144, simple_loss=0.3723, pruned_loss=0.1282, over 4276661.53 frames. ], batch size: 332, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:36:26,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=254130.0, ans=0.125 2023-06-18 16:37:09,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=254250.0, ans=0.1 2023-06-18 16:37:14,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=254310.0, ans=0.035 2023-06-18 16:37:34,978 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-18 16:37:35,383 INFO [train.py:996] (2/4) Epoch 2, batch 11900, loss[loss=0.3512, simple_loss=0.4107, pruned_loss=0.1459, over 21403.00 frames. ], tot_loss[loss=0.3106, simple_loss=0.3717, pruned_loss=0.1247, over 4274473.47 frames. 
], batch size: 507, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:37:38,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=254370.0, ans=0.1 2023-06-18 16:38:03,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=254430.0, ans=0.0 2023-06-18 16:38:08,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.75 vs. limit=15.0 2023-06-18 16:39:13,281 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 3.213e+02 3.818e+02 4.903e+02 8.116e+02, threshold=7.635e+02, percent-clipped=1.0 2023-06-18 16:39:13,311 INFO [train.py:996] (2/4) Epoch 2, batch 11950, loss[loss=0.2281, simple_loss=0.3011, pruned_loss=0.07752, over 21226.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3722, pruned_loss=0.1208, over 4268426.52 frames. ], batch size: 176, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:39:13,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=254670.0, ans=0.2 2023-06-18 16:39:29,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=254730.0, ans=0.125 2023-06-18 16:40:49,306 INFO [train.py:996] (2/4) Epoch 2, batch 12000, loss[loss=0.2782, simple_loss=0.3201, pruned_loss=0.1181, over 19958.00 frames. ], tot_loss[loss=0.3054, simple_loss=0.3708, pruned_loss=0.12, over 4268482.35 frames. ], batch size: 702, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:40:49,307 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 16:41:05,157 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2926, simple_loss=0.3848, pruned_loss=0.1002, over 1796401.00 frames. 2023-06-18 16:41:05,158 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 16:41:11,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=254970.0, ans=0.125 2023-06-18 16:42:08,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=255090.0, ans=0.0 2023-06-18 16:42:12,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.17 vs. limit=22.5 2023-06-18 16:42:26,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=255210.0, ans=0.1 2023-06-18 16:42:36,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=255210.0, ans=0.125 2023-06-18 16:42:42,374 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.604e+02 5.059e+02 6.079e+02 1.381e+03, threshold=1.012e+03, percent-clipped=10.0 2023-06-18 16:42:42,395 INFO [train.py:996] (2/4) Epoch 2, batch 12050, loss[loss=0.3389, simple_loss=0.3891, pruned_loss=0.1444, over 21447.00 frames. ], tot_loss[loss=0.3075, simple_loss=0.3692, pruned_loss=0.1229, over 4277714.82 frames. 
], batch size: 131, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:43:55,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=255450.0, ans=0.0 2023-06-18 16:43:58,659 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 16:44:05,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=255510.0, ans=0.125 2023-06-18 16:44:06,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=255510.0, ans=0.125 2023-06-18 16:44:15,719 INFO [train.py:996] (2/4) Epoch 2, batch 12100, loss[loss=0.3382, simple_loss=0.3765, pruned_loss=0.1499, over 21786.00 frames. ], tot_loss[loss=0.3154, simple_loss=0.3742, pruned_loss=0.1282, over 4277088.93 frames. ], batch size: 247, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:44:51,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=255630.0, ans=0.125 2023-06-18 16:45:03,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=255690.0, ans=0.2 2023-06-18 16:46:01,040 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.581e+02 3.947e+02 4.820e+02 5.748e+02 9.180e+02, threshold=9.640e+02, percent-clipped=0.0 2023-06-18 16:46:01,069 INFO [train.py:996] (2/4) Epoch 2, batch 12150, loss[loss=0.3059, simple_loss=0.3977, pruned_loss=0.107, over 21264.00 frames. ], tot_loss[loss=0.3163, simple_loss=0.3759, pruned_loss=0.1283, over 4278580.85 frames. ], batch size: 548, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:46:25,937 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.72 vs. limit=15.0 2023-06-18 16:46:42,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=255990.0, ans=0.125 2023-06-18 16:47:17,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256110.0, ans=0.1 2023-06-18 16:47:47,407 INFO [train.py:996] (2/4) Epoch 2, batch 12200, loss[loss=0.3019, simple_loss=0.3491, pruned_loss=0.1273, over 21629.00 frames. ], tot_loss[loss=0.3129, simple_loss=0.3725, pruned_loss=0.1267, over 4271129.81 frames. ], batch size: 332, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:47:47,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256170.0, ans=0.1 2023-06-18 16:47:57,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=256170.0, ans=0.05 2023-06-18 16:48:07,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. 
limit=15.0 2023-06-18 16:48:15,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256230.0, ans=0.1 2023-06-18 16:48:32,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=256290.0, ans=0.0 2023-06-18 16:48:44,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=256350.0, ans=0.0 2023-06-18 16:48:46,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.39 vs. limit=15.0 2023-06-18 16:48:54,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=256410.0, ans=0.1 2023-06-18 16:49:18,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=22.5 2023-06-18 16:49:24,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 2.962e+02 3.747e+02 4.961e+02 1.098e+03, threshold=7.494e+02, percent-clipped=1.0 2023-06-18 16:49:24,927 INFO [train.py:996] (2/4) Epoch 2, batch 12250, loss[loss=0.183, simple_loss=0.2401, pruned_loss=0.06299, over 21787.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3637, pruned_loss=0.1219, over 4268113.64 frames. ], batch size: 107, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:49:28,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=256470.0, ans=0.125 2023-06-18 16:49:41,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-18 16:49:48,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=256530.0, ans=0.125 2023-06-18 16:50:07,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=256590.0, ans=0.04949747468305833 2023-06-18 16:50:25,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256710.0, ans=0.1 2023-06-18 16:50:39,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-18 16:51:02,004 INFO [train.py:996] (2/4) Epoch 2, batch 12300, loss[loss=0.2845, simple_loss=0.3478, pruned_loss=0.1107, over 21497.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3509, pruned_loss=0.1119, over 4268247.76 frames. ], batch size: 471, lr: 1.71e-02, grad_scale: 32.0 2023-06-18 16:51:29,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=256830.0, ans=0.125 2023-06-18 16:51:47,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=256950.0, ans=0.1 2023-06-18 16:52:37,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.933e+02 2.926e+02 3.592e+02 4.510e+02 1.066e+03, threshold=7.183e+02, percent-clipped=4.0 2023-06-18 16:52:38,017 INFO [train.py:996] (2/4) Epoch 2, batch 12350, loss[loss=0.3431, simple_loss=0.3991, pruned_loss=0.1435, over 21852.00 frames. 
], tot_loss[loss=0.2901, simple_loss=0.3547, pruned_loss=0.1128, over 4270083.48 frames. ], batch size: 351, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:52:38,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=257070.0, ans=0.125 2023-06-18 16:53:00,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=257130.0, ans=0.0 2023-06-18 16:53:06,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=257130.0, ans=0.2 2023-06-18 16:53:09,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=257190.0, ans=0.09899494936611666 2023-06-18 16:53:25,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=257250.0, ans=0.125 2023-06-18 16:53:26,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=257250.0, ans=0.1 2023-06-18 16:53:48,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=257310.0, ans=0.0 2023-06-18 16:53:48,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=257310.0, ans=0.2 2023-06-18 16:54:09,390 INFO [train.py:996] (2/4) Epoch 2, batch 12400, loss[loss=0.309, simple_loss=0.3637, pruned_loss=0.1272, over 21197.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3572, pruned_loss=0.1172, over 4282029.65 frames. ], batch size: 176, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:54:12,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=257370.0, ans=0.95 2023-06-18 16:55:41,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-18 16:55:43,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.681e+02 3.570e+02 4.078e+02 4.958e+02 7.763e+02, threshold=8.156e+02, percent-clipped=1.0 2023-06-18 16:55:43,245 INFO [train.py:996] (2/4) Epoch 2, batch 12450, loss[loss=0.3981, simple_loss=0.4466, pruned_loss=0.1748, over 21276.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3642, pruned_loss=0.1239, over 4286842.61 frames. ], batch size: 159, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:56:29,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=257790.0, ans=0.05 2023-06-18 16:56:37,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.37 vs. limit=10.0 2023-06-18 16:57:07,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=257910.0, ans=0.0 2023-06-18 16:57:15,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=257910.0, ans=0.1 2023-06-18 16:57:20,742 INFO [train.py:996] (2/4) Epoch 2, batch 12500, loss[loss=0.4016, simple_loss=0.4505, pruned_loss=0.1764, over 21502.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.3758, pruned_loss=0.1299, over 4288876.38 frames. 
], batch size: 471, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:57:26,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=8.0 2023-06-18 16:58:17,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=258090.0, ans=0.0 2023-06-18 16:58:34,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=258150.0, ans=0.0 2023-06-18 16:58:36,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=258150.0, ans=0.1 2023-06-18 16:58:58,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.350e+02 3.939e+02 4.917e+02 9.519e+02, threshold=7.878e+02, percent-clipped=2.0 2023-06-18 16:58:58,509 INFO [train.py:996] (2/4) Epoch 2, batch 12550, loss[loss=0.3248, simple_loss=0.3962, pruned_loss=0.1268, over 21322.00 frames. ], tot_loss[loss=0.3237, simple_loss=0.3818, pruned_loss=0.1328, over 4286166.76 frames. ], batch size: 549, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 16:59:13,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=258330.0, ans=0.0 2023-06-18 16:59:15,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=258330.0, ans=0.2 2023-06-18 16:59:16,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=258330.0, ans=0.1 2023-06-18 16:59:23,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.49 vs. limit=15.0 2023-06-18 16:59:31,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.56 vs. limit=15.0 2023-06-18 16:59:35,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=258330.0, ans=0.125 2023-06-18 17:00:07,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=258450.0, ans=0.1 2023-06-18 17:00:36,889 INFO [train.py:996] (2/4) Epoch 2, batch 12600, loss[loss=0.2649, simple_loss=0.3312, pruned_loss=0.09929, over 20762.00 frames. ], tot_loss[loss=0.3193, simple_loss=0.3798, pruned_loss=0.1294, over 4279594.39 frames. ], batch size: 608, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:01:24,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=258630.0, ans=0.125 2023-06-18 17:01:25,005 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.56 vs. limit=15.0 2023-06-18 17:01:36,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. 
limit=6.0 2023-06-18 17:01:52,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=258750.0, ans=10.0 2023-06-18 17:01:53,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=258750.0, ans=0.04949747468305833 2023-06-18 17:01:54,173 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=22.5 2023-06-18 17:02:04,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=258810.0, ans=0.0 2023-06-18 17:02:12,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.876e+02 3.494e+02 4.502e+02 7.452e+02, threshold=6.987e+02, percent-clipped=0.0 2023-06-18 17:02:12,950 INFO [train.py:996] (2/4) Epoch 2, batch 12650, loss[loss=0.317, simple_loss=0.3659, pruned_loss=0.1341, over 21671.00 frames. ], tot_loss[loss=0.3096, simple_loss=0.3707, pruned_loss=0.1243, over 4274152.43 frames. ], batch size: 473, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:02:14,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=258870.0, ans=0.125 2023-06-18 17:02:27,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=258870.0, ans=0.5 2023-06-18 17:02:30,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=258870.0, ans=0.125 2023-06-18 17:02:30,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=258870.0, ans=0.0 2023-06-18 17:02:30,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-18 17:03:03,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=258990.0, ans=0.0 2023-06-18 17:03:35,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=259110.0, ans=0.125 2023-06-18 17:03:40,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=259110.0, ans=0.0 2023-06-18 17:03:49,463 INFO [train.py:996] (2/4) Epoch 2, batch 12700, loss[loss=0.2906, simple_loss=0.3493, pruned_loss=0.1159, over 21641.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3712, pruned_loss=0.1269, over 4283076.38 frames. ], batch size: 230, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:04:24,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=259230.0, ans=0.125 2023-06-18 17:04:26,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.38 vs. 
limit=12.0 2023-06-18 17:04:59,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=259350.0, ans=0.125 2023-06-18 17:05:13,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=259410.0, ans=0.0 2023-06-18 17:05:15,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=259410.0, ans=0.125 2023-06-18 17:05:25,059 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.415e+02 4.071e+02 4.986e+02 8.988e+02, threshold=8.142e+02, percent-clipped=6.0 2023-06-18 17:05:25,081 INFO [train.py:996] (2/4) Epoch 2, batch 12750, loss[loss=0.2856, simple_loss=0.3479, pruned_loss=0.1117, over 21789.00 frames. ], tot_loss[loss=0.3162, simple_loss=0.3756, pruned_loss=0.1284, over 4287338.05 frames. ], batch size: 351, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:05:44,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=259530.0, ans=0.125 2023-06-18 17:05:53,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=259530.0, ans=0.125 2023-06-18 17:06:25,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=259590.0, ans=0.125 2023-06-18 17:06:29,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=259650.0, ans=0.2 2023-06-18 17:06:31,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=22.5 2023-06-18 17:06:42,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=259710.0, ans=0.125 2023-06-18 17:06:57,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-18 17:07:08,582 INFO [train.py:996] (2/4) Epoch 2, batch 12800, loss[loss=0.3805, simple_loss=0.4232, pruned_loss=0.1689, over 21855.00 frames. ], tot_loss[loss=0.3147, simple_loss=0.373, pruned_loss=0.1282, over 4287904.32 frames. ], batch size: 118, lr: 1.70e-02, grad_scale: 32.0 2023-06-18 17:07:21,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=259770.0, ans=0.95 2023-06-18 17:07:23,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=259770.0, ans=10.0 2023-06-18 17:07:40,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=259830.0, ans=0.07 2023-06-18 17:08:09,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=259950.0, ans=0.02 2023-06-18 17:08:46,748 INFO [train.py:996] (2/4) Epoch 2, batch 12850, loss[loss=0.3059, simple_loss=0.3963, pruned_loss=0.1077, over 19913.00 frames. ], tot_loss[loss=0.3188, simple_loss=0.3754, pruned_loss=0.131, over 4289215.48 frames. 
], batch size: 703, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:08:48,203 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 3.179e+02 3.774e+02 4.668e+02 7.829e+02, threshold=7.547e+02, percent-clipped=0.0 2023-06-18 17:10:37,524 INFO [train.py:996] (2/4) Epoch 2, batch 12900, loss[loss=0.2889, simple_loss=0.3609, pruned_loss=0.1084, over 21747.00 frames. ], tot_loss[loss=0.3105, simple_loss=0.3708, pruned_loss=0.1251, over 4278416.28 frames. ], batch size: 351, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:10:38,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=260370.0, ans=0.125 2023-06-18 17:11:52,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=260550.0, ans=0.125 2023-06-18 17:12:15,683 INFO [train.py:996] (2/4) Epoch 2, batch 12950, loss[loss=0.3573, simple_loss=0.4051, pruned_loss=0.1548, over 21709.00 frames. ], tot_loss[loss=0.3085, simple_loss=0.3712, pruned_loss=0.1229, over 4272407.26 frames. ], batch size: 298, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:12:17,128 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.935e+02 3.080e+02 3.556e+02 4.378e+02 7.837e+02, threshold=7.111e+02, percent-clipped=1.0 2023-06-18 17:12:30,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260730.0, ans=0.1 2023-06-18 17:12:37,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=260730.0, ans=0.125 2023-06-18 17:13:48,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=260910.0, ans=0.1 2023-06-18 17:13:51,819 INFO [train.py:996] (2/4) Epoch 2, batch 13000, loss[loss=0.1897, simple_loss=0.2571, pruned_loss=0.06116, over 21098.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3682, pruned_loss=0.1219, over 4261722.10 frames. ], batch size: 143, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:13:54,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-18 17:14:12,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=261030.0, ans=0.125 2023-06-18 17:14:16,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=261030.0, ans=0.125 2023-06-18 17:15:15,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=261210.0, ans=0.0 2023-06-18 17:15:18,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.80 vs. limit=10.0 2023-06-18 17:15:20,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261210.0, ans=0.1 2023-06-18 17:15:27,592 INFO [train.py:996] (2/4) Epoch 2, batch 13050, loss[loss=0.2936, simple_loss=0.3441, pruned_loss=0.1215, over 21919.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3623, pruned_loss=0.1193, over 4271133.49 frames. 
], batch size: 316, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:15:28,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=261270.0, ans=0.0 2023-06-18 17:15:29,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.088e+02 4.287e+02 5.215e+02 1.044e+03, threshold=8.575e+02, percent-clipped=6.0 2023-06-18 17:15:35,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=261270.0, ans=0.1 2023-06-18 17:15:38,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261270.0, ans=0.1 2023-06-18 17:15:40,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261270.0, ans=0.1 2023-06-18 17:16:01,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=261390.0, ans=0.0 2023-06-18 17:16:22,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=261450.0, ans=0.125 2023-06-18 17:16:49,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=261510.0, ans=0.125 2023-06-18 17:17:02,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-18 17:17:04,248 INFO [train.py:996] (2/4) Epoch 2, batch 13100, loss[loss=0.3057, simple_loss=0.3665, pruned_loss=0.1224, over 21788.00 frames. ], tot_loss[loss=0.2993, simple_loss=0.3615, pruned_loss=0.1185, over 4275835.39 frames. ], batch size: 247, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:17:06,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=261570.0, ans=0.1 2023-06-18 17:17:07,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=261570.0, ans=0.0 2023-06-18 17:17:19,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=261630.0, ans=0.125 2023-06-18 17:17:22,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=261630.0, ans=0.07 2023-06-18 17:18:31,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.27 vs. limit=22.5 2023-06-18 17:18:44,172 INFO [train.py:996] (2/4) Epoch 2, batch 13150, loss[loss=0.3364, simple_loss=0.3882, pruned_loss=0.1423, over 21359.00 frames. ], tot_loss[loss=0.3064, simple_loss=0.3666, pruned_loss=0.123, over 4273546.74 frames. 
], batch size: 471, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:18:45,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 3.617e+02 4.529e+02 5.724e+02 9.376e+02, threshold=9.058e+02, percent-clipped=0.0 2023-06-18 17:18:49,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=261870.0, ans=0.125 2023-06-18 17:19:04,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=261930.0, ans=0.0 2023-06-18 17:19:57,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-18 17:20:05,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-18 17:20:07,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262110.0, ans=0.1 2023-06-18 17:20:18,165 INFO [train.py:996] (2/4) Epoch 2, batch 13200, loss[loss=0.2963, simple_loss=0.3546, pruned_loss=0.119, over 21268.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3667, pruned_loss=0.1235, over 4274502.47 frames. ], batch size: 143, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:20:26,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-18 17:20:38,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262170.0, ans=0.1 2023-06-18 17:21:09,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=262290.0, ans=0.2 2023-06-18 17:21:09,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=262290.0, ans=0.0 2023-06-18 17:21:39,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=262410.0, ans=0.0 2023-06-18 17:21:41,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=262410.0, ans=0.125 2023-06-18 17:21:59,604 INFO [train.py:996] (2/4) Epoch 2, batch 13250, loss[loss=0.3051, simple_loss=0.3545, pruned_loss=0.1278, over 21806.00 frames. ], tot_loss[loss=0.3085, simple_loss=0.3664, pruned_loss=0.1253, over 4275215.90 frames. ], batch size: 112, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:22:06,466 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.187e+02 3.809e+02 4.597e+02 7.682e+02, threshold=7.618e+02, percent-clipped=1.0 2023-06-18 17:22:13,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.77 vs. limit=6.0 2023-06-18 17:23:15,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=262650.0, ans=0.1 2023-06-18 17:23:21,835 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:23:24,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.38 vs. 
limit=12.0 2023-06-18 17:23:43,658 INFO [train.py:996] (2/4) Epoch 2, batch 13300, loss[loss=0.254, simple_loss=0.3286, pruned_loss=0.08974, over 21746.00 frames. ], tot_loss[loss=0.3097, simple_loss=0.3689, pruned_loss=0.1252, over 4272792.30 frames. ], batch size: 247, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:24:22,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=262830.0, ans=0.1 2023-06-18 17:25:09,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=263010.0, ans=12.0 2023-06-18 17:25:25,056 INFO [train.py:996] (2/4) Epoch 2, batch 13350, loss[loss=0.2939, simple_loss=0.3413, pruned_loss=0.1233, over 16080.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3728, pruned_loss=0.1289, over 4269634.67 frames. ], batch size: 60, lr: 1.69e-02, grad_scale: 32.0 2023-06-18 17:25:26,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.186e+02 3.887e+02 4.954e+02 1.112e+03, threshold=7.774e+02, percent-clipped=6.0 2023-06-18 17:25:27,313 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:25:44,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-18 17:26:12,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=15.36 vs. limit=15.0 2023-06-18 17:26:25,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=263250.0, ans=0.2 2023-06-18 17:26:49,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=263310.0, ans=0.125 2023-06-18 17:27:08,152 INFO [train.py:996] (2/4) Epoch 2, batch 13400, loss[loss=0.3267, simple_loss=0.3725, pruned_loss=0.1404, over 21414.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3743, pruned_loss=0.1323, over 4271296.01 frames. ], batch size: 194, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:27:12,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-18 17:27:12,403 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=22.5 2023-06-18 17:27:33,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=263430.0, ans=0.05 2023-06-18 17:27:43,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-18 17:27:56,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.95 vs. limit=6.0 2023-06-18 17:28:37,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=263610.0, ans=0.125 2023-06-18 17:28:45,350 INFO [train.py:996] (2/4) Epoch 2, batch 13450, loss[loss=0.3125, simple_loss=0.3677, pruned_loss=0.1287, over 21530.00 frames. 
], tot_loss[loss=0.323, simple_loss=0.3764, pruned_loss=0.1348, over 4275048.79 frames. ], batch size: 389, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:28:45,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=263670.0, ans=0.2 2023-06-18 17:28:46,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.450e+02 3.497e+02 3.954e+02 4.827e+02 1.042e+03, threshold=7.908e+02, percent-clipped=7.0 2023-06-18 17:28:47,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=263670.0, ans=0.125 2023-06-18 17:28:56,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=263670.0, ans=10.0 2023-06-18 17:28:58,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=263670.0, ans=0.125 2023-06-18 17:29:23,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=263790.0, ans=0.125 2023-06-18 17:30:23,570 INFO [train.py:996] (2/4) Epoch 2, batch 13500, loss[loss=0.2452, simple_loss=0.2938, pruned_loss=0.09829, over 21484.00 frames. ], tot_loss[loss=0.3117, simple_loss=0.3654, pruned_loss=0.1291, over 4260782.63 frames. ], batch size: 195, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:31:41,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=264150.0, ans=0.2 2023-06-18 17:31:48,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=264210.0, ans=0.035 2023-06-18 17:32:03,928 INFO [train.py:996] (2/4) Epoch 2, batch 13550, loss[loss=0.2838, simple_loss=0.366, pruned_loss=0.1008, over 21238.00 frames. ], tot_loss[loss=0.3121, simple_loss=0.3693, pruned_loss=0.1274, over 4266566.12 frames. ], batch size: 176, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:32:05,763 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.329e+02 4.198e+02 5.480e+02 1.124e+03, threshold=8.396e+02, percent-clipped=8.0 2023-06-18 17:32:10,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=264270.0, ans=0.125 2023-06-18 17:32:38,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=264390.0, ans=0.125 2023-06-18 17:33:42,111 INFO [train.py:996] (2/4) Epoch 2, batch 13600, loss[loss=0.331, simple_loss=0.3857, pruned_loss=0.1381, over 21493.00 frames. ], tot_loss[loss=0.3148, simple_loss=0.371, pruned_loss=0.1293, over 4275145.41 frames. 
], batch size: 548, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:33:51,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=264570.0, ans=0.2 2023-06-18 17:33:58,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=264630.0, ans=0.0 2023-06-18 17:34:27,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=264690.0, ans=0.0 2023-06-18 17:35:10,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=264810.0, ans=0.125 2023-06-18 17:35:19,313 INFO [train.py:996] (2/4) Epoch 2, batch 13650, loss[loss=0.2796, simple_loss=0.3186, pruned_loss=0.1203, over 21587.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3655, pruned_loss=0.1255, over 4266939.76 frames. ], batch size: 247, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:35:20,646 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 3.001e+02 3.620e+02 4.450e+02 8.511e+02, threshold=7.240e+02, percent-clipped=1.0 2023-06-18 17:35:24,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-18 17:35:33,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=264930.0, ans=0.95 2023-06-18 17:35:58,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=264990.0, ans=0.04949747468305833 2023-06-18 17:36:03,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=264990.0, ans=0.04949747468305833 2023-06-18 17:36:47,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=265110.0, ans=0.0 2023-06-18 17:36:48,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=265110.0, ans=0.0 2023-06-18 17:36:55,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=265110.0, ans=0.1 2023-06-18 17:36:59,393 INFO [train.py:996] (2/4) Epoch 2, batch 13700, loss[loss=0.2846, simple_loss=0.3338, pruned_loss=0.1177, over 21699.00 frames. ], tot_loss[loss=0.3051, simple_loss=0.3595, pruned_loss=0.1254, over 4260755.39 frames. ], batch size: 263, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:37:01,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=265170.0, ans=0.0 2023-06-18 17:37:29,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-18 17:37:48,271 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:38:39,045 INFO [train.py:996] (2/4) Epoch 2, batch 13750, loss[loss=0.2532, simple_loss=0.3052, pruned_loss=0.1006, over 21545.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.3562, pruned_loss=0.1226, over 4266448.95 frames. 
], batch size: 195, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:38:45,114 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.651e+02 4.578e+02 5.768e+02 1.165e+03, threshold=9.156e+02, percent-clipped=11.0 2023-06-18 17:38:47,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=265470.0, ans=0.0 2023-06-18 17:39:35,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=265590.0, ans=0.125 2023-06-18 17:39:43,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=265590.0, ans=0.125 2023-06-18 17:39:43,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=265590.0, ans=0.2 2023-06-18 17:39:58,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=265650.0, ans=0.2 2023-06-18 17:40:03,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=265650.0, ans=0.125 2023-06-18 17:40:32,095 INFO [train.py:996] (2/4) Epoch 2, batch 13800, loss[loss=0.3115, simple_loss=0.3997, pruned_loss=0.1116, over 21768.00 frames. ], tot_loss[loss=0.3014, simple_loss=0.3611, pruned_loss=0.1208, over 4265882.90 frames. ], batch size: 282, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:40:32,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=265770.0, ans=0.035 2023-06-18 17:40:48,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=265770.0, ans=0.0 2023-06-18 17:41:01,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.58 vs. limit=15.0 2023-06-18 17:41:27,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=12.0 2023-06-18 17:41:50,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=266010.0, ans=0.125 2023-06-18 17:42:15,754 INFO [train.py:996] (2/4) Epoch 2, batch 13850, loss[loss=0.3946, simple_loss=0.4474, pruned_loss=0.1709, over 21303.00 frames. ], tot_loss[loss=0.3069, simple_loss=0.3679, pruned_loss=0.123, over 4259731.27 frames. ], batch size: 548, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:42:17,193 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 3.095e+02 3.814e+02 4.955e+02 1.017e+03, threshold=7.628e+02, percent-clipped=1.0 2023-06-18 17:42:21,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266070.0, ans=0.1 2023-06-18 17:42:46,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=266130.0, ans=0.0 2023-06-18 17:43:01,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=266190.0, ans=0.1 2023-06-18 17:43:03,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. 
limit=22.5 2023-06-18 17:43:07,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=266250.0, ans=0.09899494936611666 2023-06-18 17:43:32,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-18 17:43:36,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=266310.0, ans=0.0 2023-06-18 17:43:39,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=266310.0, ans=0.2 2023-06-18 17:43:53,129 INFO [train.py:996] (2/4) Epoch 2, batch 13900, loss[loss=0.3196, simple_loss=0.364, pruned_loss=0.1376, over 21881.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3728, pruned_loss=0.1267, over 4266506.62 frames. ], batch size: 351, lr: 1.68e-02, grad_scale: 32.0 2023-06-18 17:44:04,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-06-18 17:44:09,736 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:44:22,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=266430.0, ans=0.0 2023-06-18 17:44:22,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.94 vs. limit=6.0 2023-06-18 17:45:09,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=266610.0, ans=0.125 2023-06-18 17:45:34,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=266670.0, ans=0.0 2023-06-18 17:45:35,912 INFO [train.py:996] (2/4) Epoch 2, batch 13950, loss[loss=0.2969, simple_loss=0.3538, pruned_loss=0.12, over 21911.00 frames. ], tot_loss[loss=0.3146, simple_loss=0.3732, pruned_loss=0.128, over 4268746.17 frames. ], batch size: 316, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:45:37,808 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.843e+02 4.662e+02 6.006e+02 1.294e+03, threshold=9.323e+02, percent-clipped=7.0 2023-06-18 17:45:38,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=266670.0, ans=0.1 2023-06-18 17:46:00,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.58 vs. limit=22.5 2023-06-18 17:46:03,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=266730.0, ans=0.125 2023-06-18 17:47:08,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=266910.0, ans=0.0 2023-06-18 17:47:14,084 INFO [train.py:996] (2/4) Epoch 2, batch 14000, loss[loss=0.2522, simple_loss=0.3429, pruned_loss=0.08079, over 21418.00 frames. ], tot_loss[loss=0.3088, simple_loss=0.3678, pruned_loss=0.1249, over 4265209.30 frames. 
], batch size: 211, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:47:28,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=267030.0, ans=0.125 2023-06-18 17:47:50,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-18 17:48:03,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=267150.0, ans=0.2 2023-06-18 17:48:09,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=267150.0, ans=0.125 2023-06-18 17:48:29,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=267210.0, ans=0.125 2023-06-18 17:48:38,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=267210.0, ans=0.015 2023-06-18 17:48:48,759 INFO [train.py:996] (2/4) Epoch 2, batch 14050, loss[loss=0.2389, simple_loss=0.2987, pruned_loss=0.08955, over 21343.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3595, pruned_loss=0.1183, over 4261306.84 frames. ], batch size: 176, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:48:50,174 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.618e+02 2.887e+02 3.656e+02 4.389e+02 9.715e+02, threshold=7.312e+02, percent-clipped=1.0 2023-06-18 17:49:01,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=267270.0, ans=0.2 2023-06-18 17:49:19,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=267390.0, ans=0.0 2023-06-18 17:49:48,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=267450.0, ans=0.2 2023-06-18 17:50:23,577 INFO [train.py:996] (2/4) Epoch 2, batch 14100, loss[loss=0.3116, simple_loss=0.3605, pruned_loss=0.1313, over 15075.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3531, pruned_loss=0.1185, over 4253167.65 frames. ], batch size: 61, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:50:38,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-18 17:51:20,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.85 vs. limit=12.0 2023-06-18 17:51:26,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=267750.0, ans=0.1 2023-06-18 17:51:27,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=267750.0, ans=0.2 2023-06-18 17:51:52,569 INFO [train.py:996] (2/4) Epoch 2, batch 14150, loss[loss=0.3356, simple_loss=0.3946, pruned_loss=0.1382, over 21447.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3573, pruned_loss=0.1201, over 4257234.29 frames. 
], batch size: 471, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:51:59,080 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.602e+02 4.448e+02 5.500e+02 9.616e+02, threshold=8.896e+02, percent-clipped=7.0 2023-06-18 17:52:10,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-18 17:52:12,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=22.5 2023-06-18 17:52:27,144 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:52:30,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=267990.0, ans=0.0 2023-06-18 17:52:37,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=267990.0, ans=0.0 2023-06-18 17:52:42,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-18 17:52:44,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=268050.0, ans=0.0 2023-06-18 17:53:21,315 INFO [train.py:996] (2/4) Epoch 2, batch 14200, loss[loss=0.305, simple_loss=0.3975, pruned_loss=0.1062, over 20823.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3551, pruned_loss=0.1179, over 4257703.68 frames. ], batch size: 608, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:53:30,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=268170.0, ans=0.1 2023-06-18 17:54:23,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=268350.0, ans=0.025 2023-06-18 17:54:46,603 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2023-06-18 17:54:54,797 INFO [train.py:996] (2/4) Epoch 2, batch 14250, loss[loss=0.2427, simple_loss=0.2959, pruned_loss=0.09475, over 21231.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.3494, pruned_loss=0.1172, over 4263373.35 frames. ], batch size: 159, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:54:56,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 3.115e+02 4.292e+02 5.783e+02 1.043e+03, threshold=8.584e+02, percent-clipped=1.0 2023-06-18 17:55:35,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=268590.0, ans=0.0 2023-06-18 17:55:59,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=268650.0, ans=0.125 2023-06-18 17:56:24,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.64 vs. limit=22.5 2023-06-18 17:56:33,173 INFO [train.py:996] (2/4) Epoch 2, batch 14300, loss[loss=0.4367, simple_loss=0.4928, pruned_loss=0.1903, over 21671.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3549, pruned_loss=0.1175, over 4265383.54 frames. 
], batch size: 414, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:56:38,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-18 17:57:18,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=268890.0, ans=0.1 2023-06-18 17:57:36,835 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 17:58:04,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=269010.0, ans=0.125 2023-06-18 17:58:09,493 INFO [train.py:996] (2/4) Epoch 2, batch 14350, loss[loss=0.3259, simple_loss=0.3832, pruned_loss=0.1343, over 21458.00 frames. ], tot_loss[loss=0.3003, simple_loss=0.3625, pruned_loss=0.1191, over 4256185.94 frames. ], batch size: 548, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 17:58:11,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.215e+02 4.287e+02 5.391e+02 1.265e+03, threshold=8.575e+02, percent-clipped=5.0 2023-06-18 17:58:39,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.54 vs. limit=22.5 2023-06-18 17:59:07,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=269250.0, ans=0.125 2023-06-18 17:59:50,474 INFO [train.py:996] (2/4) Epoch 2, batch 14400, loss[loss=0.3164, simple_loss=0.3587, pruned_loss=0.1371, over 21857.00 frames. ], tot_loss[loss=0.3028, simple_loss=0.3612, pruned_loss=0.1222, over 4265074.58 frames. ], batch size: 107, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 18:00:02,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=269370.0, ans=0.04949747468305833 2023-06-18 18:00:13,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=269430.0, ans=0.125 2023-06-18 18:00:21,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=269430.0, ans=0.5 2023-06-18 18:00:38,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=269490.0, ans=0.2 2023-06-18 18:00:56,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=269550.0, ans=0.0 2023-06-18 18:01:15,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=269610.0, ans=0.2 2023-06-18 18:01:25,596 INFO [train.py:996] (2/4) Epoch 2, batch 14450, loss[loss=0.2839, simple_loss=0.3273, pruned_loss=0.1203, over 21251.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.3544, pruned_loss=0.1224, over 4267759.18 frames. 
], batch size: 176, lr: 1.67e-02, grad_scale: 32.0 2023-06-18 18:01:26,962 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.143e+02 3.300e+02 3.933e+02 4.836e+02 8.413e+02, threshold=7.867e+02, percent-clipped=0.0 2023-06-18 18:01:56,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=269730.0, ans=0.2 2023-06-18 18:02:06,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=22.5 2023-06-18 18:02:11,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=269790.0, ans=0.1 2023-06-18 18:02:35,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=269910.0, ans=0.2 2023-06-18 18:03:01,255 INFO [train.py:996] (2/4) Epoch 2, batch 14500, loss[loss=0.297, simple_loss=0.363, pruned_loss=0.1155, over 21246.00 frames. ], tot_loss[loss=0.2961, simple_loss=0.3503, pruned_loss=0.1209, over 4255078.20 frames. ], batch size: 548, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:03:01,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=269970.0, ans=0.1 2023-06-18 18:03:34,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=270030.0, ans=0.2 2023-06-18 18:04:43,612 INFO [train.py:996] (2/4) Epoch 2, batch 14550, loss[loss=0.345, simple_loss=0.3896, pruned_loss=0.1502, over 21361.00 frames. ], tot_loss[loss=0.3033, simple_loss=0.3574, pruned_loss=0.1246, over 4264557.04 frames. ], batch size: 549, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:04:45,448 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.923e+02 3.262e+02 3.772e+02 5.838e+02, threshold=6.523e+02, percent-clipped=0.0 2023-06-18 18:06:21,035 INFO [train.py:996] (2/4) Epoch 2, batch 14600, loss[loss=0.3242, simple_loss=0.3867, pruned_loss=0.1309, over 21560.00 frames. ], tot_loss[loss=0.3128, simple_loss=0.3663, pruned_loss=0.1297, over 4272173.66 frames. ], batch size: 230, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:06:30,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=270570.0, ans=0.0 2023-06-18 18:07:15,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=270750.0, ans=0.125 2023-06-18 18:07:49,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=270810.0, ans=0.125 2023-06-18 18:07:58,641 INFO [train.py:996] (2/4) Epoch 2, batch 14650, loss[loss=0.2627, simple_loss=0.3017, pruned_loss=0.1118, over 20880.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3686, pruned_loss=0.1282, over 4271716.66 frames. 
], batch size: 608, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:08:00,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.284e+02 3.921e+02 4.845e+02 9.187e+02, threshold=7.842e+02, percent-clipped=12.0 2023-06-18 18:08:19,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=270930.0, ans=0.125 2023-06-18 18:09:08,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=271050.0, ans=0.2 2023-06-18 18:09:26,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=271110.0, ans=0.07 2023-06-18 18:09:34,803 INFO [train.py:996] (2/4) Epoch 2, batch 14700, loss[loss=0.2387, simple_loss=0.3243, pruned_loss=0.07654, over 21682.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3571, pruned_loss=0.1177, over 4259837.26 frames. ], batch size: 247, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:10:37,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-18 18:10:59,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=271410.0, ans=0.125 2023-06-18 18:11:07,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=271410.0, ans=0.2 2023-06-18 18:11:13,586 INFO [train.py:996] (2/4) Epoch 2, batch 14750, loss[loss=0.5271, simple_loss=0.5459, pruned_loss=0.2541, over 21384.00 frames. ], tot_loss[loss=0.3059, simple_loss=0.3664, pruned_loss=0.1228, over 4266901.02 frames. ], batch size: 507, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:11:15,470 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.827e+02 3.023e+02 4.317e+02 6.398e+02 9.994e+02, threshold=8.633e+02, percent-clipped=9.0 2023-06-18 18:11:23,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=271470.0, ans=0.125 2023-06-18 18:11:28,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=271530.0, ans=0.0 2023-06-18 18:12:03,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=271590.0, ans=0.1 2023-06-18 18:12:14,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=271590.0, ans=0.0 2023-06-18 18:12:51,298 INFO [train.py:996] (2/4) Epoch 2, batch 14800, loss[loss=0.3219, simple_loss=0.3695, pruned_loss=0.1372, over 19980.00 frames. ], tot_loss[loss=0.321, simple_loss=0.3799, pruned_loss=0.1311, over 4267155.83 frames. ], batch size: 702, lr: 1.66e-02, grad_scale: 32.0 2023-06-18 18:14:06,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=271950.0, ans=22.5 2023-06-18 18:14:19,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=272010.0, ans=0.125 2023-06-18 18:14:31,716 INFO [train.py:996] (2/4) Epoch 2, batch 14850, loss[loss=0.2577, simple_loss=0.321, pruned_loss=0.09718, over 21661.00 frames. 
], tot_loss[loss=0.3163, simple_loss=0.3723, pruned_loss=0.1302, over 4268660.86 frames. ], batch size: 247, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:14:33,120 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.362e+02 3.948e+02 5.141e+02 8.278e+02, threshold=7.896e+02, percent-clipped=0.0 2023-06-18 18:15:40,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=272250.0, ans=0.125 2023-06-18 18:16:00,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=272310.0, ans=0.04949747468305833 2023-06-18 18:16:12,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=272370.0, ans=0.07 2023-06-18 18:16:13,885 INFO [train.py:996] (2/4) Epoch 2, batch 14900, loss[loss=0.3446, simple_loss=0.3967, pruned_loss=0.1463, over 21497.00 frames. ], tot_loss[loss=0.3212, simple_loss=0.376, pruned_loss=0.1332, over 4267233.15 frames. ], batch size: 194, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:16:40,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=272370.0, ans=0.0 2023-06-18 18:16:47,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=272430.0, ans=0.2 2023-06-18 18:17:18,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272550.0, ans=0.1 2023-06-18 18:17:26,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.65 vs. limit=12.0 2023-06-18 18:18:07,144 INFO [train.py:996] (2/4) Epoch 2, batch 14950, loss[loss=0.291, simple_loss=0.3543, pruned_loss=0.1138, over 21286.00 frames. ], tot_loss[loss=0.3194, simple_loss=0.3754, pruned_loss=0.1317, over 4272209.27 frames. ], batch size: 176, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:18:08,887 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.099e+02 3.898e+02 5.275e+02 1.469e+03, threshold=7.796e+02, percent-clipped=9.0 2023-06-18 18:18:17,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=272670.0, ans=0.125 2023-06-18 18:18:40,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=272790.0, ans=0.1 2023-06-18 18:19:44,472 INFO [train.py:996] (2/4) Epoch 2, batch 15000, loss[loss=0.3301, simple_loss=0.3692, pruned_loss=0.1454, over 21513.00 frames. ], tot_loss[loss=0.3224, simple_loss=0.3779, pruned_loss=0.1334, over 4263961.40 frames. ], batch size: 194, lr: 1.66e-02, grad_scale: 64.0 2023-06-18 18:19:44,473 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 18:19:59,949 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2784, simple_loss=0.3732, pruned_loss=0.09186, over 1796401.00 frames. 
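The validation entry just above reports the same three fields as the per-batch training entries: loss, simple_loss, and pruned_loss, each normalized "over N frames". Below is a minimal sketch of how a running tot_loss in that style could be maintained as a frame-weighted average; the class and the example numbers are illustrative assumptions, not code taken from train.py.

```python
# Illustrative sketch (not the actual train.py code): a frame-weighted running
# average of the per-batch loss fields, in the spirit of the
# "tot_loss[..., over N frames]" entries printed in this log.
from collections import defaultdict

class FrameWeightedAverage:
    """Accumulates value * frames per field and reports frame-weighted means."""
    def __init__(self):
        self.sums = defaultdict(float)   # field name -> sum of value * frames
        self.frames = 0.0                # total frames seen so far

    def update(self, fields: dict, num_frames: float) -> None:
        for name, value in fields.items():
            self.sums[name] += value * num_frames
        self.frames += num_frames

    def averages(self) -> dict:
        return {name: s / self.frames for name, s in self.sums.items()}

# Two made-up batches with values in the style of the logged entries:
tot = FrameWeightedAverage()
tot.update({"loss": 0.30, "simple_loss": 0.36, "pruned_loss": 0.12}, 21447.0)
tot.update({"loss": 0.29, "simple_loss": 0.35, "pruned_loss": 0.12}, 15075.0)
print(tot.averages())   # frame-weighted means, analogous to one tot_loss[...] entry
```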
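In the recurring "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." entries, the printed threshold equals the clipping scale times the middle quartile value (for example 2.0 * 3.656e+02 = 7.312e+02 in the first such entry of this section), and percent-clipped appears to track how often recent updates exceeded that threshold. The helper below is a hedged sketch of that bookkeeping; the window of norms, the function name, and the exact clipping rule are assumptions rather than code from optim.py.

```python
# Illustrative sketch (my reading of the optim.py log lines, not the actual code):
# summarize a window of recent gradient norms the way the entries above do.
import numpy as np

def summarize_grad_norms(norms, clipping_scale=2.0):
    norms = np.asarray(norms, dtype=np.float64)
    # min, 25%, median, 75%, max -- the five "grad-norm quartiles" printed in the log
    quartiles = np.quantile(norms, [0.0, 0.25, 0.5, 0.75, 1.0])
    # threshold = clipping_scale * median reproduces the logged relation,
    # e.g. 2.0 * 3.656e+02 = 7.312e+02
    threshold = clipping_scale * quartiles[2]
    percent_clipped = 100.0 * float(np.mean(norms > threshold))
    return quartiles, threshold, percent_clipped

# Made-up gradient norms, just to exercise the helper:
q, thr, pct = summarize_grad_norms([162.0, 289.0, 366.0, 439.0, 972.0])
print(q, thr, pct)   # quartiles, threshold, percent of norms above the threshold
```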
2023-06-18 18:19:59,949 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 18:20:08,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=272970.0, ans=0.125 2023-06-18 18:20:26,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=273030.0, ans=0.125 2023-06-18 18:21:39,235 INFO [train.py:996] (2/4) Epoch 2, batch 15050, loss[loss=0.3707, simple_loss=0.4545, pruned_loss=0.1434, over 21233.00 frames. ], tot_loss[loss=0.3251, simple_loss=0.3807, pruned_loss=0.1348, over 4263954.52 frames. ], batch size: 548, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:21:42,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.981e+02 3.625e+02 4.289e+02 5.445e+02 1.034e+03, threshold=8.577e+02, percent-clipped=3.0 2023-06-18 18:21:44,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=273270.0, ans=0.0 2023-06-18 18:21:52,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=273270.0, ans=0.0 2023-06-18 18:22:18,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=273390.0, ans=0.0 2023-06-18 18:22:41,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=273450.0, ans=0.125 2023-06-18 18:23:16,042 INFO [train.py:996] (2/4) Epoch 2, batch 15100, loss[loss=0.3515, simple_loss=0.4017, pruned_loss=0.1507, over 21856.00 frames. ], tot_loss[loss=0.3258, simple_loss=0.3833, pruned_loss=0.1342, over 4265945.35 frames. ], batch size: 371, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:23:29,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-18 18:23:31,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=273630.0, ans=0.1 2023-06-18 18:24:44,535 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0 2023-06-18 18:24:52,916 INFO [train.py:996] (2/4) Epoch 2, batch 15150, loss[loss=0.2555, simple_loss=0.3071, pruned_loss=0.102, over 21543.00 frames. ], tot_loss[loss=0.3244, simple_loss=0.3793, pruned_loss=0.1348, over 4267658.41 frames. 
], batch size: 132, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:24:56,352 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.235e+02 3.924e+02 4.419e+02 1.242e+03, threshold=7.848e+02, percent-clipped=3.0 2023-06-18 18:25:28,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=273930.0, ans=0.0 2023-06-18 18:25:59,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=274050.0, ans=0.07 2023-06-18 18:26:12,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=274110.0, ans=0.125 2023-06-18 18:26:14,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=274110.0, ans=0.125 2023-06-18 18:26:22,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=274110.0, ans=0.0 2023-06-18 18:26:29,419 INFO [train.py:996] (2/4) Epoch 2, batch 15200, loss[loss=0.3045, simple_loss=0.3739, pruned_loss=0.1176, over 21418.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3694, pruned_loss=0.1289, over 4265239.94 frames. ], batch size: 507, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:26:33,518 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.96 vs. limit=22.5 2023-06-18 18:26:44,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=274230.0, ans=0.95 2023-06-18 18:27:02,510 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-18 18:27:08,403 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-18 18:27:30,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=274290.0, ans=0.0 2023-06-18 18:28:05,734 INFO [train.py:996] (2/4) Epoch 2, batch 15250, loss[loss=0.2714, simple_loss=0.3249, pruned_loss=0.1089, over 21568.00 frames. ], tot_loss[loss=0.3079, simple_loss=0.3621, pruned_loss=0.1268, over 4262045.24 frames. ], batch size: 263, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:28:08,693 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.957e+02 3.425e+02 4.072e+02 6.895e+02, threshold=6.850e+02, percent-clipped=0.0 2023-06-18 18:28:09,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=274470.0, ans=0.125 2023-06-18 18:29:01,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=274590.0, ans=0.2 2023-06-18 18:29:24,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=274650.0, ans=0.125 2023-06-18 18:29:43,131 INFO [train.py:996] (2/4) Epoch 2, batch 15300, loss[loss=0.3366, simple_loss=0.3888, pruned_loss=0.1421, over 21319.00 frames. ], tot_loss[loss=0.3139, simple_loss=0.3665, pruned_loss=0.1307, over 4260629.34 frames. 
], batch size: 176, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:30:28,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=274830.0, ans=0.125 2023-06-18 18:31:02,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=275010.0, ans=0.125 2023-06-18 18:31:19,243 INFO [train.py:996] (2/4) Epoch 2, batch 15350, loss[loss=0.3028, simple_loss=0.3815, pruned_loss=0.112, over 21769.00 frames. ], tot_loss[loss=0.3205, simple_loss=0.3737, pruned_loss=0.1337, over 4267999.55 frames. ], batch size: 247, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:31:22,366 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.404e+02 4.041e+02 5.108e+02 1.058e+03, threshold=8.082e+02, percent-clipped=7.0 2023-06-18 18:32:06,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=275190.0, ans=0.125 2023-06-18 18:32:08,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=275190.0, ans=0.0 2023-06-18 18:32:12,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275190.0, ans=0.1 2023-06-18 18:32:15,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=275190.0, ans=0.0 2023-06-18 18:32:18,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=275250.0, ans=0.1 2023-06-18 18:32:33,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=275310.0, ans=0.04949747468305833 2023-06-18 18:32:49,586 INFO [train.py:996] (2/4) Epoch 2, batch 15400, loss[loss=0.2814, simple_loss=0.3353, pruned_loss=0.1138, over 21852.00 frames. ], tot_loss[loss=0.3179, simple_loss=0.3735, pruned_loss=0.1311, over 4265079.51 frames. ], batch size: 282, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:33:14,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=275430.0, ans=0.1 2023-06-18 18:33:53,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=275490.0, ans=0.5 2023-06-18 18:33:55,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=275550.0, ans=0.125 2023-06-18 18:34:27,593 INFO [train.py:996] (2/4) Epoch 2, batch 15450, loss[loss=0.3116, simple_loss=0.3591, pruned_loss=0.132, over 21900.00 frames. ], tot_loss[loss=0.3139, simple_loss=0.3691, pruned_loss=0.1293, over 4270705.65 frames. ], batch size: 107, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:34:30,685 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.287e+02 3.877e+02 4.804e+02 8.434e+02, threshold=7.754e+02, percent-clipped=1.0 2023-06-18 18:34:36,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. 
limit=15.0 2023-06-18 18:34:48,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=275730.0, ans=0.0 2023-06-18 18:35:48,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5 2023-06-18 18:35:49,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=275910.0, ans=0.0 2023-06-18 18:36:05,300 INFO [train.py:996] (2/4) Epoch 2, batch 15500, loss[loss=0.2884, simple_loss=0.3787, pruned_loss=0.09906, over 21321.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3683, pruned_loss=0.1281, over 4264913.14 frames. ], batch size: 548, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:36:13,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=275970.0, ans=0.0 2023-06-18 18:37:03,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=276090.0, ans=0.125 2023-06-18 18:37:11,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=276150.0, ans=0.125 2023-06-18 18:37:24,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=276150.0, ans=0.1 2023-06-18 18:37:28,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=276210.0, ans=0.125 2023-06-18 18:37:47,912 INFO [train.py:996] (2/4) Epoch 2, batch 15550, loss[loss=0.2617, simple_loss=0.3244, pruned_loss=0.09949, over 21207.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3664, pruned_loss=0.1251, over 4268096.97 frames. ], batch size: 159, lr: 1.65e-02, grad_scale: 32.0 2023-06-18 18:37:51,080 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 3.141e+02 3.857e+02 4.872e+02 7.208e+02, threshold=7.715e+02, percent-clipped=0.0 2023-06-18 18:38:24,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.44 vs. limit=6.0 2023-06-18 18:38:25,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=19.66 vs. limit=15.0 2023-06-18 18:38:40,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=276390.0, ans=0.125 2023-06-18 18:38:45,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=276390.0, ans=0.5 2023-06-18 18:39:10,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=276510.0, ans=0.125 2023-06-18 18:39:24,630 INFO [train.py:996] (2/4) Epoch 2, batch 15600, loss[loss=0.3204, simple_loss=0.3795, pruned_loss=0.1306, over 21550.00 frames. ], tot_loss[loss=0.3042, simple_loss=0.3605, pruned_loss=0.124, over 4270791.94 frames. 
], batch size: 230, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:40:18,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=276690.0, ans=0.125 2023-06-18 18:40:22,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.73 vs. limit=6.0 2023-06-18 18:41:03,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=276810.0, ans=0.125 2023-06-18 18:41:13,315 INFO [train.py:996] (2/4) Epoch 2, batch 15650, loss[loss=0.3284, simple_loss=0.369, pruned_loss=0.1439, over 20777.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3589, pruned_loss=0.1228, over 4269972.69 frames. ], batch size: 611, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:41:16,416 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.176e+02 3.146e+02 3.937e+02 5.420e+02 1.080e+03, threshold=7.874e+02, percent-clipped=10.0 2023-06-18 18:42:49,657 INFO [train.py:996] (2/4) Epoch 2, batch 15700, loss[loss=0.2285, simple_loss=0.3118, pruned_loss=0.07262, over 21420.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3557, pruned_loss=0.1219, over 4265444.28 frames. ], batch size: 211, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:42:56,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-18 18:42:57,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=277170.0, ans=0.125 2023-06-18 18:43:14,119 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:43:20,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=277230.0, ans=0.5 2023-06-18 18:43:34,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.82 vs. limit=6.0 2023-06-18 18:43:46,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-18 18:44:19,895 INFO [train.py:996] (2/4) Epoch 2, batch 15750, loss[loss=0.2757, simple_loss=0.3284, pruned_loss=0.1115, over 21756.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3503, pruned_loss=0.1211, over 4274143.75 frames. ], batch size: 112, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:44:28,128 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 3.151e+02 3.926e+02 5.162e+02 7.477e+02, threshold=7.853e+02, percent-clipped=0.0 2023-06-18 18:44:45,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=277530.0, ans=0.125 2023-06-18 18:44:47,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=277530.0, ans=0.2 2023-06-18 18:45:17,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=277650.0, ans=0.125 2023-06-18 18:45:17,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. 
limit=10.0 2023-06-18 18:45:30,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=277710.0, ans=0.0 2023-06-18 18:45:48,061 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.85 vs. limit=15.0 2023-06-18 18:46:01,542 INFO [train.py:996] (2/4) Epoch 2, batch 15800, loss[loss=0.2437, simple_loss=0.3034, pruned_loss=0.09197, over 21696.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3458, pruned_loss=0.1198, over 4278598.84 frames. ], batch size: 282, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:46:36,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=277890.0, ans=0.125 2023-06-18 18:46:41,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=277890.0, ans=0.1 2023-06-18 18:46:43,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.08 vs. limit=22.5 2023-06-18 18:47:12,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=278010.0, ans=0.035 2023-06-18 18:47:34,335 INFO [train.py:996] (2/4) Epoch 2, batch 15850, loss[loss=0.3106, simple_loss=0.3562, pruned_loss=0.1325, over 21928.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3518, pruned_loss=0.1236, over 4264968.40 frames. ], batch size: 317, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:47:37,252 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.427e+02 3.232e+02 3.999e+02 5.094e+02 1.481e+03, threshold=7.998e+02, percent-clipped=6.0 2023-06-18 18:47:51,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=278070.0, ans=0.1 2023-06-18 18:48:36,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=278250.0, ans=0.0 2023-06-18 18:49:10,699 INFO [train.py:996] (2/4) Epoch 2, batch 15900, loss[loss=0.2984, simple_loss=0.3529, pruned_loss=0.122, over 21865.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3491, pruned_loss=0.1227, over 4267001.10 frames. ], batch size: 98, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:49:14,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=278370.0, ans=0.125 2023-06-18 18:49:53,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=278490.0, ans=0.125 2023-06-18 18:50:26,644 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 18:50:41,668 INFO [train.py:996] (2/4) Epoch 2, batch 15950, loss[loss=0.3145, simple_loss=0.3672, pruned_loss=0.1309, over 21775.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3488, pruned_loss=0.1193, over 4266662.22 frames. ], batch size: 112, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:50:49,857 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 3.268e+02 3.930e+02 5.229e+02 1.203e+03, threshold=7.860e+02, percent-clipped=8.0 2023-06-18 18:51:49,970 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=12.09 vs. 
limit=10.0 2023-06-18 18:52:18,713 INFO [train.py:996] (2/4) Epoch 2, batch 16000, loss[loss=0.2608, simple_loss=0.342, pruned_loss=0.08975, over 21886.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3486, pruned_loss=0.1169, over 4262925.19 frames. ], batch size: 316, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:52:30,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-18 18:53:13,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=279150.0, ans=0.125 2023-06-18 18:53:24,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=279150.0, ans=0.2 2023-06-18 18:53:49,674 INFO [train.py:996] (2/4) Epoch 2, batch 16050, loss[loss=0.2924, simple_loss=0.3721, pruned_loss=0.1063, over 21717.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.349, pruned_loss=0.1138, over 4267726.22 frames. ], batch size: 298, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:53:57,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.839e+02 3.161e+02 3.861e+02 4.688e+02 7.896e+02, threshold=7.722e+02, percent-clipped=1.0 2023-06-18 18:54:09,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=279270.0, ans=0.2 2023-06-18 18:54:27,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=279330.0, ans=0.2 2023-06-18 18:54:29,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=279330.0, ans=0.125 2023-06-18 18:54:41,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=279390.0, ans=0.09899494936611666 2023-06-18 18:54:55,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=279450.0, ans=0.1 2023-06-18 18:55:25,229 INFO [train.py:996] (2/4) Epoch 2, batch 16100, loss[loss=0.2706, simple_loss=0.3323, pruned_loss=0.1044, over 21831.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3524, pruned_loss=0.1164, over 4276642.51 frames. ], batch size: 282, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:55:25,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=279570.0, ans=0.2 2023-06-18 18:56:23,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=279750.0, ans=0.05 2023-06-18 18:56:49,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=279810.0, ans=0.125 2023-06-18 18:56:52,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=279810.0, ans=0.125 2023-06-18 18:56:55,120 INFO [train.py:996] (2/4) Epoch 2, batch 16150, loss[loss=0.3377, simple_loss=0.3848, pruned_loss=0.1453, over 21892.00 frames. ], tot_loss[loss=0.2964, simple_loss=0.3533, pruned_loss=0.1197, over 4288950.94 frames. 
], batch size: 124, lr: 1.64e-02, grad_scale: 32.0 2023-06-18 18:57:02,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.360e+02 4.498e+02 6.275e+02 1.287e+03, threshold=8.996e+02, percent-clipped=10.0 2023-06-18 18:57:14,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-06-18 18:57:31,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=279930.0, ans=0.125 2023-06-18 18:57:34,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=279930.0, ans=0.1 2023-06-18 18:57:36,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=279930.0, ans=0.125 2023-06-18 18:57:50,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=279990.0, ans=0.125 2023-06-18 18:58:06,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=280050.0, ans=0.125 2023-06-18 18:58:09,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=280050.0, ans=0.1 2023-06-18 18:58:11,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=280050.0, ans=0.125 2023-06-18 18:58:21,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-18 18:58:22,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=280110.0, ans=0.0 2023-06-18 18:58:35,870 INFO [train.py:996] (2/4) Epoch 2, batch 16200, loss[loss=0.319, simple_loss=0.3823, pruned_loss=0.1279, over 21882.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3607, pruned_loss=0.1234, over 4291965.81 frames. ], batch size: 371, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 18:58:58,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=280170.0, ans=0.0 2023-06-18 18:59:17,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=280290.0, ans=0.125 2023-06-18 18:59:18,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.27 vs. limit=6.0 2023-06-18 18:59:29,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=280290.0, ans=0.125 2023-06-18 19:00:00,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0 2023-06-18 19:00:17,022 INFO [train.py:996] (2/4) Epoch 2, batch 16250, loss[loss=0.2017, simple_loss=0.2666, pruned_loss=0.06839, over 16225.00 frames. ], tot_loss[loss=0.3038, simple_loss=0.3608, pruned_loss=0.1235, over 4278883.75 frames. 
], batch size: 60, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:00:19,948 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 2.939e+02 3.512e+02 5.146e+02 1.306e+03, threshold=7.023e+02, percent-clipped=4.0 2023-06-18 19:01:16,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=280650.0, ans=0.025 2023-06-18 19:01:22,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=280710.0, ans=0.0 2023-06-18 19:01:48,560 INFO [train.py:996] (2/4) Epoch 2, batch 16300, loss[loss=0.333, simple_loss=0.4002, pruned_loss=0.1329, over 20988.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3549, pruned_loss=0.1181, over 4265714.16 frames. ], batch size: 607, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:02:06,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=280770.0, ans=0.1 2023-06-18 19:02:17,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=280830.0, ans=0.125 2023-06-18 19:02:44,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.90 vs. limit=10.0 2023-06-18 19:03:31,626 INFO [train.py:996] (2/4) Epoch 2, batch 16350, loss[loss=0.3139, simple_loss=0.3698, pruned_loss=0.1291, over 21714.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3554, pruned_loss=0.1192, over 4262879.85 frames. ], batch size: 298, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:03:34,722 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 3.295e+02 4.517e+02 5.318e+02 1.033e+03, threshold=9.034e+02, percent-clipped=9.0 2023-06-18 19:03:55,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=281130.0, ans=0.2 2023-06-18 19:04:46,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=281310.0, ans=0.1 2023-06-18 19:04:54,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=281310.0, ans=0.1 2023-06-18 19:05:08,347 INFO [train.py:996] (2/4) Epoch 2, batch 16400, loss[loss=0.3824, simple_loss=0.4089, pruned_loss=0.1779, over 21673.00 frames. ], tot_loss[loss=0.3025, simple_loss=0.3606, pruned_loss=0.1222, over 4272571.49 frames. ], batch size: 507, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:05:17,956 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:05:33,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=281430.0, ans=0.0 2023-06-18 19:06:14,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=281610.0, ans=0.0 2023-06-18 19:06:43,720 INFO [train.py:996] (2/4) Epoch 2, batch 16450, loss[loss=0.2523, simple_loss=0.3064, pruned_loss=0.09907, over 21264.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3615, pruned_loss=0.1244, over 4281701.09 frames. 
], batch size: 608, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:06:46,796 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.379e+02 3.250e+02 3.733e+02 4.517e+02 7.172e+02, threshold=7.466e+02, percent-clipped=0.0 2023-06-18 19:07:00,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=281730.0, ans=0.2 2023-06-18 19:07:10,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=281730.0, ans=0.0 2023-06-18 19:07:34,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=281850.0, ans=0.125 2023-06-18 19:07:57,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=15.30 vs. limit=15.0 2023-06-18 19:08:10,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=281910.0, ans=0.2 2023-06-18 19:08:18,624 INFO [train.py:996] (2/4) Epoch 2, batch 16500, loss[loss=0.2252, simple_loss=0.2745, pruned_loss=0.08796, over 21352.00 frames. ], tot_loss[loss=0.3072, simple_loss=0.3629, pruned_loss=0.1258, over 4278308.73 frames. ], batch size: 194, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:08:19,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-18 19:08:35,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=282030.0, ans=0.125 2023-06-18 19:08:50,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=282090.0, ans=0.125 2023-06-18 19:08:57,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=282090.0, ans=0.0 2023-06-18 19:09:34,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=282150.0, ans=0.0 2023-06-18 19:09:37,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=282150.0, ans=0.05 2023-06-18 19:09:37,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=282150.0, ans=0.1 2023-06-18 19:09:51,375 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:09:57,180 INFO [train.py:996] (2/4) Epoch 2, batch 16550, loss[loss=0.3292, simple_loss=0.3948, pruned_loss=0.1318, over 21275.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3599, pruned_loss=0.1214, over 4277183.53 frames. ], batch size: 548, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:10:00,342 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 3.451e+02 4.279e+02 5.005e+02 9.425e+02, threshold=8.558e+02, percent-clipped=2.0 2023-06-18 19:10:13,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=282330.0, ans=0.1 2023-06-18 19:10:44,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.65 vs. 
limit=12.0 2023-06-18 19:11:03,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=282450.0, ans=0.125 2023-06-18 19:11:36,200 INFO [train.py:996] (2/4) Epoch 2, batch 16600, loss[loss=0.3821, simple_loss=0.4529, pruned_loss=0.1557, over 21636.00 frames. ], tot_loss[loss=0.3113, simple_loss=0.37, pruned_loss=0.1263, over 4283257.32 frames. ], batch size: 389, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:11:50,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=282630.0, ans=0.0 2023-06-18 19:11:55,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=282630.0, ans=0.125 2023-06-18 19:11:57,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=282630.0, ans=0.0 2023-06-18 19:12:11,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=282630.0, ans=0.05 2023-06-18 19:12:24,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=282690.0, ans=0.125 2023-06-18 19:12:34,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. limit=6.0 2023-06-18 19:12:35,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=282690.0, ans=0.125 2023-06-18 19:13:12,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=282810.0, ans=0.125 2023-06-18 19:13:15,419 INFO [train.py:996] (2/4) Epoch 2, batch 16650, loss[loss=0.4119, simple_loss=0.4433, pruned_loss=0.1902, over 21310.00 frames. ], tot_loss[loss=0.3207, simple_loss=0.38, pruned_loss=0.1307, over 4283659.08 frames. ], batch size: 507, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:13:18,774 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.486e+02 3.999e+02 5.162e+02 7.260e+02, threshold=7.998e+02, percent-clipped=0.0 2023-06-18 19:13:19,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=282870.0, ans=0.2 2023-06-18 19:13:58,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=14.59 vs. limit=15.0 2023-06-18 19:14:18,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=282990.0, ans=0.125 2023-06-18 19:15:10,029 INFO [train.py:996] (2/4) Epoch 2, batch 16700, loss[loss=0.2477, simple_loss=0.3014, pruned_loss=0.09702, over 21120.00 frames. ], tot_loss[loss=0.3186, simple_loss=0.3775, pruned_loss=0.1298, over 4277660.24 frames. ], batch size: 143, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:15:33,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=283230.0, ans=0.0 2023-06-18 19:16:52,011 INFO [train.py:996] (2/4) Epoch 2, batch 16750, loss[loss=0.3365, simple_loss=0.3863, pruned_loss=0.1433, over 21257.00 frames. ], tot_loss[loss=0.3226, simple_loss=0.3815, pruned_loss=0.1318, over 4272410.71 frames. 
], batch size: 143, lr: 1.63e-02, grad_scale: 32.0 2023-06-18 19:16:55,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.344e+02 3.957e+02 4.930e+02 8.837e+02, threshold=7.914e+02, percent-clipped=3.0 2023-06-18 19:16:57,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=283470.0, ans=0.0 2023-06-18 19:17:01,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=283470.0, ans=0.125 2023-06-18 19:17:09,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=283530.0, ans=0.125 2023-06-18 19:18:14,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-18 19:18:30,467 INFO [train.py:996] (2/4) Epoch 2, batch 16800, loss[loss=0.3183, simple_loss=0.3835, pruned_loss=0.1265, over 20618.00 frames. ], tot_loss[loss=0.3247, simple_loss=0.3849, pruned_loss=0.1323, over 4272667.35 frames. ], batch size: 607, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:19:17,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=283890.0, ans=0.1 2023-06-18 19:19:54,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=284010.0, ans=0.125 2023-06-18 19:19:58,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=284010.0, ans=0.0 2023-06-18 19:20:06,235 INFO [train.py:996] (2/4) Epoch 2, batch 16850, loss[loss=0.3075, simple_loss=0.3582, pruned_loss=0.1284, over 21920.00 frames. ], tot_loss[loss=0.3235, simple_loss=0.3809, pruned_loss=0.133, over 4276693.61 frames. ], batch size: 414, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:20:09,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 3.797e+02 4.349e+02 5.361e+02 8.347e+02, threshold=8.698e+02, percent-clipped=3.0 2023-06-18 19:20:22,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=284130.0, ans=0.125 2023-06-18 19:20:31,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=284130.0, ans=0.125 2023-06-18 19:20:53,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=12.0 2023-06-18 19:21:13,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=284250.0, ans=0.1 2023-06-18 19:21:16,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=284250.0, ans=0.125 2023-06-18 19:21:37,633 INFO [train.py:996] (2/4) Epoch 2, batch 16900, loss[loss=0.2164, simple_loss=0.2734, pruned_loss=0.07971, over 21201.00 frames. ], tot_loss[loss=0.3177, simple_loss=0.374, pruned_loss=0.1307, over 4285523.22 frames. 
], batch size: 176, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:22:04,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=284430.0, ans=0.0 2023-06-18 19:22:05,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-18 19:23:02,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-18 19:23:02,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.80 vs. limit=6.0 2023-06-18 19:23:13,328 INFO [train.py:996] (2/4) Epoch 2, batch 16950, loss[loss=0.2646, simple_loss=0.3304, pruned_loss=0.09938, over 20083.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3695, pruned_loss=0.1289, over 4280417.74 frames. ], batch size: 703, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:23:16,638 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.116e+02 4.142e+02 5.213e+02 8.477e+02, threshold=8.284e+02, percent-clipped=0.0 2023-06-18 19:23:20,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=284670.0, ans=0.125 2023-06-18 19:23:23,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=284670.0, ans=0.2 2023-06-18 19:24:06,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=284790.0, ans=0.125 2023-06-18 19:24:35,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=284910.0, ans=0.2 2023-06-18 19:24:38,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=284910.0, ans=0.0 2023-06-18 19:24:39,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.61 vs. limit=22.5 2023-06-18 19:24:41,988 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-18 19:24:45,721 INFO [train.py:996] (2/4) Epoch 2, batch 17000, loss[loss=0.3146, simple_loss=0.3663, pruned_loss=0.1314, over 21860.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3664, pruned_loss=0.1293, over 4282977.20 frames. ], batch size: 118, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:25:02,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-18 19:25:31,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=285090.0, ans=0.5 2023-06-18 19:26:22,077 INFO [train.py:996] (2/4) Epoch 2, batch 17050, loss[loss=0.3076, simple_loss=0.3759, pruned_loss=0.1197, over 21244.00 frames. ], tot_loss[loss=0.3201, simple_loss=0.375, pruned_loss=0.1326, over 4288248.92 frames. 
], batch size: 159, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:26:25,254 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.160e+02 3.853e+02 4.663e+02 1.166e+03, threshold=7.706e+02, percent-clipped=2.0 2023-06-18 19:26:27,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=285270.0, ans=0.125 2023-06-18 19:26:30,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=285270.0, ans=0.0 2023-06-18 19:26:42,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=285330.0, ans=0.0 2023-06-18 19:26:57,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=285330.0, ans=0.125 2023-06-18 19:27:17,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=285390.0, ans=0.0 2023-06-18 19:27:43,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285510.0, ans=0.1 2023-06-18 19:27:54,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-18 19:27:56,607 INFO [train.py:996] (2/4) Epoch 2, batch 17100, loss[loss=0.3303, simple_loss=0.375, pruned_loss=0.1428, over 21850.00 frames. ], tot_loss[loss=0.3197, simple_loss=0.3728, pruned_loss=0.1333, over 4282187.58 frames. ], batch size: 414, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:28:09,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=285570.0, ans=0.1 2023-06-18 19:28:10,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=285630.0, ans=0.0 2023-06-18 19:28:17,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-18 19:28:19,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-18 19:28:19,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=285630.0, ans=0.125 2023-06-18 19:29:05,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=285750.0, ans=0.125 2023-06-18 19:29:14,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=285810.0, ans=0.125 2023-06-18 19:29:26,177 INFO [train.py:996] (2/4) Epoch 2, batch 17150, loss[loss=0.275, simple_loss=0.3377, pruned_loss=0.1061, over 21707.00 frames. ], tot_loss[loss=0.3153, simple_loss=0.3679, pruned_loss=0.1314, over 4289611.29 frames. 
], batch size: 389, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:29:29,305 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.459e+02 4.285e+02 5.134e+02 9.644e+02, threshold=8.570e+02, percent-clipped=5.0 2023-06-18 19:29:32,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=285870.0, ans=0.0 2023-06-18 19:29:36,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=285870.0, ans=0.0 2023-06-18 19:30:59,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.70 vs. limit=6.0 2023-06-18 19:31:03,091 INFO [train.py:996] (2/4) Epoch 2, batch 17200, loss[loss=0.3503, simple_loss=0.4435, pruned_loss=0.1286, over 19730.00 frames. ], tot_loss[loss=0.3139, simple_loss=0.3666, pruned_loss=0.1305, over 4283930.04 frames. ], batch size: 703, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:31:06,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=286170.0, ans=0.1 2023-06-18 19:31:20,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=286170.0, ans=0.0 2023-06-18 19:31:38,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=286230.0, ans=0.125 2023-06-18 19:31:54,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=286290.0, ans=0.125 2023-06-18 19:32:11,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=286350.0, ans=0.0 2023-06-18 19:32:50,092 INFO [train.py:996] (2/4) Epoch 2, batch 17250, loss[loss=0.3932, simple_loss=0.4326, pruned_loss=0.1768, over 21337.00 frames. ], tot_loss[loss=0.3176, simple_loss=0.3699, pruned_loss=0.1326, over 4286517.00 frames. ], batch size: 507, lr: 1.62e-02, grad_scale: 64.0 2023-06-18 19:32:53,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.381e+02 4.035e+02 5.054e+02 8.566e+02, threshold=8.070e+02, percent-clipped=0.0 2023-06-18 19:33:10,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=286530.0, ans=0.0 2023-06-18 19:34:33,915 INFO [train.py:996] (2/4) Epoch 2, batch 17300, loss[loss=0.3336, simple_loss=0.3971, pruned_loss=0.135, over 21783.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3777, pruned_loss=0.1364, over 4287906.29 frames. ], batch size: 124, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:34:36,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-18 19:35:07,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=286830.0, ans=0.125 2023-06-18 19:36:18,576 INFO [train.py:996] (2/4) Epoch 2, batch 17350, loss[loss=0.3066, simple_loss=0.3981, pruned_loss=0.1076, over 21261.00 frames. ], tot_loss[loss=0.3253, simple_loss=0.3787, pruned_loss=0.136, over 4283854.21 frames. 
], batch size: 548, lr: 1.62e-02, grad_scale: 32.0 2023-06-18 19:36:23,406 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.030e+02 3.256e+02 3.949e+02 5.005e+02 1.104e+03, threshold=7.898e+02, percent-clipped=4.0 2023-06-18 19:36:23,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=287070.0, ans=0.125 2023-06-18 19:36:36,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=287130.0, ans=0.125 2023-06-18 19:37:27,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=287250.0, ans=0.125 2023-06-18 19:37:50,494 INFO [train.py:996] (2/4) Epoch 2, batch 17400, loss[loss=0.2548, simple_loss=0.3158, pruned_loss=0.09685, over 21715.00 frames. ], tot_loss[loss=0.318, simple_loss=0.3739, pruned_loss=0.131, over 4265808.77 frames. ], batch size: 247, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:38:08,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=287430.0, ans=0.2 2023-06-18 19:38:15,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=287430.0, ans=0.125 2023-06-18 19:39:02,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=287550.0, ans=0.0 2023-06-18 19:39:07,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287610.0, ans=0.1 2023-06-18 19:39:24,238 INFO [train.py:996] (2/4) Epoch 2, batch 17450, loss[loss=0.2541, simple_loss=0.3073, pruned_loss=0.1004, over 21105.00 frames. ], tot_loss[loss=0.3104, simple_loss=0.3684, pruned_loss=0.1262, over 4259622.22 frames. ], batch size: 143, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:39:24,695 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:39:28,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.947e+02 3.682e+02 4.725e+02 7.588e+02, threshold=7.364e+02, percent-clipped=0.0 2023-06-18 19:39:32,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=287670.0, ans=0.125 2023-06-18 19:40:07,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=287790.0, ans=0.2 2023-06-18 19:40:43,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=287910.0, ans=0.0 2023-06-18 19:40:54,709 INFO [train.py:996] (2/4) Epoch 2, batch 17500, loss[loss=0.2896, simple_loss=0.3458, pruned_loss=0.1167, over 21893.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3643, pruned_loss=0.1232, over 4258932.46 frames. 
], batch size: 351, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:40:58,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=287970.0, ans=0.0 2023-06-18 19:41:06,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=287970.0, ans=0.1 2023-06-18 19:41:18,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=288030.0, ans=0.1 2023-06-18 19:42:08,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=288150.0, ans=0.0 2023-06-18 19:42:12,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=288210.0, ans=0.125 2023-06-18 19:42:25,492 INFO [train.py:996] (2/4) Epoch 2, batch 17550, loss[loss=0.3044, simple_loss=0.3745, pruned_loss=0.1172, over 21804.00 frames. ], tot_loss[loss=0.3068, simple_loss=0.3661, pruned_loss=0.1237, over 4258357.86 frames. ], batch size: 371, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:42:30,242 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 3.260e+02 4.297e+02 5.739e+02 1.320e+03, threshold=8.594e+02, percent-clipped=8.0 2023-06-18 19:43:42,685 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:44:00,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=288570.0, ans=0.125 2023-06-18 19:44:01,362 INFO [train.py:996] (2/4) Epoch 2, batch 17600, loss[loss=0.327, simple_loss=0.3734, pruned_loss=0.1403, over 21810.00 frames. ], tot_loss[loss=0.3058, simple_loss=0.3664, pruned_loss=0.1226, over 4261461.63 frames. ], batch size: 247, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:44:10,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=288570.0, ans=0.125 2023-06-18 19:44:50,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=288690.0, ans=0.1 2023-06-18 19:45:39,885 INFO [train.py:996] (2/4) Epoch 2, batch 17650, loss[loss=0.2371, simple_loss=0.2879, pruned_loss=0.09312, over 21593.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.3638, pruned_loss=0.1226, over 4260785.32 frames. 
], batch size: 230, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:45:44,496 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.915e+02 4.145e+02 5.677e+02 8.803e+02, threshold=8.289e+02, percent-clipped=2.0 2023-06-18 19:45:48,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=288870.0, ans=15.0 2023-06-18 19:45:51,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=288870.0, ans=0.0 2023-06-18 19:46:23,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=288990.0, ans=0.125 2023-06-18 19:46:27,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=288990.0, ans=0.125 2023-06-18 19:46:38,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=288990.0, ans=0.125 2023-06-18 19:46:57,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=289050.0, ans=0.0 2023-06-18 19:47:17,261 INFO [train.py:996] (2/4) Epoch 2, batch 17700, loss[loss=0.3237, simple_loss=0.3896, pruned_loss=0.1289, over 21337.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3551, pruned_loss=0.1174, over 4266727.39 frames. ], batch size: 549, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:47:17,815 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:47:28,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=289170.0, ans=0.2 2023-06-18 19:47:30,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-18 19:48:03,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.92 vs. limit=15.0 2023-06-18 19:48:55,991 INFO [train.py:996] (2/4) Epoch 2, batch 17750, loss[loss=0.307, simple_loss=0.3528, pruned_loss=0.1306, over 20017.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.3643, pruned_loss=0.123, over 4258594.60 frames. 
], batch size: 702, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:49:10,432 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.081e+02 4.126e+02 5.116e+02 1.162e+03, threshold=8.253e+02, percent-clipped=4.0 2023-06-18 19:49:30,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=289530.0, ans=0.125 2023-06-18 19:49:44,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=289590.0, ans=0.0 2023-06-18 19:49:52,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=289590.0, ans=0.1 2023-06-18 19:50:14,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=289710.0, ans=0.125 2023-06-18 19:50:25,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=289710.0, ans=0.2 2023-06-18 19:50:40,990 INFO [train.py:996] (2/4) Epoch 2, batch 17800, loss[loss=0.2301, simple_loss=0.2996, pruned_loss=0.0803, over 21415.00 frames. ], tot_loss[loss=0.3059, simple_loss=0.3658, pruned_loss=0.123, over 4263805.87 frames. ], batch size: 211, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:51:03,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=289770.0, ans=0.09899494936611666 2023-06-18 19:51:31,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=289890.0, ans=0.125 2023-06-18 19:51:34,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=289890.0, ans=0.125 2023-06-18 19:51:38,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=289950.0, ans=0.125 2023-06-18 19:51:49,457 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 19:52:03,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-18 19:52:28,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-18 19:52:29,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=290070.0, ans=0.125 2023-06-18 19:52:30,859 INFO [train.py:996] (2/4) Epoch 2, batch 17850, loss[loss=0.3057, simple_loss=0.3688, pruned_loss=0.1213, over 20707.00 frames. ], tot_loss[loss=0.3067, simple_loss=0.3664, pruned_loss=0.1235, over 4257563.88 frames. 
], batch size: 607, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:52:32,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=290070.0, ans=0.125 2023-06-18 19:52:34,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=290070.0, ans=0.0 2023-06-18 19:52:35,410 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.113e+02 3.565e+02 4.214e+02 1.146e+03, threshold=7.130e+02, percent-clipped=1.0 2023-06-18 19:54:09,667 INFO [train.py:996] (2/4) Epoch 2, batch 17900, loss[loss=0.3255, simple_loss=0.4009, pruned_loss=0.125, over 21743.00 frames. ], tot_loss[loss=0.3135, simple_loss=0.3735, pruned_loss=0.1268, over 4263941.10 frames. ], batch size: 332, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:54:11,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=290370.0, ans=0.1 2023-06-18 19:55:02,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=290490.0, ans=0.2 2023-06-18 19:55:21,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=290550.0, ans=0.2 2023-06-18 19:55:45,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=290610.0, ans=0.0 2023-06-18 19:55:47,649 INFO [train.py:996] (2/4) Epoch 2, batch 17950, loss[loss=0.3119, simple_loss=0.3799, pruned_loss=0.1219, over 21467.00 frames. ], tot_loss[loss=0.3078, simple_loss=0.3715, pruned_loss=0.122, over 4258418.87 frames. ], batch size: 507, lr: 1.61e-02, grad_scale: 32.0 2023-06-18 19:55:51,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=290670.0, ans=0.2 2023-06-18 19:55:51,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=290670.0, ans=0.2 2023-06-18 19:55:52,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.329e+02 4.288e+02 5.696e+02 7.966e+02, threshold=8.576e+02, percent-clipped=5.0 2023-06-18 19:56:03,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=290730.0, ans=0.125 2023-06-18 19:56:11,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-18 19:56:23,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=290790.0, ans=0.125 2023-06-18 19:57:04,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=290850.0, ans=0.1 2023-06-18 19:57:22,923 INFO [train.py:996] (2/4) Epoch 2, batch 18000, loss[loss=0.2443, simple_loss=0.301, pruned_loss=0.09378, over 21782.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3646, pruned_loss=0.1207, over 4266621.26 frames. ], batch size: 317, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 19:57:22,924 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 19:57:38,937 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2951, simple_loss=0.3927, pruned_loss=0.09871, over 1796401.00 frames. 
2023-06-18 19:57:38,937 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 19:58:22,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-18 19:59:15,547 INFO [train.py:996] (2/4) Epoch 2, batch 18050, loss[loss=0.2773, simple_loss=0.33, pruned_loss=0.1123, over 21369.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.3584, pruned_loss=0.1197, over 4268333.60 frames. ], batch size: 211, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 19:59:24,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.442e+02 4.328e+02 5.219e+02 8.565e+02, threshold=8.656e+02, percent-clipped=0.0 2023-06-18 19:59:31,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=291270.0, ans=0.0 2023-06-18 19:59:36,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=291330.0, ans=0.125 2023-06-18 20:00:19,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. limit=10.0 2023-06-18 20:00:28,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=291450.0, ans=0.2 2023-06-18 20:00:43,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=291510.0, ans=0.0 2023-06-18 20:00:54,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=291510.0, ans=0.0 2023-06-18 20:00:57,639 INFO [train.py:996] (2/4) Epoch 2, batch 18100, loss[loss=0.3497, simple_loss=0.415, pruned_loss=0.1422, over 21496.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3647, pruned_loss=0.1242, over 4268614.43 frames. ], batch size: 471, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:01:13,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-18 20:02:34,132 INFO [train.py:996] (2/4) Epoch 2, batch 18150, loss[loss=0.3025, simple_loss=0.3591, pruned_loss=0.123, over 21662.00 frames. ], tot_loss[loss=0.3053, simple_loss=0.3645, pruned_loss=0.123, over 4270511.41 frames. ], batch size: 332, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:02:38,923 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.253e+02 3.964e+02 4.730e+02 7.645e+02, threshold=7.929e+02, percent-clipped=0.0 2023-06-18 20:02:52,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.61 vs. limit=22.5 2023-06-18 20:03:59,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=292110.0, ans=0.125 2023-06-18 20:04:05,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=292110.0, ans=0.1 2023-06-18 20:04:09,555 INFO [train.py:996] (2/4) Epoch 2, batch 18200, loss[loss=0.2526, simple_loss=0.3152, pruned_loss=0.09497, over 21853.00 frames. ], tot_loss[loss=0.3007, simple_loss=0.358, pruned_loss=0.1217, over 4278598.90 frames. 
], batch size: 102, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:04:20,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=292170.0, ans=0.1 2023-06-18 20:05:39,496 INFO [train.py:996] (2/4) Epoch 2, batch 18250, loss[loss=0.2131, simple_loss=0.2733, pruned_loss=0.07643, over 21857.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3497, pruned_loss=0.118, over 4276238.01 frames. ], batch size: 107, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:05:44,124 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.858e+02 2.938e+02 3.797e+02 4.811e+02 7.621e+02, threshold=7.594e+02, percent-clipped=0.0 2023-06-18 20:06:00,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.58 vs. limit=15.0 2023-06-18 20:06:50,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=292650.0, ans=0.0 2023-06-18 20:06:59,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=292710.0, ans=0.125 2023-06-18 20:07:15,051 INFO [train.py:996] (2/4) Epoch 2, batch 18300, loss[loss=0.2737, simple_loss=0.3286, pruned_loss=0.1094, over 21920.00 frames. ], tot_loss[loss=0.2909, simple_loss=0.3487, pruned_loss=0.1165, over 4271096.67 frames. ], batch size: 124, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:08:01,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=292890.0, ans=0.1 2023-06-18 20:08:49,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=293070.0, ans=0.125 2023-06-18 20:08:50,533 INFO [train.py:996] (2/4) Epoch 2, batch 18350, loss[loss=0.2936, simple_loss=0.3417, pruned_loss=0.1227, over 21297.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.3547, pruned_loss=0.1178, over 4272782.68 frames. ], batch size: 144, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:08:55,236 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.344e+02 4.081e+02 5.632e+02 1.157e+03, threshold=8.162e+02, percent-clipped=13.0 2023-06-18 20:10:27,242 INFO [train.py:996] (2/4) Epoch 2, batch 18400, loss[loss=0.2729, simple_loss=0.3467, pruned_loss=0.09952, over 21497.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3505, pruned_loss=0.1148, over 4267930.79 frames. ], batch size: 473, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:11:35,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=293550.0, ans=0.1 2023-06-18 20:11:46,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-18 20:11:47,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=293610.0, ans=0.0 2023-06-18 20:12:08,226 INFO [train.py:996] (2/4) Epoch 2, batch 18450, loss[loss=0.2943, simple_loss=0.342, pruned_loss=0.1234, over 21603.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3469, pruned_loss=0.1103, over 4247662.86 frames. 
], batch size: 415, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:12:12,711 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.065e+02 2.811e+02 3.741e+02 4.962e+02 8.715e+02, threshold=7.483e+02, percent-clipped=1.0 2023-06-18 20:12:49,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=293790.0, ans=0.125 2023-06-18 20:13:12,581 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.18 vs. limit=6.0 2023-06-18 20:13:45,351 INFO [train.py:996] (2/4) Epoch 2, batch 18500, loss[loss=0.2978, simple_loss=0.3313, pruned_loss=0.1321, over 21375.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3402, pruned_loss=0.1094, over 4257325.15 frames. ], batch size: 473, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:13:47,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=293970.0, ans=0.125 2023-06-18 20:14:19,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=294030.0, ans=0.95 2023-06-18 20:14:23,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=294030.0, ans=0.0 2023-06-18 20:14:56,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-18 20:15:01,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=294210.0, ans=0.1 2023-06-18 20:15:05,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=294210.0, ans=0.1 2023-06-18 20:15:17,116 INFO [train.py:996] (2/4) Epoch 2, batch 18550, loss[loss=0.2271, simple_loss=0.2965, pruned_loss=0.07889, over 21507.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3383, pruned_loss=0.1082, over 4252813.01 frames. ], batch size: 230, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:15:26,538 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.027e+02 3.740e+02 5.224e+02 1.354e+03, threshold=7.479e+02, percent-clipped=4.0 2023-06-18 20:15:42,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=294330.0, ans=0.5 2023-06-18 20:16:58,792 INFO [train.py:996] (2/4) Epoch 2, batch 18600, loss[loss=0.2565, simple_loss=0.3102, pruned_loss=0.1014, over 21255.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3364, pruned_loss=0.1084, over 4254470.51 frames. 
], batch size: 176, lr: 1.60e-02, grad_scale: 32.0 2023-06-18 20:17:00,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=294570.0, ans=0.125 2023-06-18 20:17:28,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=294630.0, ans=0.0 2023-06-18 20:17:28,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=294630.0, ans=0.125 2023-06-18 20:17:39,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=294690.0, ans=0.2 2023-06-18 20:17:50,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=294690.0, ans=0.04949747468305833 2023-06-18 20:18:15,438 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:18:35,047 INFO [train.py:996] (2/4) Epoch 2, batch 18650, loss[loss=0.2603, simple_loss=0.3158, pruned_loss=0.1024, over 21976.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3365, pruned_loss=0.1094, over 4260999.31 frames. ], batch size: 113, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:18:40,797 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.101e+02 3.636e+02 4.483e+02 8.727e+02, threshold=7.272e+02, percent-clipped=3.0 2023-06-18 20:19:17,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=294990.0, ans=0.125 2023-06-18 20:19:26,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=294990.0, ans=0.0 2023-06-18 20:19:40,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=295050.0, ans=0.125 2023-06-18 20:19:54,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=295110.0, ans=0.125 2023-06-18 20:20:06,194 INFO [train.py:996] (2/4) Epoch 2, batch 18700, loss[loss=0.3638, simple_loss=0.3882, pruned_loss=0.1697, over 21748.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3356, pruned_loss=0.1121, over 4267073.37 frames. ], batch size: 441, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:21:02,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=295290.0, ans=0.0 2023-06-18 20:21:31,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=295410.0, ans=0.1 2023-06-18 20:21:41,692 INFO [train.py:996] (2/4) Epoch 2, batch 18750, loss[loss=0.3297, simple_loss=0.3853, pruned_loss=0.1371, over 21632.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3372, pruned_loss=0.1146, over 4265193.68 frames. ], batch size: 230, lr: 1.59e-02, grad_scale: 16.0 2023-06-18 20:21:48,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.45 vs. 
limit=15.0 2023-06-18 20:21:52,351 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.155e+02 3.822e+02 4.527e+02 9.030e+02, threshold=7.645e+02, percent-clipped=1.0 2023-06-18 20:22:21,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.07 vs. limit=15.0 2023-06-18 20:23:17,284 INFO [train.py:996] (2/4) Epoch 2, batch 18800, loss[loss=0.2211, simple_loss=0.3002, pruned_loss=0.07096, over 21445.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3428, pruned_loss=0.1149, over 4257420.45 frames. ], batch size: 211, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:24:37,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=296010.0, ans=0.125 2023-06-18 20:24:47,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=296010.0, ans=0.0 2023-06-18 20:24:57,831 INFO [train.py:996] (2/4) Epoch 2, batch 18850, loss[loss=0.1984, simple_loss=0.2714, pruned_loss=0.06272, over 21179.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3364, pruned_loss=0.1081, over 4262285.69 frames. ], batch size: 143, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:25:03,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-18 20:25:03,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.896e+02 3.612e+02 4.924e+02 1.009e+03, threshold=7.223e+02, percent-clipped=2.0 2023-06-18 20:26:09,682 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-18 20:26:19,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=296310.0, ans=0.125 2023-06-18 20:26:26,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=296310.0, ans=0.125 2023-06-18 20:26:30,513 INFO [train.py:996] (2/4) Epoch 2, batch 18900, loss[loss=0.2761, simple_loss=0.3164, pruned_loss=0.1179, over 21738.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.335, pruned_loss=0.1093, over 4249266.58 frames. ], batch size: 282, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:28:00,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=296610.0, ans=0.125 2023-06-18 20:28:04,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=296610.0, ans=0.2 2023-06-18 20:28:07,320 INFO [train.py:996] (2/4) Epoch 2, batch 18950, loss[loss=0.2711, simple_loss=0.3105, pruned_loss=0.1158, over 20285.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3378, pruned_loss=0.1136, over 4253284.87 frames. 
], batch size: 703, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:28:18,298 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.002e+02 3.745e+02 4.553e+02 7.623e+02, threshold=7.489e+02, percent-clipped=1.0 2023-06-18 20:28:18,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=296670.0, ans=0.125 2023-06-18 20:28:20,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=296670.0, ans=0.125 2023-06-18 20:28:23,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=15.0 2023-06-18 20:28:34,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=296730.0, ans=0.09899494936611666 2023-06-18 20:29:06,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=296790.0, ans=0.035 2023-06-18 20:29:42,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=296910.0, ans=0.0 2023-06-18 20:29:44,731 INFO [train.py:996] (2/4) Epoch 2, batch 19000, loss[loss=0.3545, simple_loss=0.4059, pruned_loss=0.1515, over 21407.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3498, pruned_loss=0.1172, over 4252616.95 frames. ], batch size: 471, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:30:49,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=297090.0, ans=0.0 2023-06-18 20:31:28,473 INFO [train.py:996] (2/4) Epoch 2, batch 19050, loss[loss=0.2939, simple_loss=0.3408, pruned_loss=0.1235, over 21798.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3556, pruned_loss=0.1226, over 4263599.43 frames. ], batch size: 282, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:31:34,527 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.373e+02 3.128e+02 4.146e+02 5.526e+02 1.033e+03, threshold=8.291e+02, percent-clipped=8.0 2023-06-18 20:31:43,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=297330.0, ans=0.05 2023-06-18 20:32:04,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-18 20:32:05,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=297330.0, ans=0.0 2023-06-18 20:32:39,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=297450.0, ans=0.09899494936611666 2023-06-18 20:33:05,258 INFO [train.py:996] (2/4) Epoch 2, batch 19100, loss[loss=0.2187, simple_loss=0.2851, pruned_loss=0.0761, over 21393.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3534, pruned_loss=0.1231, over 4267824.20 frames. 
], batch size: 131, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:33:19,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=297630.0, ans=0.0 2023-06-18 20:33:42,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=297630.0, ans=0.0 2023-06-18 20:34:17,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=297750.0, ans=0.015 2023-06-18 20:34:44,315 INFO [train.py:996] (2/4) Epoch 2, batch 19150, loss[loss=0.3339, simple_loss=0.411, pruned_loss=0.1285, over 21700.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3541, pruned_loss=0.1233, over 4273520.50 frames. ], batch size: 351, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:34:51,111 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.456e+02 4.497e+02 5.703e+02 1.044e+03, threshold=8.993e+02, percent-clipped=4.0 2023-06-18 20:34:57,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=297870.0, ans=0.125 2023-06-18 20:35:35,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=297990.0, ans=0.125 2023-06-18 20:35:42,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=297990.0, ans=0.2 2023-06-18 20:35:56,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=298050.0, ans=0.125 2023-06-18 20:36:08,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=16.16 vs. limit=22.5 2023-06-18 20:36:09,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=298110.0, ans=0.2 2023-06-18 20:36:27,163 INFO [train.py:996] (2/4) Epoch 2, batch 19200, loss[loss=0.3376, simple_loss=0.4182, pruned_loss=0.1285, over 21769.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3645, pruned_loss=0.1237, over 4279856.63 frames. ], batch size: 351, lr: 1.59e-02, grad_scale: 32.0 2023-06-18 20:36:53,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=298230.0, ans=0.0 2023-06-18 20:37:04,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=298230.0, ans=0.0 2023-06-18 20:37:14,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=298290.0, ans=0.125 2023-06-18 20:37:19,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-18 20:37:22,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.25 vs. 
limit=15.0 2023-06-18 20:37:34,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=298350.0, ans=0.02 2023-06-18 20:37:43,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=298410.0, ans=0.125 2023-06-18 20:37:48,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=298410.0, ans=0.1 2023-06-18 20:37:58,667 INFO [train.py:996] (2/4) Epoch 2, batch 19250, loss[loss=0.2271, simple_loss=0.3182, pruned_loss=0.06801, over 21785.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3615, pruned_loss=0.1164, over 4283034.04 frames. ], batch size: 332, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:38:00,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=298470.0, ans=0.1 2023-06-18 20:38:09,019 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.661e+02 2.933e+02 3.534e+02 4.386e+02 8.060e+02, threshold=7.069e+02, percent-clipped=0.0 2023-06-18 20:38:28,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=298530.0, ans=0.0 2023-06-18 20:38:30,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-18 20:38:40,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-06-18 20:39:34,514 INFO [train.py:996] (2/4) Epoch 2, batch 19300, loss[loss=0.3166, simple_loss=0.378, pruned_loss=0.1276, over 21531.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3587, pruned_loss=0.1163, over 4289614.35 frames. ], batch size: 471, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:41:17,385 INFO [train.py:996] (2/4) Epoch 2, batch 19350, loss[loss=0.2793, simple_loss=0.3677, pruned_loss=0.09544, over 21205.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3525, pruned_loss=0.1109, over 4291259.56 frames. ], batch size: 548, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:41:25,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=299070.0, ans=0.125 2023-06-18 20:41:27,671 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:41:28,560 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 3.011e+02 3.707e+02 4.151e+02 9.500e+02, threshold=7.414e+02, percent-clipped=4.0 2023-06-18 20:42:33,476 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.35 vs. limit=15.0 2023-06-18 20:42:54,499 INFO [train.py:996] (2/4) Epoch 2, batch 19400, loss[loss=0.4257, simple_loss=0.4321, pruned_loss=0.2096, over 21720.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3528, pruned_loss=0.1119, over 4292880.20 frames. 
], batch size: 508, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:43:21,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=299430.0, ans=0.0 2023-06-18 20:43:52,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=299550.0, ans=0.1 2023-06-18 20:44:13,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=299610.0, ans=0.0 2023-06-18 20:44:25,373 INFO [train.py:996] (2/4) Epoch 2, batch 19450, loss[loss=0.2641, simple_loss=0.3164, pruned_loss=0.1059, over 20158.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3506, pruned_loss=0.1148, over 4297407.08 frames. ], batch size: 703, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:44:36,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.184e+02 3.692e+02 4.694e+02 9.525e+02, threshold=7.383e+02, percent-clipped=2.0 2023-06-18 20:44:55,975 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=15.0 2023-06-18 20:46:07,386 INFO [train.py:996] (2/4) Epoch 2, batch 19500, loss[loss=0.2729, simple_loss=0.3006, pruned_loss=0.1227, over 21990.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3452, pruned_loss=0.1159, over 4295229.53 frames. ], batch size: 103, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:46:09,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=299970.0, ans=0.05 2023-06-18 20:46:51,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=300090.0, ans=0.125 2023-06-18 20:47:03,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=300150.0, ans=0.125 2023-06-18 20:47:45,719 INFO [train.py:996] (2/4) Epoch 2, batch 19550, loss[loss=0.1982, simple_loss=0.2656, pruned_loss=0.06544, over 21335.00 frames. ], tot_loss[loss=0.2829, simple_loss=0.3397, pruned_loss=0.113, over 4282026.55 frames. ], batch size: 176, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:47:47,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=300270.0, ans=0.125 2023-06-18 20:47:51,804 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 3.032e+02 3.762e+02 4.813e+02 9.306e+02, threshold=7.523e+02, percent-clipped=3.0 2023-06-18 20:48:37,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=300450.0, ans=15.0 2023-06-18 20:48:48,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300450.0, ans=0.1 2023-06-18 20:48:55,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=300510.0, ans=6.0 2023-06-18 20:49:17,050 INFO [train.py:996] (2/4) Epoch 2, batch 19600, loss[loss=0.2912, simple_loss=0.3406, pruned_loss=0.1209, over 21641.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3426, pruned_loss=0.114, over 4287027.65 frames. 
], batch size: 263, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:49:20,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=300570.0, ans=0.1 2023-06-18 20:49:42,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=300630.0, ans=0.125 2023-06-18 20:50:32,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=300810.0, ans=10.0 2023-06-18 20:50:40,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=300810.0, ans=0.125 2023-06-18 20:50:42,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=300810.0, ans=0.2 2023-06-18 20:50:49,885 INFO [train.py:996] (2/4) Epoch 2, batch 19650, loss[loss=0.2661, simple_loss=0.3537, pruned_loss=0.08927, over 19949.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.3502, pruned_loss=0.1203, over 4283282.94 frames. ], batch size: 704, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:50:55,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.998e+02 3.349e+02 4.086e+02 5.431e+02 7.953e+02, threshold=8.171e+02, percent-clipped=1.0 2023-06-18 20:51:22,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=300930.0, ans=0.0 2023-06-18 20:52:15,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=301110.0, ans=0.0 2023-06-18 20:52:20,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-18 20:52:24,564 INFO [train.py:996] (2/4) Epoch 2, batch 19700, loss[loss=0.2887, simple_loss=0.3696, pruned_loss=0.1039, over 21649.00 frames. ], tot_loss[loss=0.299, simple_loss=0.3549, pruned_loss=0.1215, over 4288253.72 frames. ], batch size: 414, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:52:25,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=301170.0, ans=0.125 2023-06-18 20:52:51,829 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 20:52:56,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=301230.0, ans=0.125 2023-06-18 20:53:51,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=301410.0, ans=0.125 2023-06-18 20:54:03,051 INFO [train.py:996] (2/4) Epoch 2, batch 19750, loss[loss=0.2839, simple_loss=0.3561, pruned_loss=0.1059, over 21433.00 frames. ], tot_loss[loss=0.3051, simple_loss=0.3636, pruned_loss=0.1233, over 4276624.45 frames. 
], batch size: 194, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:54:09,463 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.167e+02 3.934e+02 5.557e+02 1.096e+03, threshold=7.868e+02, percent-clipped=5.0 2023-06-18 20:54:39,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=301530.0, ans=0.1 2023-06-18 20:55:02,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=301590.0, ans=0.0 2023-06-18 20:55:04,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=12.0 2023-06-18 20:55:40,444 INFO [train.py:996] (2/4) Epoch 2, batch 19800, loss[loss=0.2715, simple_loss=0.3103, pruned_loss=0.1163, over 21487.00 frames. ], tot_loss[loss=0.307, simple_loss=0.3642, pruned_loss=0.125, over 4279858.73 frames. ], batch size: 131, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:56:12,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=301830.0, ans=0.0 2023-06-18 20:56:22,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-18 20:56:31,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=301890.0, ans=0.125 2023-06-18 20:57:01,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=302010.0, ans=0.2 2023-06-18 20:57:22,428 INFO [train.py:996] (2/4) Epoch 2, batch 19850, loss[loss=0.3026, simple_loss=0.3576, pruned_loss=0.1238, over 19967.00 frames. ], tot_loss[loss=0.2938, simple_loss=0.3541, pruned_loss=0.1168, over 4275133.38 frames. ], batch size: 702, lr: 1.58e-02, grad_scale: 32.0 2023-06-18 20:57:28,510 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.803e+02 3.716e+02 4.795e+02 8.783e+02, threshold=7.432e+02, percent-clipped=5.0 2023-06-18 20:57:35,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=302070.0, ans=15.0 2023-06-18 20:57:49,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=302130.0, ans=0.025 2023-06-18 20:57:49,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=302130.0, ans=0.1 2023-06-18 20:58:09,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=302190.0, ans=0.125 2023-06-18 20:58:54,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-18 20:59:00,046 INFO [train.py:996] (2/4) Epoch 2, batch 19900, loss[loss=0.2463, simple_loss=0.3006, pruned_loss=0.09601, over 21195.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3536, pruned_loss=0.1134, over 4272690.97 frames. ], batch size: 176, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 20:59:19,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. 
limit=15.0 2023-06-18 21:00:00,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=302550.0, ans=0.125 2023-06-18 21:00:31,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=302610.0, ans=0.125 2023-06-18 21:00:35,687 INFO [train.py:996] (2/4) Epoch 2, batch 19950, loss[loss=0.2431, simple_loss=0.2913, pruned_loss=0.0975, over 21557.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3474, pruned_loss=0.1129, over 4268332.32 frames. ], batch size: 247, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:00:41,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=302670.0, ans=0.0 2023-06-18 21:00:46,809 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.768e+02 3.439e+02 5.262e+02 1.066e+03, threshold=6.877e+02, percent-clipped=5.0 2023-06-18 21:00:49,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=22.5 2023-06-18 21:00:52,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=302670.0, ans=0.125 2023-06-18 21:01:10,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=302730.0, ans=0.125 2023-06-18 21:01:32,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.38 vs. limit=10.0 2023-06-18 21:01:41,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.61 vs. limit=15.0 2023-06-18 21:02:16,332 INFO [train.py:996] (2/4) Epoch 2, batch 20000, loss[loss=0.3104, simple_loss=0.3542, pruned_loss=0.1333, over 21522.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3488, pruned_loss=0.1137, over 4267463.04 frames. ], batch size: 194, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:03:20,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=303150.0, ans=0.0 2023-06-18 21:03:46,357 INFO [train.py:996] (2/4) Epoch 2, batch 20050, loss[loss=0.2741, simple_loss=0.3278, pruned_loss=0.1102, over 21477.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.352, pruned_loss=0.1174, over 4279564.13 frames. ], batch size: 194, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:03:56,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.190e+02 3.752e+02 4.909e+02 8.771e+02, threshold=7.503e+02, percent-clipped=6.0 2023-06-18 21:04:07,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.74 vs. 
limit=6.0 2023-06-18 21:04:17,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=303330.0, ans=0.125 2023-06-18 21:04:27,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=303330.0, ans=0.125 2023-06-18 21:04:27,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=303330.0, ans=0.125 2023-06-18 21:04:48,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-18 21:05:31,075 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:05:33,704 INFO [train.py:996] (2/4) Epoch 2, batch 20100, loss[loss=0.3866, simple_loss=0.4702, pruned_loss=0.1515, over 21301.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3549, pruned_loss=0.1195, over 4281556.76 frames. ], batch size: 548, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:05:53,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=303630.0, ans=0.2 2023-06-18 21:06:04,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=303630.0, ans=0.125 2023-06-18 21:06:13,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=303690.0, ans=0.125 2023-06-18 21:06:15,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=303690.0, ans=0.09899494936611666 2023-06-18 21:06:29,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=303750.0, ans=0.125 2023-06-18 21:07:04,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.49 vs. limit=6.0 2023-06-18 21:07:17,563 INFO [train.py:996] (2/4) Epoch 2, batch 20150, loss[loss=0.3285, simple_loss=0.3819, pruned_loss=0.1375, over 21384.00 frames. ], tot_loss[loss=0.3084, simple_loss=0.3663, pruned_loss=0.1252, over 4277463.51 frames. ], batch size: 131, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:07:24,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.342e+02 4.169e+02 5.156e+02 8.825e+02, threshold=8.338e+02, percent-clipped=3.0 2023-06-18 21:07:33,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=303930.0, ans=0.2 2023-06-18 21:07:38,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=303930.0, ans=0.125 2023-06-18 21:07:57,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=303990.0, ans=0.2 2023-06-18 21:08:51,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=304110.0, ans=0.0 2023-06-18 21:08:58,629 INFO [train.py:996] (2/4) Epoch 2, batch 20200, loss[loss=0.3344, simple_loss=0.3939, pruned_loss=0.1374, over 21369.00 frames. 
], tot_loss[loss=0.3154, simple_loss=0.3739, pruned_loss=0.1285, over 4278993.27 frames. ], batch size: 131, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:09:04,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-18 21:09:26,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.44 vs. limit=15.0 2023-06-18 21:09:38,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=304290.0, ans=0.125 2023-06-18 21:09:38,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=304290.0, ans=0.0 2023-06-18 21:10:20,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-18 21:10:35,657 INFO [train.py:996] (2/4) Epoch 2, batch 20250, loss[loss=0.2849, simple_loss=0.3349, pruned_loss=0.1174, over 21305.00 frames. ], tot_loss[loss=0.3142, simple_loss=0.3745, pruned_loss=0.127, over 4284284.23 frames. ], batch size: 159, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:10:41,730 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 3.323e+02 4.182e+02 5.137e+02 1.003e+03, threshold=8.365e+02, percent-clipped=2.0 2023-06-18 21:10:49,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=304470.0, ans=0.125 2023-06-18 21:11:06,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=304530.0, ans=0.025 2023-06-18 21:11:51,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=304650.0, ans=0.0 2023-06-18 21:11:52,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=304650.0, ans=0.0 2023-06-18 21:11:55,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=304650.0, ans=0.0 2023-06-18 21:11:56,221 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-18 21:12:04,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-18 21:12:13,290 INFO [train.py:996] (2/4) Epoch 2, batch 20300, loss[loss=0.2855, simple_loss=0.365, pruned_loss=0.103, over 21745.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.3692, pruned_loss=0.1219, over 4274546.03 frames. 
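The "Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." lines from optim.py summarize how gradients are being clipped: the five numbers are quantiles of recent per-batch gradient norms, and in every such line above the threshold equals the clipping scale times the middle value (for the entry just above, 2.0 × 4.182e+02 ≈ 8.365e+02), with "percent-clipped" reporting how often that threshold was exceeded. The following is a rough, self-contained illustration of a median-tracking clipper; the class name, window size, and API are assumptions, not the optimizer's actual implementation.

from collections import deque
from statistics import median

# Hypothetical median-based gradient clipper, illustrating the threshold rule
# suggested by the "grad-norm quartiles ... threshold" lines above.
class MedianGradClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.recent_norms = deque(maxlen=window)   # recent per-batch gradient norms

    def clip_factor(self, grad_norm: float) -> float:
        self.recent_norms.append(grad_norm)
        threshold = self.clipping_scale * median(self.recent_norms)
        # Gradients would be multiplied by this factor (<= 1.0) before the step.
        return min(1.0, threshold / max(grad_norm, 1e-20))

clipper = MedianGradClipper()
for norm in [200.4, 332.3, 418.2, 513.7, 1003.0]:   # the quartiles logged above
    print(round(clipper.clip_factor(norm), 3))       # only the outlier gets scaled down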
], batch size: 414, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:12:33,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=304830.0, ans=0.125 2023-06-18 21:12:38,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=304830.0, ans=0.125 2023-06-18 21:12:41,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=304830.0, ans=0.125 2023-06-18 21:13:41,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=305010.0, ans=0.2 2023-06-18 21:13:43,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=305010.0, ans=0.2 2023-06-18 21:13:48,916 INFO [train.py:996] (2/4) Epoch 2, batch 20350, loss[loss=0.3159, simple_loss=0.3813, pruned_loss=0.1252, over 19830.00 frames. ], tot_loss[loss=0.3077, simple_loss=0.3699, pruned_loss=0.1227, over 4271696.36 frames. ], batch size: 704, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:13:55,476 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.188e+02 3.897e+02 4.973e+02 9.485e+02, threshold=7.794e+02, percent-clipped=2.0 2023-06-18 21:14:14,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=305130.0, ans=0.125 2023-06-18 21:14:22,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=305130.0, ans=0.125 2023-06-18 21:14:23,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=305190.0, ans=0.1 2023-06-18 21:14:39,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=305190.0, ans=0.125 2023-06-18 21:14:50,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=305250.0, ans=0.0 2023-06-18 21:15:26,544 INFO [train.py:996] (2/4) Epoch 2, batch 20400, loss[loss=0.3403, simple_loss=0.4005, pruned_loss=0.1401, over 21628.00 frames. ], tot_loss[loss=0.3132, simple_loss=0.374, pruned_loss=0.1262, over 4267259.98 frames. ], batch size: 389, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:15:54,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=305430.0, ans=0.125 2023-06-18 21:16:46,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-06-18 21:17:02,155 INFO [train.py:996] (2/4) Epoch 2, batch 20450, loss[loss=0.3259, simple_loss=0.367, pruned_loss=0.1425, over 21479.00 frames. ], tot_loss[loss=0.3178, simple_loss=0.375, pruned_loss=0.1303, over 4257934.50 frames. ], batch size: 194, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:17:07,778 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.641e+02 3.573e+02 4.608e+02 6.565e+02 1.538e+03, threshold=9.216e+02, percent-clipped=19.0 2023-06-18 21:18:24,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.75 vs. 
limit=15.0 2023-06-18 21:18:30,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=305910.0, ans=0.0 2023-06-18 21:18:37,633 INFO [train.py:996] (2/4) Epoch 2, batch 20500, loss[loss=0.2966, simple_loss=0.3415, pruned_loss=0.1258, over 21795.00 frames. ], tot_loss[loss=0.3151, simple_loss=0.3701, pruned_loss=0.1301, over 4257128.59 frames. ], batch size: 371, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:19:02,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=306030.0, ans=0.0 2023-06-18 21:19:48,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=306150.0, ans=0.125 2023-06-18 21:19:57,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=306210.0, ans=0.125 2023-06-18 21:20:19,090 INFO [train.py:996] (2/4) Epoch 2, batch 20550, loss[loss=0.2401, simple_loss=0.3063, pruned_loss=0.08689, over 15945.00 frames. ], tot_loss[loss=0.3089, simple_loss=0.3625, pruned_loss=0.1276, over 4252418.57 frames. ], batch size: 60, lr: 1.57e-02, grad_scale: 32.0 2023-06-18 21:20:25,448 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.461e+02 4.145e+02 5.402e+02 8.194e+02, threshold=8.291e+02, percent-clipped=0.0 2023-06-18 21:20:45,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306330.0, ans=0.1 2023-06-18 21:21:25,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=306450.0, ans=0.125 2023-06-18 21:21:27,470 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-06-18 21:21:28,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=306450.0, ans=0.125 2023-06-18 21:21:44,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0 2023-06-18 21:21:44,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=306510.0, ans=0.125 2023-06-18 21:21:56,398 INFO [train.py:996] (2/4) Epoch 2, batch 20600, loss[loss=0.2989, simple_loss=0.3525, pruned_loss=0.1226, over 21811.00 frames. ], tot_loss[loss=0.3064, simple_loss=0.3629, pruned_loss=0.1249, over 4248970.14 frames. ], batch size: 414, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:22:01,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=306570.0, ans=0.1 2023-06-18 21:22:28,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=306630.0, ans=0.125 2023-06-18 21:22:29,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. 
limit=6.0 2023-06-18 21:22:58,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=306750.0, ans=0.0 2023-06-18 21:23:32,700 INFO [train.py:996] (2/4) Epoch 2, batch 20650, loss[loss=0.2787, simple_loss=0.3221, pruned_loss=0.1177, over 21844.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3587, pruned_loss=0.1256, over 4254331.31 frames. ], batch size: 247, lr: 1.56e-02, grad_scale: 64.0 2023-06-18 21:23:33,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=306870.0, ans=0.0 2023-06-18 21:23:33,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=306870.0, ans=0.1 2023-06-18 21:23:37,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=306870.0, ans=0.0 2023-06-18 21:23:37,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=306870.0, ans=0.0 2023-06-18 21:23:38,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 3.072e+02 3.756e+02 5.105e+02 7.352e+02, threshold=7.512e+02, percent-clipped=0.0 2023-06-18 21:23:47,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=306930.0, ans=0.125 2023-06-18 21:23:50,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=22.5 2023-06-18 21:24:49,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=307050.0, ans=0.125 2023-06-18 21:25:12,215 INFO [train.py:996] (2/4) Epoch 2, batch 20700, loss[loss=0.2366, simple_loss=0.3076, pruned_loss=0.08279, over 21677.00 frames. ], tot_loss[loss=0.2954, simple_loss=0.351, pruned_loss=0.1199, over 4251987.45 frames. ], batch size: 298, lr: 1.56e-02, grad_scale: 64.0 2023-06-18 21:25:16,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=307170.0, ans=0.0 2023-06-18 21:25:22,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=307170.0, ans=0.0 2023-06-18 21:25:26,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307230.0, ans=0.1 2023-06-18 21:26:19,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=307350.0, ans=0.125 2023-06-18 21:26:37,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-18 21:26:49,461 INFO [train.py:996] (2/4) Epoch 2, batch 20750, loss[loss=0.3361, simple_loss=0.4009, pruned_loss=0.1356, over 21449.00 frames. ], tot_loss[loss=0.2941, simple_loss=0.3521, pruned_loss=0.1181, over 4253094.70 frames. 
], batch size: 194, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:26:57,493 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.977e+02 3.559e+02 4.590e+02 7.850e+02, threshold=7.118e+02, percent-clipped=2.0 2023-06-18 21:27:04,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=307470.0, ans=0.125 2023-06-18 21:27:07,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=307470.0, ans=0.125 2023-06-18 21:27:44,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-18 21:27:50,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=307650.0, ans=0.125 2023-06-18 21:27:53,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=307650.0, ans=0.0 2023-06-18 21:27:54,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307650.0, ans=0.1 2023-06-18 21:27:54,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=307650.0, ans=0.0 2023-06-18 21:28:11,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=307710.0, ans=0.1 2023-06-18 21:28:26,173 INFO [train.py:996] (2/4) Epoch 2, batch 20800, loss[loss=0.1705, simple_loss=0.2156, pruned_loss=0.0627, over 17982.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.357, pruned_loss=0.121, over 4241591.71 frames. ], batch size: 67, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:29:40,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=307950.0, ans=0.1 2023-06-18 21:30:02,865 INFO [train.py:996] (2/4) Epoch 2, batch 20850, loss[loss=0.2812, simple_loss=0.3372, pruned_loss=0.1126, over 21792.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.3497, pruned_loss=0.1187, over 4247393.65 frames. ], batch size: 282, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:30:15,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2023-06-18 21:30:16,769 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.417e+02 4.186e+02 5.469e+02 9.109e+02, threshold=8.373e+02, percent-clipped=11.0 2023-06-18 21:30:19,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=308070.0, ans=0.0 2023-06-18 21:30:45,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=308130.0, ans=0.0 2023-06-18 21:31:22,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=308310.0, ans=0.0 2023-06-18 21:31:37,919 INFO [train.py:996] (2/4) Epoch 2, batch 20900, loss[loss=0.2913, simple_loss=0.3488, pruned_loss=0.1169, over 21460.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.351, pruned_loss=0.1199, over 4256954.54 frames. 
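Each training entry above reports a per-utterance loss "over N frames" next to a "tot_loss ... over M frames" accumulated over several million frames, which behaves like a frame-weighted average over recent batches. A minimal sketch of such bookkeeping, with an assumed decay factor and invented names (the actual accumulation in train.py may differ), is:

# Hypothetical frame-weighted running average, in the spirit of the
# "tot_loss[...] over M frames" figures above.
class RunningLoss:
    def __init__(self, decay: float = 0.99):
        self.decay = decay        # down-weights older batches (assumed value)
        self.loss_sum = 0.0       # decayed sum of (loss * frames)
        self.frames = 0.0         # decayed sum of frames

    def update(self, loss: float, num_frames: float) -> float:
        self.loss_sum = self.decay * self.loss_sum + loss * num_frames
        self.frames = self.decay * self.frames + num_frames
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
print(tracker.update(0.2913, 21460))   # first batch: the average is just that batch
print(tracker.update(0.3159, 19830))   # later batches pull the average toward recent values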
], batch size: 195, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:31:59,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5 2023-06-18 21:32:28,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=308490.0, ans=0.125 2023-06-18 21:33:12,549 INFO [train.py:996] (2/4) Epoch 2, batch 20950, loss[loss=0.2296, simple_loss=0.2887, pruned_loss=0.08526, over 21404.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3463, pruned_loss=0.1149, over 4248544.20 frames. ], batch size: 131, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:33:21,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.884e+02 3.051e+02 3.719e+02 4.723e+02 9.435e+02, threshold=7.438e+02, percent-clipped=1.0 2023-06-18 21:33:28,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.80 vs. limit=15.0 2023-06-18 21:33:36,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=308730.0, ans=0.04949747468305833 2023-06-18 21:33:46,585 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 21:34:10,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=308790.0, ans=0.125 2023-06-18 21:34:33,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=308910.0, ans=0.0 2023-06-18 21:34:45,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=308910.0, ans=0.125 2023-06-18 21:34:48,331 INFO [train.py:996] (2/4) Epoch 2, batch 21000, loss[loss=0.1988, simple_loss=0.258, pruned_loss=0.06974, over 17690.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3437, pruned_loss=0.1141, over 4249665.90 frames. ], batch size: 68, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:34:48,332 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 21:35:04,496 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2933, simple_loss=0.3899, pruned_loss=0.09838, over 1796401.00 frames. 2023-06-18 21:35:04,496 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 21:35:21,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=308970.0, ans=15.0 2023-06-18 21:35:50,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=309090.0, ans=0.125 2023-06-18 21:35:53,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.37 vs. limit=12.0 2023-06-18 21:36:16,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=309150.0, ans=0.125 2023-06-18 21:36:16,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=309150.0, ans=0.2 2023-06-18 21:36:40,615 INFO [train.py:996] (2/4) Epoch 2, batch 21050, loss[loss=0.2962, simple_loss=0.3344, pruned_loss=0.129, over 21462.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3419, pruned_loss=0.1144, over 4246848.43 frames. 
], batch size: 441, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:36:55,159 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.843e+02 3.389e+02 4.157e+02 8.301e+02, threshold=6.779e+02, percent-clipped=3.0 2023-06-18 21:36:58,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=309270.0, ans=0.07 2023-06-18 21:36:59,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-18 21:37:10,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=309330.0, ans=0.125 2023-06-18 21:37:28,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=309390.0, ans=0.125 2023-06-18 21:37:30,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=309390.0, ans=0.05 2023-06-18 21:38:12,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=309510.0, ans=0.1 2023-06-18 21:38:16,154 INFO [train.py:996] (2/4) Epoch 2, batch 21100, loss[loss=0.3209, simple_loss=0.3633, pruned_loss=0.1392, over 21988.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3398, pruned_loss=0.1142, over 4228999.12 frames. ], batch size: 103, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:38:16,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=309570.0, ans=0.1 2023-06-18 21:39:51,666 INFO [train.py:996] (2/4) Epoch 2, batch 21150, loss[loss=0.2687, simple_loss=0.3157, pruned_loss=0.1109, over 21851.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3364, pruned_loss=0.1155, over 4241075.15 frames. ], batch size: 373, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:40:05,890 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.856e+02 3.188e+02 4.098e+02 8.101e+02, threshold=6.375e+02, percent-clipped=2.0 2023-06-18 21:40:27,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=309930.0, ans=0.0 2023-06-18 21:40:50,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=310050.0, ans=0.1 2023-06-18 21:41:00,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=19.38 vs. limit=22.5 2023-06-18 21:41:27,075 INFO [train.py:996] (2/4) Epoch 2, batch 21200, loss[loss=0.2528, simple_loss=0.3019, pruned_loss=0.1018, over 21316.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3316, pruned_loss=0.1146, over 4239971.99 frames. 
], batch size: 131, lr: 1.56e-02, grad_scale: 32.0 2023-06-18 21:41:30,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=310170.0, ans=0.0 2023-06-18 21:41:30,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=310170.0, ans=0.125 2023-06-18 21:42:02,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=310230.0, ans=0.125 2023-06-18 21:42:04,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.17 vs. limit=15.0 2023-06-18 21:42:09,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=22.5 2023-06-18 21:42:41,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=310410.0, ans=0.125 2023-06-18 21:42:43,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-18 21:43:02,671 INFO [train.py:996] (2/4) Epoch 2, batch 21250, loss[loss=0.2397, simple_loss=0.2947, pruned_loss=0.09237, over 21579.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3286, pruned_loss=0.1131, over 4250284.63 frames. ], batch size: 263, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:43:11,963 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.405e+02 2.991e+02 3.536e+02 4.577e+02 9.525e+02, threshold=7.072e+02, percent-clipped=11.0 2023-06-18 21:44:08,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=310650.0, ans=10.0 2023-06-18 21:44:38,887 INFO [train.py:996] (2/4) Epoch 2, batch 21300, loss[loss=0.2967, simple_loss=0.3508, pruned_loss=0.1213, over 21915.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3337, pruned_loss=0.1155, over 4256583.24 frames. ], batch size: 316, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:45:09,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-18 21:46:16,033 INFO [train.py:996] (2/4) Epoch 2, batch 21350, loss[loss=0.2291, simple_loss=0.3111, pruned_loss=0.0735, over 21591.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.3399, pruned_loss=0.1169, over 4264303.51 frames. ], batch size: 230, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:46:30,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.675e+02 5.052e+02 5.900e+02 9.607e+02, threshold=1.010e+03, percent-clipped=12.0 2023-06-18 21:47:17,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-18 21:47:36,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=311310.0, ans=0.0 2023-06-18 21:47:53,580 INFO [train.py:996] (2/4) Epoch 2, batch 21400, loss[loss=0.3432, simple_loss=0.395, pruned_loss=0.1457, over 21332.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3434, pruned_loss=0.1162, over 4265734.54 frames. 
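The frequent "ScheduledFloat: name=..., batch_count=..., ans=..." lines from scaling.py print values that are functions of the global batch count: this far into training most probabilities have settled at 0.125 and most skip rates at 0.0. As a rough illustration only, a piecewise-linear schedule over batch_count can produce such values; the breakpoints below are invented for the example and are not the model's actual schedules.

# Hypothetical piecewise-linear schedule over the global batch count.
def scheduled_float(batch_count, schedule):
    """schedule: list of (batch_count, value) pairs, sorted by batch_count."""
    if batch_count <= schedule[0][0]:
        return schedule[0][1]
    for (x0, y0), (x1, y1) in zip(schedule, schedule[1:]):
        if batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    return schedule[-1][1]

# A dropout-like value that decays early in training and then stays constant:
print(scheduled_float(310170.0, [(0.0, 0.3), (20000.0, 0.125)]))  # -> 0.125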
], batch size: 549, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:48:31,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=311430.0, ans=0.0 2023-06-18 21:48:49,780 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-18 21:48:52,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-18 21:49:34,473 INFO [train.py:996] (2/4) Epoch 2, batch 21450, loss[loss=0.2816, simple_loss=0.329, pruned_loss=0.1171, over 20086.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3505, pruned_loss=0.1203, over 4274786.90 frames. ], batch size: 703, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:49:44,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=311670.0, ans=0.125 2023-06-18 21:49:46,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=311670.0, ans=0.125 2023-06-18 21:49:46,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=311670.0, ans=0.0 2023-06-18 21:49:48,712 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.034e+02 3.817e+02 5.220e+02 1.129e+03, threshold=7.634e+02, percent-clipped=2.0 2023-06-18 21:50:00,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=311730.0, ans=0.1 2023-06-18 21:50:02,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.25 vs. limit=15.0 2023-06-18 21:50:07,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=311730.0, ans=0.125 2023-06-18 21:50:25,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.76 vs. limit=10.0 2023-06-18 21:50:52,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=311910.0, ans=0.0 2023-06-18 21:50:53,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=311910.0, ans=0.125 2023-06-18 21:51:15,122 INFO [train.py:996] (2/4) Epoch 2, batch 21500, loss[loss=0.2549, simple_loss=0.3053, pruned_loss=0.1023, over 21223.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3486, pruned_loss=0.1214, over 4271488.12 frames. ], batch size: 176, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:52:37,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=312210.0, ans=0.125 2023-06-18 21:52:51,489 INFO [train.py:996] (2/4) Epoch 2, batch 21550, loss[loss=0.2902, simple_loss=0.3443, pruned_loss=0.118, over 21735.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3408, pruned_loss=0.1178, over 4276494.12 frames. 
], batch size: 112, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:53:05,535 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.167e+02 4.080e+02 5.090e+02 8.174e+02, threshold=8.161e+02, percent-clipped=3.0 2023-06-18 21:53:12,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=312330.0, ans=0.0 2023-06-18 21:53:15,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=312330.0, ans=0.1 2023-06-18 21:53:19,321 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.82 vs. limit=22.5 2023-06-18 21:53:31,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=312390.0, ans=0.0 2023-06-18 21:53:52,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=312450.0, ans=0.2 2023-06-18 21:54:11,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=312510.0, ans=0.2 2023-06-18 21:54:26,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=312510.0, ans=0.0 2023-06-18 21:54:28,567 INFO [train.py:996] (2/4) Epoch 2, batch 21600, loss[loss=0.2462, simple_loss=0.3131, pruned_loss=0.0897, over 21179.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3349, pruned_loss=0.115, over 4273751.42 frames. ], batch size: 176, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:54:29,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=312570.0, ans=0.0 2023-06-18 21:54:43,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=312570.0, ans=0.1 2023-06-18 21:55:40,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=312750.0, ans=0.125 2023-06-18 21:55:46,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=312810.0, ans=0.2 2023-06-18 21:55:47,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=12.0 2023-06-18 21:56:04,252 INFO [train.py:996] (2/4) Epoch 2, batch 21650, loss[loss=0.3453, simple_loss=0.4176, pruned_loss=0.1365, over 21544.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3412, pruned_loss=0.1133, over 4267920.81 frames. ], batch size: 471, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:56:07,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.27 vs. limit=10.0 2023-06-18 21:56:18,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 2.957e+02 3.372e+02 4.177e+02 8.367e+02, threshold=6.745e+02, percent-clipped=1.0 2023-06-18 21:56:20,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-18 21:56:30,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. 
limit=15.0 2023-06-18 21:56:53,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.55 vs. limit=10.0 2023-06-18 21:57:33,685 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-18 21:57:34,165 INFO [train.py:996] (2/4) Epoch 2, batch 21700, loss[loss=0.2518, simple_loss=0.2999, pruned_loss=0.1018, over 21811.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3384, pruned_loss=0.1094, over 4262347.78 frames. ], batch size: 102, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:57:45,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=313170.0, ans=0.125 2023-06-18 21:57:45,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=313170.0, ans=0.0 2023-06-18 21:58:35,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=313350.0, ans=0.125 2023-06-18 21:58:35,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=313350.0, ans=0.0 2023-06-18 21:58:37,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313350.0, ans=0.1 2023-06-18 21:59:09,012 INFO [train.py:996] (2/4) Epoch 2, batch 21750, loss[loss=0.2508, simple_loss=0.3023, pruned_loss=0.09964, over 21491.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3344, pruned_loss=0.1111, over 4268055.15 frames. ], batch size: 212, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 21:59:28,452 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.954e+02 3.507e+02 4.548e+02 1.201e+03, threshold=7.014e+02, percent-clipped=5.0 2023-06-18 22:00:10,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=313650.0, ans=0.125 2023-06-18 22:00:20,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0 2023-06-18 22:00:21,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=313650.0, ans=0.125 2023-06-18 22:00:50,250 INFO [train.py:996] (2/4) Epoch 2, batch 21800, loss[loss=0.2626, simple_loss=0.3073, pruned_loss=0.109, over 21446.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3335, pruned_loss=0.1116, over 4255391.92 frames. 
], batch size: 212, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:00:52,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=313770.0, ans=0.0 2023-06-18 22:01:14,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=313830.0, ans=0.1 2023-06-18 22:01:26,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=313890.0, ans=0.1 2023-06-18 22:01:28,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=313890.0, ans=0.125 2023-06-18 22:01:35,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=313890.0, ans=0.125 2023-06-18 22:01:38,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=313890.0, ans=0.015 2023-06-18 22:01:57,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=313950.0, ans=0.1 2023-06-18 22:02:16,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=314010.0, ans=0.125 2023-06-18 22:02:26,277 INFO [train.py:996] (2/4) Epoch 2, batch 21850, loss[loss=0.2777, simple_loss=0.3384, pruned_loss=0.1084, over 21748.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3396, pruned_loss=0.1129, over 4248200.71 frames. ], batch size: 247, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:02:40,362 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 3.114e+02 3.894e+02 4.746e+02 8.265e+02, threshold=7.787e+02, percent-clipped=3.0 2023-06-18 22:03:22,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-18 22:03:31,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=314250.0, ans=0.125 2023-06-18 22:03:46,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=314310.0, ans=0.125 2023-06-18 22:03:48,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.61 vs. limit=15.0 2023-06-18 22:04:06,596 INFO [train.py:996] (2/4) Epoch 2, batch 21900, loss[loss=0.2985, simple_loss=0.336, pruned_loss=0.1305, over 21547.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.341, pruned_loss=0.115, over 4260518.26 frames. 
], batch size: 212, lr: 1.55e-02, grad_scale: 32.0 2023-06-18 22:04:08,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=314370.0, ans=0.125 2023-06-18 22:04:08,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=314370.0, ans=10.0 2023-06-18 22:04:14,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=314370.0, ans=0.0 2023-06-18 22:04:28,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=314430.0, ans=0.2 2023-06-18 22:05:16,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-18 22:05:23,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=314610.0, ans=0.05 2023-06-18 22:05:36,435 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-18 22:05:36,737 INFO [train.py:996] (2/4) Epoch 2, batch 21950, loss[loss=0.1864, simple_loss=0.2589, pruned_loss=0.05689, over 21645.00 frames. ], tot_loss[loss=0.281, simple_loss=0.3351, pruned_loss=0.1134, over 4263914.26 frames. ], batch size: 263, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:05:50,668 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.147e+02 2.916e+02 3.382e+02 4.376e+02 8.385e+02, threshold=6.764e+02, percent-clipped=2.0 2023-06-18 22:06:01,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.99 vs. limit=15.0 2023-06-18 22:06:01,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=314730.0, ans=0.125 2023-06-18 22:06:16,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.02 vs. limit=15.0 2023-06-18 22:06:34,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=314850.0, ans=0.125 2023-06-18 22:06:41,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=314850.0, ans=0.125 2023-06-18 22:07:14,897 INFO [train.py:996] (2/4) Epoch 2, batch 22000, loss[loss=0.2904, simple_loss=0.3385, pruned_loss=0.1211, over 21457.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3281, pruned_loss=0.1093, over 4264579.51 frames. 
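The "Whitening: name=..., metric=X vs. limit=Y" lines compare a measured whiteness statistic of a module's activations against a scheduled limit; broadly, such a metric is near 1 when the channel covariance looks like a multiple of the identity and grows toward the number of channels as the channels become strongly correlated, and only values above the limit attract an auxiliary penalty. The function below is a rough stand-alone illustration of such a statistic, not the scaling.py implementation.

import torch

# Hypothetical whiteness statistic: ~1 for decorrelated channels with equal
# variance, up to num_channels for fully correlated channels.
def whiteness_metric(x: torch.Tensor) -> torch.Tensor:
    # x: (num_frames, num_channels) activations
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]                 # channel covariance
    mean_diag = cov.diagonal().mean()            # average channel variance
    mean_sq = (cov ** 2).mean()                  # average squared covariance entry
    return mean_sq * cov.shape[0] / (mean_diag ** 2 + 1e-20)

white = torch.randn(4000, 256)
correlated = torch.randn(4000, 1).expand(-1, 256)
print(whiteness_metric(white))        # a little above 1
print(whiteness_metric(correlated))   # close to 256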
], batch size: 389, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:07:15,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=314970.0, ans=0.125 2023-06-18 22:07:17,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=314970.0, ans=0.0 2023-06-18 22:07:28,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=314970.0, ans=0.125 2023-06-18 22:07:54,836 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:08:00,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-18 22:08:03,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=315090.0, ans=0.2 2023-06-18 22:08:35,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=315210.0, ans=0.0 2023-06-18 22:08:37,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=315210.0, ans=0.125 2023-06-18 22:09:01,076 INFO [train.py:996] (2/4) Epoch 2, batch 22050, loss[loss=0.3648, simple_loss=0.4079, pruned_loss=0.1608, over 21248.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3322, pruned_loss=0.1107, over 4248190.46 frames. ], batch size: 159, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:09:10,809 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 3.309e+02 5.007e+02 6.581e+02 1.076e+03, threshold=1.001e+03, percent-clipped=24.0 2023-06-18 22:09:39,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=315390.0, ans=0.2 2023-06-18 22:09:46,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=315390.0, ans=0.125 2023-06-18 22:10:10,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=315450.0, ans=0.125 2023-06-18 22:10:39,910 INFO [train.py:996] (2/4) Epoch 2, batch 22100, loss[loss=0.3072, simple_loss=0.3644, pruned_loss=0.1251, over 21694.00 frames. ], tot_loss[loss=0.2929, simple_loss=0.3476, pruned_loss=0.1191, over 4249361.74 frames. ], batch size: 389, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:11:07,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=315630.0, ans=0.125 2023-06-18 22:12:12,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=315810.0, ans=0.1 2023-06-18 22:12:17,300 INFO [train.py:996] (2/4) Epoch 2, batch 22150, loss[loss=0.294, simple_loss=0.3534, pruned_loss=0.1173, over 21577.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3499, pruned_loss=0.1207, over 4268244.66 frames. 
], batch size: 195, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:12:22,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=315870.0, ans=0.07 2023-06-18 22:12:26,472 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.397e+02 4.124e+02 4.864e+02 1.101e+03, threshold=8.247e+02, percent-clipped=1.0 2023-06-18 22:12:52,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=315990.0, ans=0.125 2023-06-18 22:12:52,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=315990.0, ans=0.125 2023-06-18 22:13:15,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=316050.0, ans=10.0 2023-06-18 22:13:21,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=316050.0, ans=0.125 2023-06-18 22:13:45,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=316110.0, ans=0.1 2023-06-18 22:13:47,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-18 22:13:52,421 INFO [train.py:996] (2/4) Epoch 2, batch 22200, loss[loss=0.3053, simple_loss=0.3759, pruned_loss=0.1174, over 21821.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3516, pruned_loss=0.1219, over 4277812.66 frames. ], batch size: 282, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:15:13,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=316410.0, ans=0.125 2023-06-18 22:15:33,133 INFO [train.py:996] (2/4) Epoch 2, batch 22250, loss[loss=0.3746, simple_loss=0.4274, pruned_loss=0.1609, over 21806.00 frames. ], tot_loss[loss=0.3042, simple_loss=0.3599, pruned_loss=0.1243, over 4285498.14 frames. ], batch size: 124, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:15:43,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 2.999e+02 3.846e+02 4.976e+02 1.173e+03, threshold=7.692e+02, percent-clipped=5.0 2023-06-18 22:16:28,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.53 vs. limit=6.0 2023-06-18 22:16:28,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.62 vs. limit=6.0 2023-06-18 22:16:30,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=316650.0, ans=0.04949747468305833 2023-06-18 22:16:43,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.16 vs. 
limit=15.0 2023-06-18 22:16:44,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=316710.0, ans=0.0 2023-06-18 22:16:50,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=316710.0, ans=0.0 2023-06-18 22:17:00,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.26 vs. limit=6.0 2023-06-18 22:17:08,373 INFO [train.py:996] (2/4) Epoch 2, batch 22300, loss[loss=0.2998, simple_loss=0.3439, pruned_loss=0.1278, over 21873.00 frames. ], tot_loss[loss=0.3083, simple_loss=0.3621, pruned_loss=0.1273, over 4291018.91 frames. ], batch size: 298, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:18:03,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5 2023-06-18 22:18:32,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=317010.0, ans=0.125 2023-06-18 22:18:42,790 INFO [train.py:996] (2/4) Epoch 2, batch 22350, loss[loss=0.2727, simple_loss=0.3353, pruned_loss=0.105, over 21640.00 frames. ], tot_loss[loss=0.3065, simple_loss=0.359, pruned_loss=0.127, over 4298110.64 frames. ], batch size: 263, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:18:53,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.080e+02 3.447e+02 4.334e+02 7.080e+02, threshold=6.895e+02, percent-clipped=0.0 2023-06-18 22:18:57,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=317130.0, ans=0.125 2023-06-18 22:18:58,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.56 vs. limit=15.0 2023-06-18 22:19:23,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=317190.0, ans=0.125 2023-06-18 22:20:19,424 INFO [train.py:996] (2/4) Epoch 2, batch 22400, loss[loss=0.2605, simple_loss=0.3236, pruned_loss=0.09867, over 21400.00 frames. ], tot_loss[loss=0.2989, simple_loss=0.354, pruned_loss=0.1218, over 4290388.24 frames. ], batch size: 131, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:20:51,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=317430.0, ans=0.0 2023-06-18 22:21:34,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=317610.0, ans=0.0 2023-06-18 22:21:36,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-18 22:21:41,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.02 vs. limit=22.5 2023-06-18 22:21:47,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=317610.0, ans=0.0 2023-06-18 22:21:49,915 INFO [train.py:996] (2/4) Epoch 2, batch 22450, loss[loss=0.2465, simple_loss=0.3012, pruned_loss=0.09584, over 21654.00 frames. ], tot_loss[loss=0.2935, simple_loss=0.347, pruned_loss=0.12, over 4277339.79 frames. 
], batch size: 282, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:22:00,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.085e+02 3.597e+02 4.516e+02 1.181e+03, threshold=7.194e+02, percent-clipped=2.0 2023-06-18 22:22:13,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=317730.0, ans=0.125 2023-06-18 22:23:09,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=317850.0, ans=0.1 2023-06-18 22:23:19,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=317910.0, ans=0.125 2023-06-18 22:23:21,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=317910.0, ans=0.125 2023-06-18 22:23:28,631 INFO [train.py:996] (2/4) Epoch 2, batch 22500, loss[loss=0.3713, simple_loss=0.4391, pruned_loss=0.1518, over 21600.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3439, pruned_loss=0.1197, over 4274919.74 frames. ], batch size: 414, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:23:35,815 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 22:24:22,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=318090.0, ans=0.125 2023-06-18 22:24:23,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-18 22:25:10,128 INFO [train.py:996] (2/4) Epoch 2, batch 22550, loss[loss=0.3242, simple_loss=0.3742, pruned_loss=0.1371, over 21834.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3495, pruned_loss=0.1201, over 4276103.56 frames. ], batch size: 441, lr: 1.54e-02, grad_scale: 32.0 2023-06-18 22:25:14,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=318270.0, ans=0.125 2023-06-18 22:25:26,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.552e+02 4.294e+02 6.006e+02 1.237e+03, threshold=8.588e+02, percent-clipped=14.0 2023-06-18 22:25:55,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=318390.0, ans=0.0 2023-06-18 22:26:10,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=318390.0, ans=0.125 2023-06-18 22:26:25,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318450.0, ans=0.1 2023-06-18 22:26:40,687 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-18 22:26:44,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=318510.0, ans=0.1 2023-06-18 22:26:51,932 INFO [train.py:996] (2/4) Epoch 2, batch 22600, loss[loss=0.2938, simple_loss=0.3531, pruned_loss=0.1173, over 21791.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3535, pruned_loss=0.122, over 4282092.07 frames. 
], batch size: 332, lr: 1.54e-02, grad_scale: 16.0 2023-06-18 22:27:08,435 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=15.0 2023-06-18 22:28:04,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=318750.0, ans=0.125 2023-06-18 22:28:07,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=318810.0, ans=0.1 2023-06-18 22:28:29,441 INFO [train.py:996] (2/4) Epoch 2, batch 22650, loss[loss=0.244, simple_loss=0.2963, pruned_loss=0.09582, over 21547.00 frames. ], tot_loss[loss=0.2975, simple_loss=0.3512, pruned_loss=0.1218, over 4270097.52 frames. ], batch size: 263, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:28:41,855 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 3.125e+02 3.820e+02 4.472e+02 8.562e+02, threshold=7.640e+02, percent-clipped=0.0 2023-06-18 22:28:59,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=318930.0, ans=0.125 2023-06-18 22:29:47,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=319110.0, ans=0.0 2023-06-18 22:29:49,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=319110.0, ans=0.125 2023-06-18 22:30:07,835 INFO [train.py:996] (2/4) Epoch 2, batch 22700, loss[loss=0.282, simple_loss=0.3316, pruned_loss=0.1162, over 20075.00 frames. ], tot_loss[loss=0.2944, simple_loss=0.3458, pruned_loss=0.1214, over 4270752.77 frames. ], batch size: 703, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:30:56,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=319290.0, ans=0.0 2023-06-18 22:30:56,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2023-06-18 22:30:59,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-18 22:31:13,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=319350.0, ans=0.0 2023-06-18 22:31:46,698 INFO [train.py:996] (2/4) Epoch 2, batch 22750, loss[loss=0.3464, simple_loss=0.385, pruned_loss=0.1539, over 21804.00 frames. ], tot_loss[loss=0.2962, simple_loss=0.3466, pruned_loss=0.1229, over 4268896.59 frames. ], batch size: 282, lr: 1.53e-02, grad_scale: 16.0 2023-06-18 22:31:58,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=319470.0, ans=0.125 2023-06-18 22:31:59,217 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.020e+02 3.645e+02 4.350e+02 9.693e+02, threshold=7.290e+02, percent-clipped=3.0 2023-06-18 22:32:00,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.05 vs. 
limit=15.0 2023-06-18 22:32:34,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=319590.0, ans=0.2 2023-06-18 22:32:34,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=319590.0, ans=0.125 2023-06-18 22:32:45,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=319590.0, ans=0.125 2023-06-18 22:33:06,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=319710.0, ans=0.2 2023-06-18 22:33:15,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=319710.0, ans=0.0 2023-06-18 22:33:24,564 INFO [train.py:996] (2/4) Epoch 2, batch 22800, loss[loss=0.2896, simple_loss=0.3429, pruned_loss=0.1182, over 20835.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3507, pruned_loss=0.1248, over 4272136.33 frames. ], batch size: 607, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:34:20,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=319890.0, ans=0.125 2023-06-18 22:34:20,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=319890.0, ans=0.04949747468305833 2023-06-18 22:35:01,230 INFO [train.py:996] (2/4) Epoch 2, batch 22850, loss[loss=0.2502, simple_loss=0.3055, pruned_loss=0.09741, over 21299.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3472, pruned_loss=0.1241, over 4274466.09 frames. ], batch size: 131, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:35:03,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=320070.0, ans=0.125 2023-06-18 22:35:18,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.149e+02 3.783e+02 4.416e+02 9.029e+02, threshold=7.565e+02, percent-clipped=2.0 2023-06-18 22:35:28,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=320130.0, ans=0.125 2023-06-18 22:35:46,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=320190.0, ans=0.125 2023-06-18 22:35:52,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=22.5 2023-06-18 22:36:06,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=320250.0, ans=0.0 2023-06-18 22:36:29,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=320310.0, ans=0.04949747468305833 2023-06-18 22:36:32,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=22.5 2023-06-18 22:36:38,195 INFO [train.py:996] (2/4) Epoch 2, batch 22900, loss[loss=0.2886, simple_loss=0.3869, pruned_loss=0.09517, over 21893.00 frames. ], tot_loss[loss=0.3, simple_loss=0.352, pruned_loss=0.1239, over 4272885.41 frames. 
], batch size: 317, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:36:45,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=320370.0, ans=0.07 2023-06-18 22:36:46,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=320370.0, ans=0.125 2023-06-18 22:37:55,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-18 22:38:03,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=12.0 2023-06-18 22:38:19,518 INFO [train.py:996] (2/4) Epoch 2, batch 22950, loss[loss=0.3078, simple_loss=0.3761, pruned_loss=0.1197, over 21284.00 frames. ], tot_loss[loss=0.3016, simple_loss=0.3626, pruned_loss=0.1203, over 4268328.26 frames. ], batch size: 159, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:38:20,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-18 22:38:32,112 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.003e+02 3.018e+02 3.691e+02 4.786e+02 9.826e+02, threshold=7.383e+02, percent-clipped=2.0 2023-06-18 22:38:38,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=320730.0, ans=0.125 2023-06-18 22:38:52,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=320730.0, ans=0.07 2023-06-18 22:39:04,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=320790.0, ans=0.125 2023-06-18 22:39:29,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.55 vs. limit=12.0 2023-06-18 22:39:30,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=320850.0, ans=0.07 2023-06-18 22:39:36,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.22 vs. limit=8.0 2023-06-18 22:39:54,962 INFO [train.py:996] (2/4) Epoch 2, batch 23000, loss[loss=0.292, simple_loss=0.3534, pruned_loss=0.1153, over 21434.00 frames. ], tot_loss[loss=0.2977, simple_loss=0.3612, pruned_loss=0.1171, over 4264276.98 frames. ], batch size: 548, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:40:02,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=320970.0, ans=0.1 2023-06-18 22:40:16,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=321030.0, ans=0.07 2023-06-18 22:40:34,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=321090.0, ans=0.125 2023-06-18 22:40:37,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. 
limit=15.0 2023-06-18 22:40:44,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=321090.0, ans=0.125 2023-06-18 22:40:46,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=321090.0, ans=0.125 2023-06-18 22:41:31,907 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-18 22:41:32,470 INFO [train.py:996] (2/4) Epoch 2, batch 23050, loss[loss=0.3354, simple_loss=0.3906, pruned_loss=0.1401, over 21580.00 frames. ], tot_loss[loss=0.3013, simple_loss=0.3626, pruned_loss=0.12, over 4262353.24 frames. ], batch size: 389, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:41:54,749 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.225e+02 4.039e+02 5.218e+02 8.181e+02, threshold=8.078e+02, percent-clipped=3.0 2023-06-18 22:42:08,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=321330.0, ans=0.125 2023-06-18 22:42:57,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=321510.0, ans=0.2 2023-06-18 22:43:08,185 INFO [train.py:996] (2/4) Epoch 2, batch 23100, loss[loss=0.2648, simple_loss=0.3074, pruned_loss=0.1111, over 21805.00 frames. ], tot_loss[loss=0.2988, simple_loss=0.3572, pruned_loss=0.1202, over 4256049.60 frames. ], batch size: 98, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:43:29,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5 2023-06-18 22:44:38,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=321810.0, ans=0.0 2023-06-18 22:44:40,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=321810.0, ans=0.1 2023-06-18 22:44:42,564 INFO [train.py:996] (2/4) Epoch 2, batch 23150, loss[loss=0.2776, simple_loss=0.3353, pruned_loss=0.11, over 21490.00 frames. ], tot_loss[loss=0.2949, simple_loss=0.3507, pruned_loss=0.1196, over 4258169.15 frames. ], batch size: 131, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:44:58,931 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.109e+02 3.658e+02 4.384e+02 7.114e+02, threshold=7.315e+02, percent-clipped=0.0 2023-06-18 22:46:12,272 INFO [train.py:996] (2/4) Epoch 2, batch 23200, loss[loss=0.3457, simple_loss=0.3823, pruned_loss=0.1545, over 21761.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3507, pruned_loss=0.1209, over 4269990.27 frames. ], batch size: 441, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:47:01,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=322290.0, ans=0.125 2023-06-18 22:47:02,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=322290.0, ans=0.0 2023-06-18 22:47:17,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.02 vs. 
limit=15.0 2023-06-18 22:47:48,010 INFO [train.py:996] (2/4) Epoch 2, batch 23250, loss[loss=0.3153, simple_loss=0.3986, pruned_loss=0.116, over 19867.00 frames. ], tot_loss[loss=0.297, simple_loss=0.35, pruned_loss=0.122, over 4273066.24 frames. ], batch size: 702, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:48:00,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=322470.0, ans=0.0 2023-06-18 22:48:03,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=322470.0, ans=0.0 2023-06-18 22:48:04,720 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.062e+02 3.496e+02 4.224e+02 8.959e+02, threshold=6.992e+02, percent-clipped=2.0 2023-06-18 22:48:09,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322530.0, ans=0.1 2023-06-18 22:48:19,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-18 22:48:27,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=322590.0, ans=0.125 2023-06-18 22:48:35,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=322590.0, ans=0.125 2023-06-18 22:49:05,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=322710.0, ans=0.2 2023-06-18 22:49:25,577 INFO [train.py:996] (2/4) Epoch 2, batch 23300, loss[loss=0.3181, simple_loss=0.4068, pruned_loss=0.1147, over 21783.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3572, pruned_loss=0.124, over 4280014.61 frames. ], batch size: 282, lr: 1.53e-02, grad_scale: 32.0 2023-06-18 22:49:48,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=15.0 2023-06-18 22:50:03,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=322890.0, ans=0.1 2023-06-18 22:50:06,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=322890.0, ans=0.0 2023-06-18 22:50:31,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=322950.0, ans=0.125 2023-06-18 22:51:02,221 INFO [train.py:996] (2/4) Epoch 2, batch 23350, loss[loss=0.2095, simple_loss=0.2748, pruned_loss=0.07211, over 21289.00 frames. ], tot_loss[loss=0.3033, simple_loss=0.3607, pruned_loss=0.1229, over 4272121.73 frames. 
], batch size: 176, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:51:04,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=323070.0, ans=0.125 2023-06-18 22:51:18,958 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 3.248e+02 3.923e+02 4.916e+02 7.049e+02, threshold=7.847e+02, percent-clipped=1.0 2023-06-18 22:51:19,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=323070.0, ans=0.125 2023-06-18 22:51:27,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.75 vs. limit=10.0 2023-06-18 22:51:47,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=323190.0, ans=0.95 2023-06-18 22:51:50,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=323190.0, ans=0.125 2023-06-18 22:52:24,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=22.5 2023-06-18 22:52:37,487 INFO [train.py:996] (2/4) Epoch 2, batch 23400, loss[loss=0.2853, simple_loss=0.3378, pruned_loss=0.1164, over 21829.00 frames. ], tot_loss[loss=0.2956, simple_loss=0.3544, pruned_loss=0.1185, over 4274812.97 frames. ], batch size: 247, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:52:54,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=323370.0, ans=0.0 2023-06-18 22:54:08,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-18 22:54:20,635 INFO [train.py:996] (2/4) Epoch 2, batch 23450, loss[loss=0.3162, simple_loss=0.3694, pruned_loss=0.1316, over 21927.00 frames. ], tot_loss[loss=0.2996, simple_loss=0.356, pruned_loss=0.1216, over 4280288.25 frames. ], batch size: 316, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:54:33,419 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.119e+02 3.774e+02 4.736e+02 8.725e+02, threshold=7.548e+02, percent-clipped=2.0 2023-06-18 22:55:59,327 INFO [train.py:996] (2/4) Epoch 2, batch 23500, loss[loss=0.278, simple_loss=0.3335, pruned_loss=0.1112, over 21413.00 frames. ], tot_loss[loss=0.3041, simple_loss=0.3583, pruned_loss=0.1249, over 4283377.47 frames. ], batch size: 211, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:55:59,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=323970.0, ans=0.125 2023-06-18 22:56:14,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=324030.0, ans=0.125 2023-06-18 22:56:24,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.83 vs. limit=22.5 2023-06-18 22:56:31,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.84 vs. 
limit=15.0 2023-06-18 22:56:43,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=324090.0, ans=0.125 2023-06-18 22:57:15,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=324210.0, ans=0.2 2023-06-18 22:57:36,240 INFO [train.py:996] (2/4) Epoch 2, batch 23550, loss[loss=0.2716, simple_loss=0.3213, pruned_loss=0.1109, over 21793.00 frames. ], tot_loss[loss=0.3037, simple_loss=0.3565, pruned_loss=0.1255, over 4277400.45 frames. ], batch size: 351, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:57:48,494 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.262e+02 3.808e+02 4.439e+02 7.936e+02, threshold=7.617e+02, percent-clipped=1.0 2023-06-18 22:58:12,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=324390.0, ans=0.04949747468305833 2023-06-18 22:58:14,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=324390.0, ans=0.04949747468305833 2023-06-18 22:58:25,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=324390.0, ans=0.1 2023-06-18 22:58:55,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=324510.0, ans=0.0 2023-06-18 22:59:03,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=324510.0, ans=0.125 2023-06-18 22:59:03,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=324510.0, ans=0.0 2023-06-18 22:59:14,002 INFO [train.py:996] (2/4) Epoch 2, batch 23600, loss[loss=0.3432, simple_loss=0.3936, pruned_loss=0.1464, over 21574.00 frames. ], tot_loss[loss=0.3023, simple_loss=0.3548, pruned_loss=0.1249, over 4269975.18 frames. ], batch size: 414, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 22:59:20,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=324570.0, ans=0.0 2023-06-18 22:59:41,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=324630.0, ans=10.0 2023-06-18 22:59:43,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=324630.0, ans=0.0 2023-06-18 22:59:47,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=324630.0, ans=0.0 2023-06-18 23:00:12,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=324690.0, ans=0.1 2023-06-18 23:00:57,798 INFO [train.py:996] (2/4) Epoch 2, batch 23650, loss[loss=0.2634, simple_loss=0.3349, pruned_loss=0.09594, over 21782.00 frames. ], tot_loss[loss=0.2983, simple_loss=0.353, pruned_loss=0.1218, over 4261331.30 frames. 
], batch size: 247, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:01:10,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 3.434e+02 4.143e+02 5.445e+02 9.457e+02, threshold=8.285e+02, percent-clipped=4.0 2023-06-18 23:01:55,157 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:02:17,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=325110.0, ans=0.125 2023-06-18 23:02:34,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=325110.0, ans=0.2 2023-06-18 23:02:36,799 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-06-18 23:02:39,089 INFO [train.py:996] (2/4) Epoch 2, batch 23700, loss[loss=0.3232, simple_loss=0.3782, pruned_loss=0.1341, over 21430.00 frames. ], tot_loss[loss=0.2952, simple_loss=0.3533, pruned_loss=0.1185, over 4262054.76 frames. ], batch size: 471, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:02:51,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=325170.0, ans=0.125 2023-06-18 23:03:33,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.61 vs. limit=10.0 2023-06-18 23:04:11,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=325410.0, ans=0.0 2023-06-18 23:04:11,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=325410.0, ans=0.1 2023-06-18 23:04:20,811 INFO [train.py:996] (2/4) Epoch 2, batch 23750, loss[loss=0.2506, simple_loss=0.3399, pruned_loss=0.08066, over 21649.00 frames. ], tot_loss[loss=0.2997, simple_loss=0.3577, pruned_loss=0.1209, over 4266582.82 frames. ], batch size: 263, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:04:38,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.156e+02 3.937e+02 4.892e+02 1.167e+03, threshold=7.875e+02, percent-clipped=3.0 2023-06-18 23:04:53,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=325530.0, ans=0.1 2023-06-18 23:04:59,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=325530.0, ans=0.125 2023-06-18 23:05:06,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-18 23:05:13,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.46 vs. limit=15.0 2023-06-18 23:05:34,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=325650.0, ans=0.125 2023-06-18 23:05:37,178 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-18 23:06:06,061 INFO [train.py:996] (2/4) Epoch 2, batch 23800, loss[loss=0.3658, simple_loss=0.4283, pruned_loss=0.1516, over 21694.00 frames. 
], tot_loss[loss=0.2953, simple_loss=0.3552, pruned_loss=0.1177, over 4268864.79 frames. ], batch size: 332, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:06:14,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=325770.0, ans=0.0 2023-06-18 23:06:22,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-18 23:07:01,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=325890.0, ans=0.125 2023-06-18 23:07:20,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=325950.0, ans=0.0 2023-06-18 23:07:20,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=325950.0, ans=0.1 2023-06-18 23:07:20,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=325950.0, ans=0.0 2023-06-18 23:07:51,527 INFO [train.py:996] (2/4) Epoch 2, batch 23850, loss[loss=0.3474, simple_loss=0.3986, pruned_loss=0.1481, over 21729.00 frames. ], tot_loss[loss=0.3051, simple_loss=0.3668, pruned_loss=0.1217, over 4269566.31 frames. ], batch size: 298, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:08:09,326 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.883e+02 3.353e+02 4.244e+02 5.255e+02 8.980e+02, threshold=8.488e+02, percent-clipped=3.0 2023-06-18 23:08:19,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=326130.0, ans=0.2 2023-06-18 23:08:48,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=326250.0, ans=0.125 2023-06-18 23:09:30,863 INFO [train.py:996] (2/4) Epoch 2, batch 23900, loss[loss=0.37, simple_loss=0.4322, pruned_loss=0.1539, over 21464.00 frames. ], tot_loss[loss=0.3126, simple_loss=0.3748, pruned_loss=0.1252, over 4274799.06 frames. ], batch size: 471, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:09:31,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=326370.0, ans=0.125 2023-06-18 23:11:09,861 INFO [train.py:996] (2/4) Epoch 2, batch 23950, loss[loss=0.3047, simple_loss=0.3715, pruned_loss=0.1189, over 21329.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3672, pruned_loss=0.124, over 4269179.97 frames. 
], batch size: 131, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:11:10,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=326670.0, ans=10.0 2023-06-18 23:11:25,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=326670.0, ans=0.0 2023-06-18 23:11:28,008 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.116e+02 3.830e+02 4.465e+02 7.558e+02, threshold=7.660e+02, percent-clipped=0.0 2023-06-18 23:11:33,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=326730.0, ans=0.1 2023-06-18 23:12:46,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=326910.0, ans=0.125 2023-06-18 23:12:50,763 INFO [train.py:996] (2/4) Epoch 2, batch 24000, loss[loss=0.3557, simple_loss=0.4154, pruned_loss=0.148, over 21803.00 frames. ], tot_loss[loss=0.3134, simple_loss=0.3694, pruned_loss=0.1287, over 4267227.96 frames. ], batch size: 118, lr: 1.52e-02, grad_scale: 32.0 2023-06-18 23:12:50,764 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-18 23:13:09,352 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2897, simple_loss=0.3899, pruned_loss=0.09475, over 1796401.00 frames. 2023-06-18 23:13:09,352 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-18 23:14:16,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=327150.0, ans=0.125 2023-06-18 23:14:17,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=327150.0, ans=0.125 2023-06-18 23:14:27,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=327210.0, ans=0.125 2023-06-18 23:14:27,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=327210.0, ans=0.2 2023-06-18 23:14:29,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-18 23:14:48,746 INFO [train.py:996] (2/4) Epoch 2, batch 24050, loss[loss=0.223, simple_loss=0.301, pruned_loss=0.0725, over 21360.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.371, pruned_loss=0.1286, over 4271442.45 frames. 
], batch size: 131, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:15:05,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=327270.0, ans=0.1 2023-06-18 23:15:06,327 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.521e+02 3.470e+02 4.161e+02 4.943e+02 1.064e+03, threshold=8.323e+02, percent-clipped=4.0 2023-06-18 23:15:18,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=327330.0, ans=0.0 2023-06-18 23:15:26,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=327330.0, ans=0.1 2023-06-18 23:15:56,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=327450.0, ans=0.125 2023-06-18 23:15:56,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=327450.0, ans=0.0 2023-06-18 23:16:01,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-18 23:16:09,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=327510.0, ans=0.015 2023-06-18 23:16:34,967 INFO [train.py:996] (2/4) Epoch 2, batch 24100, loss[loss=0.3217, simple_loss=0.3738, pruned_loss=0.1348, over 21405.00 frames. ], tot_loss[loss=0.3103, simple_loss=0.3699, pruned_loss=0.1253, over 4277723.34 frames. ], batch size: 211, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:17:00,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=327630.0, ans=0.125 2023-06-18 23:18:14,220 INFO [train.py:996] (2/4) Epoch 2, batch 24150, loss[loss=0.3217, simple_loss=0.3668, pruned_loss=0.1383, over 21887.00 frames. ], tot_loss[loss=0.3131, simple_loss=0.3702, pruned_loss=0.128, over 4282940.30 frames. ], batch size: 371, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:18:26,898 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.990e+02 3.404e+02 4.259e+02 8.342e+02, threshold=6.809e+02, percent-clipped=1.0 2023-06-18 23:18:34,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=327930.0, ans=0.125 2023-06-18 23:18:52,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=327930.0, ans=0.5 2023-06-18 23:19:39,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=328110.0, ans=0.2 2023-06-18 23:19:55,233 INFO [train.py:996] (2/4) Epoch 2, batch 24200, loss[loss=0.2619, simple_loss=0.3461, pruned_loss=0.08887, over 19930.00 frames. ], tot_loss[loss=0.3137, simple_loss=0.3706, pruned_loss=0.1283, over 4284567.63 frames. ], batch size: 703, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:20:07,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=328170.0, ans=0.125 2023-06-18 23:20:27,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. 
limit=12.0 2023-06-18 23:21:41,624 INFO [train.py:996] (2/4) Epoch 2, batch 24250, loss[loss=0.2329, simple_loss=0.315, pruned_loss=0.07542, over 21320.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3645, pruned_loss=0.1197, over 4276916.75 frames. ], batch size: 176, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:21:47,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=328470.0, ans=10.0 2023-06-18 23:21:58,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=328470.0, ans=0.1 2023-06-18 23:21:59,863 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.648e+02 3.026e+02 3.601e+02 5.036e+02 9.709e+02, threshold=7.202e+02, percent-clipped=3.0 2023-06-18 23:22:16,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=328530.0, ans=0.0 2023-06-18 23:22:32,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-18 23:22:54,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=328650.0, ans=0.125 2023-06-18 23:23:04,643 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-18 23:23:10,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=328710.0, ans=0.04949747468305833 2023-06-18 23:23:21,464 INFO [train.py:996] (2/4) Epoch 2, batch 24300, loss[loss=0.1799, simple_loss=0.2547, pruned_loss=0.05252, over 21534.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.354, pruned_loss=0.1111, over 4273560.75 frames. ], batch size: 212, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:23:32,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=328770.0, ans=0.0 2023-06-18 23:23:44,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=328830.0, ans=0.0 2023-06-18 23:24:48,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=329010.0, ans=15.0 2023-06-18 23:24:50,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=329010.0, ans=0.125 2023-06-18 23:25:04,603 INFO [train.py:996] (2/4) Epoch 2, batch 24350, loss[loss=0.4568, simple_loss=0.4676, pruned_loss=0.223, over 21515.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.3513, pruned_loss=0.1122, over 4282036.18 frames. ], batch size: 508, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:25:17,423 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.896e+02 3.511e+02 4.657e+02 9.016e+02, threshold=7.022e+02, percent-clipped=4.0 2023-06-18 23:25:30,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.86 vs. 
limit=15.0 2023-06-18 23:25:54,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=329190.0, ans=0.0 2023-06-18 23:25:57,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=329190.0, ans=0.125 2023-06-18 23:26:45,797 INFO [train.py:996] (2/4) Epoch 2, batch 24400, loss[loss=0.2885, simple_loss=0.3624, pruned_loss=0.1073, over 21747.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3594, pruned_loss=0.1177, over 4278382.38 frames. ], batch size: 298, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:27:04,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=329430.0, ans=0.05 2023-06-18 23:28:23,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=329610.0, ans=0.0 2023-06-18 23:28:25,851 INFO [train.py:996] (2/4) Epoch 2, batch 24450, loss[loss=0.2316, simple_loss=0.3043, pruned_loss=0.07949, over 21155.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.3586, pruned_loss=0.1186, over 4269999.61 frames. ], batch size: 143, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:28:34,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=329670.0, ans=0.125 2023-06-18 23:28:38,613 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.395e+02 4.151e+02 4.993e+02 8.571e+02, threshold=8.301e+02, percent-clipped=4.0 2023-06-18 23:28:47,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=329730.0, ans=0.0 2023-06-18 23:29:04,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=329730.0, ans=0.0 2023-06-18 23:29:10,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=329790.0, ans=0.125 2023-06-18 23:29:13,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=329790.0, ans=0.0 2023-06-18 23:30:04,437 INFO [train.py:996] (2/4) Epoch 2, batch 24500, loss[loss=0.3102, simple_loss=0.3607, pruned_loss=0.1299, over 21670.00 frames. ], tot_loss[loss=0.2987, simple_loss=0.3601, pruned_loss=0.1186, over 4275165.93 frames. ], batch size: 263, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:31:27,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=22.5 2023-06-18 23:31:34,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=330210.0, ans=0.125 2023-06-18 23:31:44,666 INFO [train.py:996] (2/4) Epoch 2, batch 24550, loss[loss=0.3735, simple_loss=0.4152, pruned_loss=0.1659, over 21610.00 frames. ], tot_loss[loss=0.3046, simple_loss=0.3643, pruned_loss=0.1224, over 4278731.73 frames. 
], batch size: 389, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:31:48,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=330270.0, ans=0.2 2023-06-18 23:32:01,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.061e+02 3.714e+02 4.494e+02 1.254e+03, threshold=7.429e+02, percent-clipped=1.0 2023-06-18 23:32:23,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-18 23:32:23,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=22.5 2023-06-18 23:33:18,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=330510.0, ans=0.125 2023-06-18 23:33:22,627 INFO [train.py:996] (2/4) Epoch 2, batch 24600, loss[loss=0.2582, simple_loss=0.3169, pruned_loss=0.09977, over 21694.00 frames. ], tot_loss[loss=0.3026, simple_loss=0.3596, pruned_loss=0.1228, over 4282139.29 frames. ], batch size: 298, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:33:59,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2023-06-18 23:34:20,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=330690.0, ans=0.0 2023-06-18 23:34:23,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=330750.0, ans=0.04949747468305833 2023-06-18 23:34:51,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=330810.0, ans=0.125 2023-06-18 23:34:55,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=330810.0, ans=0.125 2023-06-18 23:35:01,709 INFO [train.py:996] (2/4) Epoch 2, batch 24650, loss[loss=0.2873, simple_loss=0.3279, pruned_loss=0.1234, over 21475.00 frames. ], tot_loss[loss=0.2981, simple_loss=0.3524, pruned_loss=0.1219, over 4276501.60 frames. ], batch size: 441, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:35:02,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=330870.0, ans=0.0 2023-06-18 23:35:08,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=330870.0, ans=0.2 2023-06-18 23:35:19,504 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.192e+02 3.864e+02 5.203e+02 1.017e+03, threshold=7.727e+02, percent-clipped=5.0 2023-06-18 23:36:05,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=331050.0, ans=0.125 2023-06-18 23:36:29,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.56 vs. limit=6.0 2023-06-18 23:36:36,268 INFO [train.py:996] (2/4) Epoch 2, batch 24700, loss[loss=0.3233, simple_loss=0.3597, pruned_loss=0.1435, over 21258.00 frames. ], tot_loss[loss=0.296, simple_loss=0.3514, pruned_loss=0.1203, over 4266656.32 frames. 
], batch size: 471, lr: 1.51e-02, grad_scale: 64.0 2023-06-18 23:37:36,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=331290.0, ans=0.0 2023-06-18 23:37:37,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=331290.0, ans=0.1 2023-06-18 23:38:13,942 INFO [train.py:996] (2/4) Epoch 2, batch 24750, loss[loss=0.2859, simple_loss=0.328, pruned_loss=0.122, over 21683.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3425, pruned_loss=0.1159, over 4269266.15 frames. ], batch size: 333, lr: 1.51e-02, grad_scale: 32.0 2023-06-18 23:38:24,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=331470.0, ans=0.125 2023-06-18 23:38:33,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 3.050e+02 3.889e+02 4.934e+02 8.372e+02, threshold=7.777e+02, percent-clipped=3.0 2023-06-18 23:39:38,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.36 vs. limit=22.5 2023-06-18 23:39:47,056 INFO [train.py:996] (2/4) Epoch 2, batch 24800, loss[loss=0.3003, simple_loss=0.3467, pruned_loss=0.1269, over 21805.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3394, pruned_loss=0.1155, over 4276473.39 frames. ], batch size: 298, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:40:18,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=331830.0, ans=0.0 2023-06-18 23:41:04,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=331950.0, ans=0.125 2023-06-18 23:41:26,056 INFO [train.py:996] (2/4) Epoch 2, batch 24850, loss[loss=0.3395, simple_loss=0.4017, pruned_loss=0.1386, over 21328.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3411, pruned_loss=0.1166, over 4276609.00 frames. ], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:41:50,088 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 3.323e+02 4.351e+02 5.576e+02 8.938e+02, threshold=8.701e+02, percent-clipped=5.0 2023-06-18 23:42:20,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-18 23:42:37,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=332250.0, ans=0.125 2023-06-18 23:43:09,980 INFO [train.py:996] (2/4) Epoch 2, batch 24900, loss[loss=0.322, simple_loss=0.3758, pruned_loss=0.1341, over 21831.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3432, pruned_loss=0.1177, over 4273267.44 frames. 
], batch size: 282, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:43:57,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=332490.0, ans=0.125 2023-06-18 23:43:57,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=332490.0, ans=0.2 2023-06-18 23:44:05,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=332490.0, ans=0.125 2023-06-18 23:44:55,501 INFO [train.py:996] (2/4) Epoch 2, batch 24950, loss[loss=0.3262, simple_loss=0.3711, pruned_loss=0.1406, over 21626.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3526, pruned_loss=0.1229, over 4273648.43 frames. ], batch size: 263, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:44:59,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=332670.0, ans=0.1 2023-06-18 23:45:09,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=332670.0, ans=0.0 2023-06-18 23:45:12,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=12.0 2023-06-18 23:45:15,202 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.427e+02 4.669e+02 5.544e+02 9.304e+02, threshold=9.338e+02, percent-clipped=1.0 2023-06-18 23:45:41,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=332790.0, ans=0.125 2023-06-18 23:45:59,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.43 vs. limit=6.0 2023-06-18 23:46:20,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-18 23:46:40,052 INFO [train.py:996] (2/4) Epoch 2, batch 25000, loss[loss=0.3293, simple_loss=0.3772, pruned_loss=0.1407, over 21742.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3568, pruned_loss=0.1238, over 4270955.58 frames. ], batch size: 351, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:47:30,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=333150.0, ans=0.0 2023-06-18 23:48:07,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-18 23:48:17,029 INFO [train.py:996] (2/4) Epoch 2, batch 25050, loss[loss=0.2545, simple_loss=0.3036, pruned_loss=0.1027, over 21353.00 frames. ], tot_loss[loss=0.2971, simple_loss=0.3499, pruned_loss=0.1221, over 4266362.55 frames. ], batch size: 160, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:48:36,482 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.031e+02 3.621e+02 4.496e+02 7.145e+02, threshold=7.242e+02, percent-clipped=0.0 2023-06-18 23:49:19,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=16.96 vs. limit=15.0 2023-06-18 23:49:55,966 INFO [train.py:996] (2/4) Epoch 2, batch 25100, loss[loss=0.2852, simple_loss=0.3523, pruned_loss=0.1091, over 21638.00 frames. 
], tot_loss[loss=0.2923, simple_loss=0.3438, pruned_loss=0.1205, over 4266541.69 frames. ], batch size: 332, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:50:03,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=18.14 vs. limit=15.0 2023-06-18 23:51:00,559 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:51:28,135 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:51:33,448 INFO [train.py:996] (2/4) Epoch 2, batch 25150, loss[loss=0.27, simple_loss=0.3408, pruned_loss=0.09966, over 21901.00 frames. ], tot_loss[loss=0.2893, simple_loss=0.3451, pruned_loss=0.1167, over 4256878.07 frames. ], batch size: 316, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:51:38,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-18 23:51:48,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.104e+02 3.003e+02 3.483e+02 4.487e+02 9.549e+02, threshold=6.965e+02, percent-clipped=3.0 2023-06-18 23:52:01,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=333930.0, ans=0.0 2023-06-18 23:52:44,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=334110.0, ans=0.0 2023-06-18 23:52:59,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-18 23:52:59,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=22.5 2023-06-18 23:53:11,756 INFO [train.py:996] (2/4) Epoch 2, batch 25200, loss[loss=0.2231, simple_loss=0.3083, pruned_loss=0.06892, over 21469.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3432, pruned_loss=0.1131, over 4254027.97 frames. ], batch size: 211, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:53:48,064 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-18 23:54:23,302 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-18 23:54:39,797 INFO [train.py:996] (2/4) Epoch 2, batch 25250, loss[loss=0.2428, simple_loss=0.2906, pruned_loss=0.09747, over 21188.00 frames. ], tot_loss[loss=0.2823, simple_loss=0.3413, pruned_loss=0.1116, over 4252643.49 frames. 
], batch size: 548, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:55:04,414 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.756e+02 3.618e+02 4.524e+02 8.260e+02, threshold=7.237e+02, percent-clipped=4.0 2023-06-18 23:55:19,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=334530.0, ans=0.5 2023-06-18 23:55:33,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=334590.0, ans=0.1 2023-06-18 23:55:41,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=334650.0, ans=0.125 2023-06-18 23:56:06,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-18 23:56:24,664 INFO [train.py:996] (2/4) Epoch 2, batch 25300, loss[loss=0.366, simple_loss=0.413, pruned_loss=0.1595, over 21622.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3395, pruned_loss=0.1117, over 4244686.01 frames. ], batch size: 389, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:56:37,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.94 vs. limit=8.0 2023-06-18 23:57:04,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.58 vs. limit=15.0 2023-06-18 23:57:47,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-18 23:58:10,270 INFO [train.py:996] (2/4) Epoch 2, batch 25350, loss[loss=0.3056, simple_loss=0.36, pruned_loss=0.1256, over 21528.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3429, pruned_loss=0.1121, over 4253929.28 frames. ], batch size: 441, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:58:15,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=335070.0, ans=0.125 2023-06-18 23:58:29,564 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.861e+02 3.471e+02 4.257e+02 9.448e+02, threshold=6.941e+02, percent-clipped=2.0 2023-06-18 23:58:33,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=22.5 2023-06-18 23:58:53,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=335190.0, ans=0.1 2023-06-18 23:59:04,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.83 vs. limit=15.0 2023-06-18 23:59:14,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=335250.0, ans=0.125 2023-06-18 23:59:42,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=335310.0, ans=0.0 2023-06-18 23:59:44,367 INFO [train.py:996] (2/4) Epoch 2, batch 25400, loss[loss=0.2933, simple_loss=0.3425, pruned_loss=0.1221, over 21699.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3378, pruned_loss=0.1107, over 4248105.64 frames. 
], batch size: 282, lr: 1.50e-02, grad_scale: 32.0 2023-06-18 23:59:51,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=335370.0, ans=0.1 2023-06-18 23:59:51,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=335370.0, ans=0.1 2023-06-18 23:59:57,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=335370.0, ans=0.125 2023-06-19 00:00:01,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=335370.0, ans=0.0 2023-06-19 00:01:22,845 INFO [train.py:996] (2/4) Epoch 2, batch 25450, loss[loss=0.3472, simple_loss=0.4137, pruned_loss=0.1403, over 21695.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3404, pruned_loss=0.1131, over 4245257.51 frames. ], batch size: 441, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 00:01:23,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=335670.0, ans=0.0 2023-06-19 00:01:47,349 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.942e+02 3.491e+02 4.451e+02 7.396e+02, threshold=6.982e+02, percent-clipped=1.0 2023-06-19 00:01:59,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=335730.0, ans=0.125 2023-06-19 00:02:01,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.66 vs. limit=6.0 2023-06-19 00:02:14,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=335790.0, ans=10.0 2023-06-19 00:03:09,113 INFO [train.py:996] (2/4) Epoch 2, batch 25500, loss[loss=0.3149, simple_loss=0.3922, pruned_loss=0.1188, over 21499.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3386, pruned_loss=0.1079, over 4237806.85 frames. ], batch size: 471, lr: 1.50e-02, grad_scale: 32.0 2023-06-19 00:03:25,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=335970.0, ans=0.0 2023-06-19 00:03:51,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=336090.0, ans=0.125 2023-06-19 00:04:43,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-19 00:04:56,945 INFO [train.py:996] (2/4) Epoch 2, batch 25550, loss[loss=0.2881, simple_loss=0.3883, pruned_loss=0.09401, over 21631.00 frames. ], tot_loss[loss=0.2842, simple_loss=0.3479, pruned_loss=0.1103, over 4237128.59 frames. ], batch size: 414, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:04:59,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.60 vs. limit=15.0 2023-06-19 00:05:12,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.776e+02 2.591e+02 3.118e+02 3.638e+02 5.445e+02, threshold=6.236e+02, percent-clipped=0.0 2023-06-19 00:05:21,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. 
limit=10.0 2023-06-19 00:06:38,379 INFO [train.py:996] (2/4) Epoch 2, batch 25600, loss[loss=0.3146, simple_loss=0.37, pruned_loss=0.1296, over 21838.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3521, pruned_loss=0.1113, over 4253349.47 frames. ], batch size: 282, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:07:37,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=336750.0, ans=0.0 2023-06-19 00:07:44,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-19 00:08:14,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=336810.0, ans=0.125 2023-06-19 00:08:17,784 INFO [train.py:996] (2/4) Epoch 2, batch 25650, loss[loss=0.2848, simple_loss=0.337, pruned_loss=0.1163, over 21433.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3557, pruned_loss=0.1159, over 4250809.31 frames. ], batch size: 131, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:08:29,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=336870.0, ans=0.0 2023-06-19 00:08:31,778 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.010e+02 3.647e+02 4.694e+02 1.135e+03, threshold=7.294e+02, percent-clipped=6.0 2023-06-19 00:09:20,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-19 00:09:57,165 INFO [train.py:996] (2/4) Epoch 2, batch 25700, loss[loss=0.2708, simple_loss=0.3456, pruned_loss=0.09798, over 21387.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3517, pruned_loss=0.1175, over 4238329.15 frames. ], batch size: 194, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:09:59,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=337170.0, ans=0.125 2023-06-19 00:10:01,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=337170.0, ans=0.125 2023-06-19 00:10:07,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=337170.0, ans=0.125 2023-06-19 00:10:37,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=337290.0, ans=0.125 2023-06-19 00:11:01,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=337350.0, ans=0.125 2023-06-19 00:11:19,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=337350.0, ans=0.0 2023-06-19 00:11:24,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=337410.0, ans=0.125 2023-06-19 00:11:38,911 INFO [train.py:996] (2/4) Epoch 2, batch 25750, loss[loss=0.3408, simple_loss=0.3856, pruned_loss=0.148, over 21786.00 frames. ], tot_loss[loss=0.3001, simple_loss=0.3578, pruned_loss=0.1212, over 4246103.77 frames. 
], batch size: 441, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:11:54,357 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 3.020e+02 3.881e+02 5.422e+02 1.342e+03, threshold=7.762e+02, percent-clipped=9.0 2023-06-19 00:13:05,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=337710.0, ans=0.125 2023-06-19 00:13:15,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=337710.0, ans=0.2 2023-06-19 00:13:26,829 INFO [train.py:996] (2/4) Epoch 2, batch 25800, loss[loss=0.3627, simple_loss=0.4269, pruned_loss=0.1493, over 21419.00 frames. ], tot_loss[loss=0.3123, simple_loss=0.3716, pruned_loss=0.1265, over 4253070.77 frames. ], batch size: 131, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:13:50,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=337830.0, ans=0.0 2023-06-19 00:14:35,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=337950.0, ans=0.1 2023-06-19 00:14:45,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=338010.0, ans=0.125 2023-06-19 00:15:02,351 INFO [train.py:996] (2/4) Epoch 2, batch 25850, loss[loss=0.2843, simple_loss=0.3418, pruned_loss=0.1134, over 21686.00 frames. ], tot_loss[loss=0.3112, simple_loss=0.3721, pruned_loss=0.1252, over 4258856.37 frames. ], batch size: 230, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:15:22,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=338070.0, ans=0.125 2023-06-19 00:15:26,951 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 3.253e+02 3.802e+02 4.832e+02 7.273e+02, threshold=7.603e+02, percent-clipped=0.0 2023-06-19 00:15:43,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=338130.0, ans=0.04949747468305833 2023-06-19 00:15:47,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=338130.0, ans=0.1 2023-06-19 00:15:58,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=338190.0, ans=0.0 2023-06-19 00:16:09,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0 2023-06-19 00:16:25,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-19 00:16:28,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=338310.0, ans=0.0 2023-06-19 00:16:32,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=338310.0, ans=0.125 2023-06-19 00:16:33,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.95 vs. limit=6.0 2023-06-19 00:16:53,763 INFO [train.py:996] (2/4) Epoch 2, batch 25900, loss[loss=0.3032, simple_loss=0.3789, pruned_loss=0.1138, over 21424.00 frames. 
], tot_loss[loss=0.3131, simple_loss=0.3733, pruned_loss=0.1264, over 4263278.86 frames. ], batch size: 211, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:16:55,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=338370.0, ans=0.125 2023-06-19 00:17:19,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-19 00:17:25,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=338430.0, ans=0.2 2023-06-19 00:18:39,804 INFO [train.py:996] (2/4) Epoch 2, batch 25950, loss[loss=0.3816, simple_loss=0.423, pruned_loss=0.1701, over 21612.00 frames. ], tot_loss[loss=0.3198, simple_loss=0.3803, pruned_loss=0.1296, over 4271487.06 frames. ], batch size: 415, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:18:42,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-19 00:18:54,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 3.185e+02 3.771e+02 4.566e+02 7.877e+02, threshold=7.541e+02, percent-clipped=2.0 2023-06-19 00:19:07,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=338730.0, ans=0.2 2023-06-19 00:19:13,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=338790.0, ans=0.0 2023-06-19 00:20:20,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=338970.0, ans=15.0 2023-06-19 00:20:20,784 INFO [train.py:996] (2/4) Epoch 2, batch 26000, loss[loss=0.3212, simple_loss=0.3922, pruned_loss=0.1251, over 21414.00 frames. ], tot_loss[loss=0.3151, simple_loss=0.3774, pruned_loss=0.1264, over 4268179.40 frames. ], batch size: 211, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:20:22,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=338970.0, ans=0.0 2023-06-19 00:20:31,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=338970.0, ans=0.1 2023-06-19 00:21:10,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=339090.0, ans=0.2 2023-06-19 00:21:50,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=339210.0, ans=15.0 2023-06-19 00:22:00,590 INFO [train.py:996] (2/4) Epoch 2, batch 26050, loss[loss=0.2693, simple_loss=0.3185, pruned_loss=0.1101, over 21061.00 frames. ], tot_loss[loss=0.3193, simple_loss=0.3787, pruned_loss=0.13, over 4271755.83 frames. 
], batch size: 608, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:22:01,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=339270.0, ans=0.0 2023-06-19 00:22:14,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.427e+02 3.293e+02 3.871e+02 4.573e+02 8.054e+02, threshold=7.741e+02, percent-clipped=1.0 2023-06-19 00:23:25,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=339510.0, ans=0.125 2023-06-19 00:23:38,917 INFO [train.py:996] (2/4) Epoch 2, batch 26100, loss[loss=0.2959, simple_loss=0.3411, pruned_loss=0.1253, over 21912.00 frames. ], tot_loss[loss=0.3141, simple_loss=0.3713, pruned_loss=0.1284, over 4279975.30 frames. ], batch size: 283, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:23:46,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.65 vs. limit=15.0 2023-06-19 00:24:00,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=339630.0, ans=0.04949747468305833 2023-06-19 00:24:12,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.42 vs. limit=15.0 2023-06-19 00:25:15,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=339810.0, ans=0.1 2023-06-19 00:25:20,027 INFO [train.py:996] (2/4) Epoch 2, batch 26150, loss[loss=0.3111, simple_loss=0.371, pruned_loss=0.1256, over 21312.00 frames. ], tot_loss[loss=0.3147, simple_loss=0.371, pruned_loss=0.1292, over 4276630.90 frames. ], batch size: 143, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:25:30,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=339870.0, ans=0.125 2023-06-19 00:25:34,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.250e+02 3.966e+02 5.249e+02 8.349e+02, threshold=7.932e+02, percent-clipped=3.0 2023-06-19 00:25:40,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=339930.0, ans=0.125 2023-06-19 00:25:45,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.52 vs. limit=6.0 2023-06-19 00:25:55,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=339930.0, ans=0.1 2023-06-19 00:26:22,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=340050.0, ans=0.125 2023-06-19 00:26:35,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=15.0 2023-06-19 00:26:45,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=340110.0, ans=0.1 2023-06-19 00:27:00,542 INFO [train.py:996] (2/4) Epoch 2, batch 26200, loss[loss=0.2765, simple_loss=0.3685, pruned_loss=0.09224, over 21859.00 frames. ], tot_loss[loss=0.311, simple_loss=0.3708, pruned_loss=0.1257, over 4275398.98 frames. 
], batch size: 316, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:27:09,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=340170.0, ans=0.05 2023-06-19 00:27:09,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=340170.0, ans=0.05 2023-06-19 00:27:15,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=340230.0, ans=0.0 2023-06-19 00:27:16,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=12.0 2023-06-19 00:27:48,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-19 00:28:23,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=340410.0, ans=0.2 2023-06-19 00:28:29,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=340410.0, ans=0.0 2023-06-19 00:28:31,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=340410.0, ans=0.125 2023-06-19 00:28:39,147 INFO [train.py:996] (2/4) Epoch 2, batch 26250, loss[loss=0.2849, simple_loss=0.3347, pruned_loss=0.1176, over 21363.00 frames. ], tot_loss[loss=0.311, simple_loss=0.373, pruned_loss=0.1245, over 4282711.41 frames. ], batch size: 159, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:28:54,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.928e+02 2.946e+02 3.629e+02 4.371e+02 7.049e+02, threshold=7.257e+02, percent-clipped=0.0 2023-06-19 00:29:34,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=340590.0, ans=0.05 2023-06-19 00:30:08,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=340710.0, ans=0.125 2023-06-19 00:30:20,185 INFO [train.py:996] (2/4) Epoch 2, batch 26300, loss[loss=0.284, simple_loss=0.3341, pruned_loss=0.117, over 21608.00 frames. ], tot_loss[loss=0.3108, simple_loss=0.3701, pruned_loss=0.1258, over 4290876.37 frames. ], batch size: 548, lr: 1.49e-02, grad_scale: 32.0 2023-06-19 00:31:39,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-19 00:31:45,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=341010.0, ans=0.0 2023-06-19 00:32:00,730 INFO [train.py:996] (2/4) Epoch 2, batch 26350, loss[loss=0.3572, simple_loss=0.3989, pruned_loss=0.1578, over 21470.00 frames. ], tot_loss[loss=0.3115, simple_loss=0.3692, pruned_loss=0.1268, over 4293776.05 frames. ], batch size: 131, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:32:29,476 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 3.077e+02 3.703e+02 4.775e+02 7.605e+02, threshold=7.406e+02, percent-clipped=1.0 2023-06-19 00:32:30,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. 
limit=15.0 2023-06-19 00:32:42,403 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.15 vs. limit=10.0 2023-06-19 00:33:06,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=341250.0, ans=0.125 2023-06-19 00:33:30,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=341310.0, ans=0.05 2023-06-19 00:33:38,863 INFO [train.py:996] (2/4) Epoch 2, batch 26400, loss[loss=0.3185, simple_loss=0.349, pruned_loss=0.144, over 21794.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3628, pruned_loss=0.1262, over 4288888.35 frames. ], batch size: 372, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:33:41,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-06-19 00:34:14,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=341430.0, ans=0.2 2023-06-19 00:34:53,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=341550.0, ans=0.0 2023-06-19 00:35:33,198 INFO [train.py:996] (2/4) Epoch 2, batch 26450, loss[loss=0.4506, simple_loss=0.5088, pruned_loss=0.1962, over 21431.00 frames. ], tot_loss[loss=0.3067, simple_loss=0.3622, pruned_loss=0.1257, over 4284791.86 frames. ], batch size: 507, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:35:33,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=341670.0, ans=0.125 2023-06-19 00:35:58,243 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 3.117e+02 3.740e+02 5.003e+02 1.177e+03, threshold=7.481e+02, percent-clipped=6.0 2023-06-19 00:36:04,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=341730.0, ans=0.2 2023-06-19 00:36:05,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=341730.0, ans=0.125 2023-06-19 00:36:22,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=15.0 2023-06-19 00:37:04,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=341910.0, ans=0.125 2023-06-19 00:37:20,206 INFO [train.py:996] (2/4) Epoch 2, batch 26500, loss[loss=0.1704, simple_loss=0.2069, pruned_loss=0.0669, over 16337.00 frames. ], tot_loss[loss=0.306, simple_loss=0.3634, pruned_loss=0.1243, over 4272118.34 frames. ], batch size: 60, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:37:34,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=341970.0, ans=0.125 2023-06-19 00:37:37,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=341970.0, ans=0.0 2023-06-19 00:39:03,682 INFO [train.py:996] (2/4) Epoch 2, batch 26550, loss[loss=0.2119, simple_loss=0.2733, pruned_loss=0.07526, over 21225.00 frames. ], tot_loss[loss=0.2974, simple_loss=0.3575, pruned_loss=0.1186, over 4262324.31 frames. 
], batch size: 159, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:39:19,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.60 vs. limit=15.0 2023-06-19 00:39:19,868 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.448e+02 4.337e+02 5.433e+02 9.319e+02, threshold=8.673e+02, percent-clipped=7.0 2023-06-19 00:39:21,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=342330.0, ans=0.125 2023-06-19 00:39:42,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=342390.0, ans=0.125 2023-06-19 00:40:35,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=342510.0, ans=0.125 2023-06-19 00:40:42,958 INFO [train.py:996] (2/4) Epoch 2, batch 26600, loss[loss=0.3046, simple_loss=0.348, pruned_loss=0.1306, over 21272.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3565, pruned_loss=0.1155, over 4266080.46 frames. ], batch size: 131, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:41:01,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=342630.0, ans=0.95 2023-06-19 00:41:03,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.82 vs. limit=10.0 2023-06-19 00:41:51,203 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:42:03,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.76 vs. limit=10.0 2023-06-19 00:42:18,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=342810.0, ans=0.125 2023-06-19 00:42:22,727 INFO [train.py:996] (2/4) Epoch 2, batch 26650, loss[loss=0.2717, simple_loss=0.3414, pruned_loss=0.101, over 21522.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.3493, pruned_loss=0.1137, over 4269710.55 frames. ], batch size: 441, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:42:38,281 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.691e+02 3.189e+02 3.874e+02 5.287e+02 9.951e+02, threshold=7.747e+02, percent-clipped=1.0 2023-06-19 00:42:48,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=342930.0, ans=10.0 2023-06-19 00:42:49,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=342930.0, ans=0.125 2023-06-19 00:43:28,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=343050.0, ans=0.0 2023-06-19 00:43:52,010 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=2.535e-03 2023-06-19 00:43:56,113 INFO [train.py:996] (2/4) Epoch 2, batch 26700, loss[loss=0.3726, simple_loss=0.3955, pruned_loss=0.1749, over 21745.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3422, pruned_loss=0.1101, over 4269006.10 frames. 
], batch size: 508, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:44:00,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=343170.0, ans=0.0 2023-06-19 00:44:02,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=343170.0, ans=0.2 2023-06-19 00:44:36,226 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 00:44:36,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=343230.0, ans=0.0 2023-06-19 00:45:37,250 INFO [train.py:996] (2/4) Epoch 2, batch 26750, loss[loss=0.3398, simple_loss=0.4056, pruned_loss=0.137, over 21812.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3422, pruned_loss=0.1083, over 4278134.39 frames. ], batch size: 124, lr: 1.48e-02, grad_scale: 16.0 2023-06-19 00:45:58,483 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.751e+02 3.226e+02 3.870e+02 9.468e+02, threshold=6.452e+02, percent-clipped=0.0 2023-06-19 00:46:26,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=343530.0, ans=0.0 2023-06-19 00:46:41,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=22.5 2023-06-19 00:46:46,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=343650.0, ans=0.125 2023-06-19 00:46:55,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-19 00:46:56,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=343650.0, ans=0.2 2023-06-19 00:47:02,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=343710.0, ans=0.125 2023-06-19 00:47:07,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=343710.0, ans=0.125 2023-06-19 00:47:08,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.43 vs. limit=22.5 2023-06-19 00:47:13,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=343710.0, ans=0.2 2023-06-19 00:47:18,796 INFO [train.py:996] (2/4) Epoch 2, batch 26800, loss[loss=0.2902, simple_loss=0.3587, pruned_loss=0.1108, over 19984.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3515, pruned_loss=0.1149, over 4270533.27 frames. ], batch size: 703, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:47:36,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=343770.0, ans=0.025 2023-06-19 00:47:41,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.60 vs. 
limit=22.5 2023-06-19 00:48:05,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=343890.0, ans=0.1 2023-06-19 00:48:31,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=343950.0, ans=0.0 2023-06-19 00:48:42,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=344010.0, ans=0.0 2023-06-19 00:48:58,656 INFO [train.py:996] (2/4) Epoch 2, batch 26850, loss[loss=0.247, simple_loss=0.3013, pruned_loss=0.09638, over 21629.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3536, pruned_loss=0.1181, over 4265160.80 frames. ], batch size: 298, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:49:29,088 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.578e+02 3.525e+02 4.180e+02 5.123e+02 1.126e+03, threshold=8.361e+02, percent-clipped=11.0 2023-06-19 00:49:59,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=344250.0, ans=0.125 2023-06-19 00:49:59,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=344250.0, ans=0.0 2023-06-19 00:50:01,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-19 00:50:37,557 INFO [train.py:996] (2/4) Epoch 2, batch 26900, loss[loss=0.2815, simple_loss=0.3209, pruned_loss=0.1211, over 21141.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3444, pruned_loss=0.1163, over 4261780.76 frames. ], batch size: 143, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:50:58,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=344370.0, ans=0.125 2023-06-19 00:51:25,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=344490.0, ans=0.2 2023-06-19 00:51:58,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=344610.0, ans=0.0 2023-06-19 00:52:12,728 INFO [train.py:996] (2/4) Epoch 2, batch 26950, loss[loss=0.2842, simple_loss=0.33, pruned_loss=0.1192, over 21761.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3434, pruned_loss=0.1168, over 4265143.95 frames. ], batch size: 112, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:52:21,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-19 00:52:43,528 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.137e+02 3.675e+02 4.908e+02 1.022e+03, threshold=7.351e+02, percent-clipped=1.0 2023-06-19 00:53:03,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=344790.0, ans=0.0 2023-06-19 00:54:08,195 INFO [train.py:996] (2/4) Epoch 2, batch 27000, loss[loss=0.2811, simple_loss=0.3766, pruned_loss=0.09284, over 21135.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3423, pruned_loss=0.1133, over 4264587.11 frames. 
], batch size: 548, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:54:08,196 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 00:54:22,361 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.2130, 2.8117, 1.7119, 1.4530], device='cuda:2') 2023-06-19 00:54:25,476 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2623, simple_loss=0.361, pruned_loss=0.08186, over 1796401.00 frames. 2023-06-19 00:54:25,477 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 00:54:44,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=345030.0, ans=0.1 2023-06-19 00:55:01,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=345090.0, ans=0.125 2023-06-19 00:55:23,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=345150.0, ans=0.0 2023-06-19 00:56:06,865 INFO [train.py:996] (2/4) Epoch 2, batch 27050, loss[loss=0.291, simple_loss=0.3602, pruned_loss=0.1109, over 21923.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3437, pruned_loss=0.1083, over 4257734.66 frames. ], batch size: 372, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:56:13,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=345270.0, ans=0.2 2023-06-19 00:56:18,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=345270.0, ans=0.125 2023-06-19 00:56:23,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.868e+02 3.454e+02 4.544e+02 1.088e+03, threshold=6.909e+02, percent-clipped=2.0 2023-06-19 00:56:35,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=345330.0, ans=0.0 2023-06-19 00:57:24,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=345450.0, ans=0.125 2023-06-19 00:57:43,864 INFO [train.py:996] (2/4) Epoch 2, batch 27100, loss[loss=0.2926, simple_loss=0.3394, pruned_loss=0.1229, over 21677.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3477, pruned_loss=0.1125, over 4272915.08 frames. ], batch size: 263, lr: 1.48e-02, grad_scale: 32.0 2023-06-19 00:57:46,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=345570.0, ans=0.0 2023-06-19 00:57:58,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=345570.0, ans=0.125 2023-06-19 00:58:02,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-19 00:58:36,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=345750.0, ans=0.125 2023-06-19 00:58:59,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-19 00:59:03,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. 
limit=15.0 2023-06-19 00:59:14,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.26 vs. limit=22.5 2023-06-19 00:59:20,187 INFO [train.py:996] (2/4) Epoch 2, batch 27150, loss[loss=0.3139, simple_loss=0.3892, pruned_loss=0.1193, over 21638.00 frames. ], tot_loss[loss=0.2955, simple_loss=0.3595, pruned_loss=0.1157, over 4269518.54 frames. ], batch size: 263, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 00:59:36,362 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.470e+02 4.217e+02 5.291e+02 1.062e+03, threshold=8.433e+02, percent-clipped=9.0 2023-06-19 00:59:48,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=345930.0, ans=0.0 2023-06-19 01:00:04,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=345990.0, ans=0.0 2023-06-19 01:00:33,630 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:00:51,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=346110.0, ans=0.125 2023-06-19 01:00:55,741 INFO [train.py:996] (2/4) Epoch 2, batch 27200, loss[loss=0.3388, simple_loss=0.3907, pruned_loss=0.1434, over 21773.00 frames. ], tot_loss[loss=0.3052, simple_loss=0.37, pruned_loss=0.1202, over 4276584.15 frames. ], batch size: 247, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:01:10,543 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.42 vs. limit=10.0 2023-06-19 01:01:47,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=346290.0, ans=0.0 2023-06-19 01:02:39,329 INFO [train.py:996] (2/4) Epoch 2, batch 27250, loss[loss=0.2565, simple_loss=0.2878, pruned_loss=0.1126, over 20089.00 frames. ], tot_loss[loss=0.3124, simple_loss=0.3735, pruned_loss=0.1256, over 4275860.83 frames. ], batch size: 704, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:02:42,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=346470.0, ans=0.04949747468305833 2023-06-19 01:02:49,296 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:02:55,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=346470.0, ans=0.0 2023-06-19 01:02:59,706 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 3.070e+02 3.624e+02 4.371e+02 7.633e+02, threshold=7.247e+02, percent-clipped=0.0 2023-06-19 01:03:19,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=346530.0, ans=0.2 2023-06-19 01:03:23,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.61 vs. 
limit=12.0 2023-06-19 01:03:25,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=346590.0, ans=0.0 2023-06-19 01:03:31,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=346590.0, ans=0.2 2023-06-19 01:03:39,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=346590.0, ans=0.125 2023-06-19 01:03:55,566 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:03:59,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-19 01:04:21,670 INFO [train.py:996] (2/4) Epoch 2, batch 27300, loss[loss=0.2879, simple_loss=0.3612, pruned_loss=0.1073, over 21822.00 frames. ], tot_loss[loss=0.3136, simple_loss=0.3753, pruned_loss=0.126, over 4278239.31 frames. ], batch size: 282, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:04:49,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=346770.0, ans=0.125 2023-06-19 01:06:09,262 INFO [train.py:996] (2/4) Epoch 2, batch 27350, loss[loss=0.3742, simple_loss=0.4164, pruned_loss=0.166, over 21511.00 frames. ], tot_loss[loss=0.3161, simple_loss=0.3778, pruned_loss=0.1272, over 4279020.76 frames. ], batch size: 507, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:06:18,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=347070.0, ans=0.2 2023-06-19 01:06:35,166 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.251e+02 3.372e+02 3.927e+02 4.716e+02 8.245e+02, threshold=7.854e+02, percent-clipped=1.0 2023-06-19 01:06:37,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=347130.0, ans=0.0 2023-06-19 01:07:23,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=347310.0, ans=0.2 2023-06-19 01:07:36,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.35 vs. limit=12.0 2023-06-19 01:07:48,548 INFO [train.py:996] (2/4) Epoch 2, batch 27400, loss[loss=0.2931, simple_loss=0.3359, pruned_loss=0.1251, over 21595.00 frames. ], tot_loss[loss=0.3133, simple_loss=0.3734, pruned_loss=0.1266, over 4284209.37 frames. ], batch size: 263, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:08:16,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-19 01:08:18,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.50 vs. 
limit=6.0 2023-06-19 01:08:49,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=347550.0, ans=0.125 2023-06-19 01:09:03,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=347610.0, ans=0.0 2023-06-19 01:09:21,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=347610.0, ans=0.2 2023-06-19 01:09:29,493 INFO [train.py:996] (2/4) Epoch 2, batch 27450, loss[loss=0.2961, simple_loss=0.3613, pruned_loss=0.1154, over 21682.00 frames. ], tot_loss[loss=0.3078, simple_loss=0.3665, pruned_loss=0.1246, over 4285442.22 frames. ], batch size: 247, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:09:33,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=347670.0, ans=0.2 2023-06-19 01:09:45,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.415e+02 3.960e+02 5.232e+02 1.053e+03, threshold=7.919e+02, percent-clipped=3.0 2023-06-19 01:10:25,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=347850.0, ans=0.0 2023-06-19 01:10:33,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=347850.0, ans=0.0 2023-06-19 01:10:35,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=347850.0, ans=0.5 2023-06-19 01:10:36,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=347910.0, ans=0.125 2023-06-19 01:11:08,112 INFO [train.py:996] (2/4) Epoch 2, batch 27500, loss[loss=0.2585, simple_loss=0.3113, pruned_loss=0.1028, over 21148.00 frames. ], tot_loss[loss=0.3081, simple_loss=0.3658, pruned_loss=0.1252, over 4293181.52 frames. ], batch size: 608, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:11:16,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=347970.0, ans=0.2 2023-06-19 01:11:42,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=348090.0, ans=0.125 2023-06-19 01:11:47,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=348090.0, ans=0.0 2023-06-19 01:11:55,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=348150.0, ans=0.2 2023-06-19 01:12:19,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=348210.0, ans=0.125 2023-06-19 01:12:47,548 INFO [train.py:996] (2/4) Epoch 2, batch 27550, loss[loss=0.4522, simple_loss=0.5126, pruned_loss=0.1959, over 19984.00 frames. ], tot_loss[loss=0.302, simple_loss=0.3608, pruned_loss=0.1216, over 4289672.25 frames. 
], batch size: 702, lr: 1.47e-02, grad_scale: 16.0 2023-06-19 01:13:05,041 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.210e+02 3.897e+02 4.749e+02 7.014e+02, threshold=7.795e+02, percent-clipped=0.0 2023-06-19 01:13:25,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=348390.0, ans=0.2 2023-06-19 01:14:29,344 INFO [train.py:996] (2/4) Epoch 2, batch 27600, loss[loss=0.3041, simple_loss=0.3451, pruned_loss=0.1315, over 21817.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3536, pruned_loss=0.1204, over 4282384.71 frames. ], batch size: 98, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:14:49,181 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.89 vs. limit=6.0 2023-06-19 01:15:02,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=348690.0, ans=0.125 2023-06-19 01:15:12,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=348690.0, ans=0.125 2023-06-19 01:15:14,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=348690.0, ans=0.125 2023-06-19 01:15:22,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=348750.0, ans=10.0 2023-06-19 01:15:41,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=348810.0, ans=0.0 2023-06-19 01:15:51,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=348810.0, ans=0.2 2023-06-19 01:15:51,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=348810.0, ans=0.0 2023-06-19 01:16:03,378 INFO [train.py:996] (2/4) Epoch 2, batch 27650, loss[loss=0.2745, simple_loss=0.3351, pruned_loss=0.107, over 21804.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.3463, pruned_loss=0.1187, over 4277531.30 frames. ], batch size: 124, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:16:25,635 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.734e+02 3.535e+02 4.526e+02 5.756e+02 1.207e+03, threshold=9.051e+02, percent-clipped=8.0 2023-06-19 01:16:34,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=348930.0, ans=0.125 2023-06-19 01:16:42,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=348990.0, ans=0.2 2023-06-19 01:17:12,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=349050.0, ans=0.0 2023-06-19 01:17:14,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=349050.0, ans=0.5 2023-06-19 01:17:48,132 INFO [train.py:996] (2/4) Epoch 2, batch 27700, loss[loss=0.2642, simple_loss=0.3246, pruned_loss=0.1018, over 21176.00 frames. ], tot_loss[loss=0.2889, simple_loss=0.3457, pruned_loss=0.116, over 4280131.38 frames. 
], batch size: 143, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:17:51,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=349170.0, ans=0.025 2023-06-19 01:17:55,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=349170.0, ans=0.0 2023-06-19 01:17:58,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=349170.0, ans=0.0 2023-06-19 01:18:03,008 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.38 vs. limit=10.0 2023-06-19 01:18:08,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=349230.0, ans=0.125 2023-06-19 01:18:15,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=349230.0, ans=0.125 2023-06-19 01:18:25,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=22.5 2023-06-19 01:19:27,732 INFO [train.py:996] (2/4) Epoch 2, batch 27750, loss[loss=0.2672, simple_loss=0.3068, pruned_loss=0.1137, over 20214.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3462, pruned_loss=0.1145, over 4280348.42 frames. ], batch size: 703, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:19:28,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=349470.0, ans=0.1 2023-06-19 01:19:37,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=349470.0, ans=0.125 2023-06-19 01:19:38,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=349470.0, ans=0.125 2023-06-19 01:19:45,256 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 3.361e+02 3.964e+02 5.094e+02 9.268e+02, threshold=7.928e+02, percent-clipped=1.0 2023-06-19 01:19:46,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=349530.0, ans=0.0 2023-06-19 01:20:19,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=349650.0, ans=0.125 2023-06-19 01:20:32,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=349650.0, ans=0.09899494936611666 2023-06-19 01:21:06,430 INFO [train.py:996] (2/4) Epoch 2, batch 27800, loss[loss=0.2941, simple_loss=0.3541, pruned_loss=0.1171, over 21896.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3469, pruned_loss=0.1157, over 4284187.37 frames. ], batch size: 118, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:21:13,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=349770.0, ans=0.2 2023-06-19 01:21:16,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. 
limit=15.0 2023-06-19 01:22:08,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=349950.0, ans=0.125 2023-06-19 01:22:09,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=349950.0, ans=0.0 2023-06-19 01:22:47,549 INFO [train.py:996] (2/4) Epoch 2, batch 27850, loss[loss=0.2998, simple_loss=0.3433, pruned_loss=0.1281, over 21711.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3471, pruned_loss=0.1173, over 4294649.92 frames. ], batch size: 230, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:23:06,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.389e+02 4.361e+02 6.049e+02 9.596e+02, threshold=8.723e+02, percent-clipped=7.0 2023-06-19 01:24:04,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=350250.0, ans=0.125 2023-06-19 01:24:30,954 INFO [train.py:996] (2/4) Epoch 2, batch 27900, loss[loss=0.3781, simple_loss=0.442, pruned_loss=0.1571, over 21482.00 frames. ], tot_loss[loss=0.2986, simple_loss=0.358, pruned_loss=0.1196, over 4287781.94 frames. ], batch size: 471, lr: 1.47e-02, grad_scale: 32.0 2023-06-19 01:24:34,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=350370.0, ans=0.0 2023-06-19 01:26:13,835 INFO [train.py:996] (2/4) Epoch 2, batch 27950, loss[loss=0.2773, simple_loss=0.3382, pruned_loss=0.1081, over 21830.00 frames. ], tot_loss[loss=0.2944, simple_loss=0.3576, pruned_loss=0.1156, over 4282516.38 frames. ], batch size: 118, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:26:32,400 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.236e+02 3.896e+02 4.908e+02 8.483e+02, threshold=7.791e+02, percent-clipped=0.0 2023-06-19 01:27:53,668 INFO [train.py:996] (2/4) Epoch 2, batch 28000, loss[loss=0.2852, simple_loss=0.3619, pruned_loss=0.1043, over 21597.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3551, pruned_loss=0.1129, over 4281876.98 frames. ], batch size: 471, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:28:29,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-19 01:28:50,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=351090.0, ans=0.2 2023-06-19 01:28:54,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-19 01:29:15,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.70 vs. limit=15.0 2023-06-19 01:29:21,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.71 vs. limit=22.5 2023-06-19 01:29:31,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2023-06-19 01:29:35,433 INFO [train.py:996] (2/4) Epoch 2, batch 28050, loss[loss=0.2983, simple_loss=0.3446, pruned_loss=0.126, over 20148.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3527, pruned_loss=0.1146, over 4285059.76 frames. 
], batch size: 703, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:29:57,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.799e+02 3.165e+02 3.817e+02 7.021e+02, threshold=6.330e+02, percent-clipped=0.0 2023-06-19 01:30:20,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=351390.0, ans=0.2 2023-06-19 01:30:29,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=351390.0, ans=0.125 2023-06-19 01:30:41,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=351390.0, ans=15.0 2023-06-19 01:30:56,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=351450.0, ans=0.0 2023-06-19 01:31:01,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=351510.0, ans=0.1 2023-06-19 01:31:15,449 INFO [train.py:996] (2/4) Epoch 2, batch 28100, loss[loss=0.2625, simple_loss=0.3175, pruned_loss=0.1037, over 21990.00 frames. ], tot_loss[loss=0.29, simple_loss=0.351, pruned_loss=0.1145, over 4271734.14 frames. ], batch size: 103, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:32:02,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=351690.0, ans=0.2 2023-06-19 01:32:11,353 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.56 vs. limit=6.0 2023-06-19 01:32:12,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=351690.0, ans=0.125 2023-06-19 01:32:31,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=351750.0, ans=0.125 2023-06-19 01:32:37,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-19 01:32:54,277 INFO [train.py:996] (2/4) Epoch 2, batch 28150, loss[loss=0.2243, simple_loss=0.2687, pruned_loss=0.08999, over 21251.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3429, pruned_loss=0.1147, over 4268523.36 frames. ], batch size: 551, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:33:04,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=351870.0, ans=0.2 2023-06-19 01:33:07,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=351870.0, ans=0.2 2023-06-19 01:33:09,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. 
limit=15.0 2023-06-19 01:33:11,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.407e+02 3.356e+02 3.949e+02 5.361e+02 1.113e+03, threshold=7.898e+02, percent-clipped=11.0 2023-06-19 01:33:26,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=351930.0, ans=0.125 2023-06-19 01:33:49,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=351990.0, ans=0.125 2023-06-19 01:34:07,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-19 01:34:16,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.88 vs. limit=15.0 2023-06-19 01:34:29,911 INFO [train.py:996] (2/4) Epoch 2, batch 28200, loss[loss=0.3352, simple_loss=0.3718, pruned_loss=0.1493, over 21490.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3409, pruned_loss=0.1171, over 4265888.67 frames. ], batch size: 194, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:34:50,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.45 vs. limit=10.0 2023-06-19 01:35:11,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=352230.0, ans=0.125 2023-06-19 01:35:14,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=352230.0, ans=0.0 2023-06-19 01:35:14,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-06-19 01:35:32,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=352350.0, ans=0.1 2023-06-19 01:36:10,836 INFO [train.py:996] (2/4) Epoch 2, batch 28250, loss[loss=0.2959, simple_loss=0.3389, pruned_loss=0.1265, over 21774.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3468, pruned_loss=0.1202, over 4261981.37 frames. ], batch size: 317, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:36:38,454 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.575e+02 3.660e+02 4.283e+02 5.277e+02 9.711e+02, threshold=8.566e+02, percent-clipped=2.0 2023-06-19 01:36:50,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.13 vs. limit=10.0 2023-06-19 01:36:52,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-19 01:37:26,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=352650.0, ans=0.0 2023-06-19 01:37:39,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=352710.0, ans=0.07 2023-06-19 01:37:51,599 INFO [train.py:996] (2/4) Epoch 2, batch 28300, loss[loss=0.2547, simple_loss=0.3446, pruned_loss=0.0824, over 21627.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3452, pruned_loss=0.1172, over 4251989.52 frames. 
], batch size: 441, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:37:57,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=352770.0, ans=0.95 2023-06-19 01:38:03,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=352770.0, ans=0.0 2023-06-19 01:38:38,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=352890.0, ans=0.035 2023-06-19 01:38:47,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=352890.0, ans=0.125 2023-06-19 01:39:44,110 INFO [train.py:996] (2/4) Epoch 2, batch 28350, loss[loss=0.2502, simple_loss=0.3248, pruned_loss=0.08783, over 21660.00 frames. ], tot_loss[loss=0.2815, simple_loss=0.3423, pruned_loss=0.1104, over 4259603.55 frames. ], batch size: 298, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:40:07,600 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.865e+02 3.652e+02 5.364e+02 1.153e+03, threshold=7.304e+02, percent-clipped=2.0 2023-06-19 01:40:08,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=353130.0, ans=0.1 2023-06-19 01:40:30,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=353190.0, ans=0.125 2023-06-19 01:41:07,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=353310.0, ans=0.2 2023-06-19 01:41:30,027 INFO [train.py:996] (2/4) Epoch 2, batch 28400, loss[loss=0.2688, simple_loss=0.3188, pruned_loss=0.1094, over 21755.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3373, pruned_loss=0.1103, over 4259465.12 frames. ], batch size: 282, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:41:42,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=353370.0, ans=0.125 2023-06-19 01:41:45,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=353430.0, ans=0.0 2023-06-19 01:42:01,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=353490.0, ans=0.125 2023-06-19 01:42:03,162 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:42:55,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=353610.0, ans=0.0 2023-06-19 01:43:08,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=353670.0, ans=0.04949747468305833 2023-06-19 01:43:09,877 INFO [train.py:996] (2/4) Epoch 2, batch 28450, loss[loss=0.3344, simple_loss=0.3844, pruned_loss=0.1422, over 21467.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3447, pruned_loss=0.1152, over 4253990.47 frames. 
], batch size: 131, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:43:10,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=353670.0, ans=0.0 2023-06-19 01:43:14,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=353670.0, ans=0.2 2023-06-19 01:43:27,702 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.368e+02 4.115e+02 5.202e+02 1.060e+03, threshold=8.231e+02, percent-clipped=7.0 2023-06-19 01:43:54,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=353790.0, ans=0.2 2023-06-19 01:44:17,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=353850.0, ans=0.0 2023-06-19 01:44:20,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=353850.0, ans=0.04949747468305833 2023-06-19 01:44:22,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=22.5 2023-06-19 01:44:42,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=353910.0, ans=0.1 2023-06-19 01:44:46,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=353910.0, ans=0.125 2023-06-19 01:44:50,982 INFO [train.py:996] (2/4) Epoch 2, batch 28500, loss[loss=0.3132, simple_loss=0.3606, pruned_loss=0.1329, over 21493.00 frames. ], tot_loss[loss=0.2918, simple_loss=0.347, pruned_loss=0.1183, over 4265398.87 frames. ], batch size: 548, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:46:19,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=354210.0, ans=0.125 2023-06-19 01:46:32,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=354210.0, ans=0.125 2023-06-19 01:46:34,629 INFO [train.py:996] (2/4) Epoch 2, batch 28550, loss[loss=0.2825, simple_loss=0.3746, pruned_loss=0.09524, over 21623.00 frames. ], tot_loss[loss=0.2979, simple_loss=0.354, pruned_loss=0.1209, over 4267900.81 frames. ], batch size: 263, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:46:36,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354270.0, ans=0.1 2023-06-19 01:46:51,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=354330.0, ans=0.125 2023-06-19 01:46:52,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 3.021e+02 3.809e+02 4.877e+02 1.502e+03, threshold=7.617e+02, percent-clipped=6.0 2023-06-19 01:47:11,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=354390.0, ans=0.125 2023-06-19 01:48:04,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=354510.0, ans=0.125 2023-06-19 01:48:17,828 INFO [train.py:996] (2/4) Epoch 2, batch 28600, loss[loss=0.3611, simple_loss=0.4037, pruned_loss=0.1593, over 21416.00 frames. ], tot_loss[loss=0.3049, simple_loss=0.3618, pruned_loss=0.124, over 4268211.48 frames. 
], batch size: 471, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:48:48,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=15.0 2023-06-19 01:48:57,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=354630.0, ans=0.125 2023-06-19 01:49:58,553 INFO [train.py:996] (2/4) Epoch 2, batch 28650, loss[loss=0.2635, simple_loss=0.3152, pruned_loss=0.1059, over 21560.00 frames. ], tot_loss[loss=0.3006, simple_loss=0.3555, pruned_loss=0.1228, over 4266905.84 frames. ], batch size: 263, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:50:21,122 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.376e+02 3.990e+02 4.916e+02 8.510e+02, threshold=7.981e+02, percent-clipped=3.0 2023-06-19 01:50:24,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=354930.0, ans=0.1 2023-06-19 01:50:50,172 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-19 01:51:13,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=355050.0, ans=0.125 2023-06-19 01:51:31,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=355110.0, ans=0.0 2023-06-19 01:51:37,527 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 01:51:38,481 INFO [train.py:996] (2/4) Epoch 2, batch 28700, loss[loss=0.3012, simple_loss=0.3512, pruned_loss=0.1256, over 21244.00 frames. ], tot_loss[loss=0.3011, simple_loss=0.3546, pruned_loss=0.1238, over 4274943.97 frames. ], batch size: 176, lr: 1.46e-02, grad_scale: 32.0 2023-06-19 01:52:40,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-19 01:53:03,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.51 vs. limit=10.0 2023-06-19 01:53:18,233 INFO [train.py:996] (2/4) Epoch 2, batch 28750, loss[loss=0.3502, simple_loss=0.437, pruned_loss=0.1317, over 19819.00 frames. ], tot_loss[loss=0.303, simple_loss=0.3561, pruned_loss=0.125, over 4279729.94 frames. ], batch size: 703, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:53:35,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=355470.0, ans=0.125 2023-06-19 01:53:46,509 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 3.065e+02 3.653e+02 4.286e+02 6.736e+02, threshold=7.306e+02, percent-clipped=0.0 2023-06-19 01:54:35,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=355650.0, ans=0.125 2023-06-19 01:54:38,938 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.20 vs. 
limit=6.0 2023-06-19 01:54:46,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=355710.0, ans=0.125 2023-06-19 01:54:53,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=355710.0, ans=0.2 2023-06-19 01:54:56,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=355710.0, ans=0.125 2023-06-19 01:54:58,841 INFO [train.py:996] (2/4) Epoch 2, batch 28800, loss[loss=0.3483, simple_loss=0.3983, pruned_loss=0.1492, over 21375.00 frames. ], tot_loss[loss=0.3066, simple_loss=0.3611, pruned_loss=0.1261, over 4282329.98 frames. ], batch size: 159, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:55:37,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-19 01:55:48,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=355890.0, ans=0.2 2023-06-19 01:55:50,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=355890.0, ans=0.1 2023-06-19 01:56:24,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=356010.0, ans=0.0 2023-06-19 01:56:24,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=356010.0, ans=0.2 2023-06-19 01:56:40,234 INFO [train.py:996] (2/4) Epoch 2, batch 28850, loss[loss=0.337, simple_loss=0.3754, pruned_loss=0.1492, over 21526.00 frames. ], tot_loss[loss=0.3088, simple_loss=0.3625, pruned_loss=0.1275, over 4287399.70 frames. ], batch size: 548, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:56:56,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=356070.0, ans=0.125 2023-06-19 01:57:12,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.103e+02 3.762e+02 4.499e+02 8.286e+02, threshold=7.524e+02, percent-clipped=2.0 2023-06-19 01:57:36,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=356190.0, ans=0.125 2023-06-19 01:57:43,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.92 vs. limit=22.5 2023-06-19 01:58:16,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-19 01:58:26,601 INFO [train.py:996] (2/4) Epoch 2, batch 28900, loss[loss=0.3111, simple_loss=0.3616, pruned_loss=0.1303, over 21647.00 frames. ], tot_loss[loss=0.3125, simple_loss=0.3651, pruned_loss=0.13, over 4282965.14 frames. ], batch size: 230, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 01:58:49,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.57 vs. 
limit=15.0 2023-06-19 01:59:02,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=356430.0, ans=0.2 2023-06-19 01:59:12,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=356490.0, ans=0.1 2023-06-19 02:00:19,033 INFO [train.py:996] (2/4) Epoch 2, batch 28950, loss[loss=0.3196, simple_loss=0.4003, pruned_loss=0.1195, over 21541.00 frames. ], tot_loss[loss=0.3079, simple_loss=0.3623, pruned_loss=0.1267, over 4281395.69 frames. ], batch size: 471, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:00:37,079 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.281e+02 4.028e+02 5.318e+02 1.006e+03, threshold=8.055e+02, percent-clipped=4.0 2023-06-19 02:01:09,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=356790.0, ans=0.1 2023-06-19 02:02:01,667 INFO [train.py:996] (2/4) Epoch 2, batch 29000, loss[loss=0.3271, simple_loss=0.3874, pruned_loss=0.1334, over 21373.00 frames. ], tot_loss[loss=0.3077, simple_loss=0.3647, pruned_loss=0.1253, over 4277949.99 frames. ], batch size: 143, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:02:07,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=356970.0, ans=0.0 2023-06-19 02:02:12,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=356970.0, ans=0.125 2023-06-19 02:03:02,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=357150.0, ans=0.125 2023-06-19 02:03:11,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=357150.0, ans=10.0 2023-06-19 02:03:38,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=22.5 2023-06-19 02:03:39,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=357210.0, ans=0.125 2023-06-19 02:03:44,471 INFO [train.py:996] (2/4) Epoch 2, batch 29050, loss[loss=0.3051, simple_loss=0.3543, pruned_loss=0.1279, over 21310.00 frames. ], tot_loss[loss=0.3073, simple_loss=0.363, pruned_loss=0.1258, over 4278930.70 frames. 
], batch size: 176, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:03:57,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=357270.0, ans=0.2 2023-06-19 02:03:59,504 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:04:01,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=357330.0, ans=0.0 2023-06-19 02:04:02,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.182e+02 3.653e+02 4.375e+02 6.472e+02, threshold=7.306e+02, percent-clipped=0.0 2023-06-19 02:04:08,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=357330.0, ans=0.2 2023-06-19 02:04:28,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=357390.0, ans=0.0 2023-06-19 02:05:25,358 INFO [train.py:996] (2/4) Epoch 2, batch 29100, loss[loss=0.2708, simple_loss=0.3139, pruned_loss=0.1139, over 21499.00 frames. ], tot_loss[loss=0.2992, simple_loss=0.3532, pruned_loss=0.1226, over 4279417.17 frames. ], batch size: 441, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:05:42,040 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:05:51,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=357630.0, ans=0.0 2023-06-19 02:06:06,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=357690.0, ans=0.125 2023-06-19 02:06:26,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=357750.0, ans=0.95 2023-06-19 02:07:04,300 INFO [train.py:996] (2/4) Epoch 2, batch 29150, loss[loss=0.2622, simple_loss=0.3132, pruned_loss=0.1057, over 21710.00 frames. ], tot_loss[loss=0.2969, simple_loss=0.3532, pruned_loss=0.1203, over 4281148.99 frames. ], batch size: 124, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:07:21,749 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 3.617e+02 4.258e+02 5.180e+02 9.047e+02, threshold=8.516e+02, percent-clipped=9.0 2023-06-19 02:07:59,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=357990.0, ans=0.125 2023-06-19 02:08:08,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-19 02:08:20,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=358050.0, ans=0.1 2023-06-19 02:08:33,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=358110.0, ans=0.125 2023-06-19 02:08:44,219 INFO [train.py:996] (2/4) Epoch 2, batch 29200, loss[loss=0.3136, simple_loss=0.3632, pruned_loss=0.132, over 21418.00 frames. ], tot_loss[loss=0.2939, simple_loss=0.3492, pruned_loss=0.1192, over 4282749.62 frames. 
], batch size: 473, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:09:58,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=358350.0, ans=0.125 2023-06-19 02:10:06,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=358410.0, ans=0.125 2023-06-19 02:10:24,012 INFO [train.py:996] (2/4) Epoch 2, batch 29250, loss[loss=0.2323, simple_loss=0.2983, pruned_loss=0.08313, over 21204.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3458, pruned_loss=0.1151, over 4279690.44 frames. ], batch size: 176, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:10:32,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=358470.0, ans=0.1 2023-06-19 02:10:46,915 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.694e+02 3.473e+02 5.021e+02 8.866e+02, threshold=6.946e+02, percent-clipped=1.0 2023-06-19 02:11:03,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=358530.0, ans=0.125 2023-06-19 02:11:22,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=358590.0, ans=0.2 2023-06-19 02:11:40,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=358650.0, ans=0.125 2023-06-19 02:12:04,079 INFO [train.py:996] (2/4) Epoch 2, batch 29300, loss[loss=0.2789, simple_loss=0.3275, pruned_loss=0.1151, over 21817.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3481, pruned_loss=0.1149, over 4269405.87 frames. ], batch size: 317, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:12:23,835 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:13:08,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-19 02:13:44,854 INFO [train.py:996] (2/4) Epoch 2, batch 29350, loss[loss=0.381, simple_loss=0.4264, pruned_loss=0.1678, over 21494.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3455, pruned_loss=0.1147, over 4276266.68 frames. 
], batch size: 509, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:14:03,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=359070.0, ans=0.2 2023-06-19 02:14:07,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=359130.0, ans=0.1 2023-06-19 02:14:13,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.309e+02 2.964e+02 3.404e+02 4.114e+02 7.296e+02, threshold=6.809e+02, percent-clipped=1.0 2023-06-19 02:14:20,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=359130.0, ans=0.125 2023-06-19 02:14:43,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=359190.0, ans=0.125 2023-06-19 02:14:51,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=359250.0, ans=0.125 2023-06-19 02:14:55,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-19 02:15:14,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=359310.0, ans=0.125 2023-06-19 02:15:14,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=359310.0, ans=0.125 2023-06-19 02:15:16,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-19 02:15:22,010 INFO [train.py:996] (2/4) Epoch 2, batch 29400, loss[loss=0.2761, simple_loss=0.3599, pruned_loss=0.0962, over 21190.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3454, pruned_loss=0.1121, over 4283620.78 frames. ], batch size: 548, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:15:22,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=359370.0, ans=0.125 2023-06-19 02:16:24,037 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.64 vs. limit=15.0 2023-06-19 02:17:03,081 INFO [train.py:996] (2/4) Epoch 2, batch 29450, loss[loss=0.3091, simple_loss=0.3645, pruned_loss=0.1268, over 21590.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3431, pruned_loss=0.1107, over 4280064.99 frames. ], batch size: 263, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:17:25,753 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.078e+02 3.690e+02 4.615e+02 7.103e+02, threshold=7.380e+02, percent-clipped=1.0 2023-06-19 02:18:08,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.29 vs. limit=6.0 2023-06-19 02:18:37,437 INFO [train.py:996] (2/4) Epoch 2, batch 29500, loss[loss=0.3295, simple_loss=0.3676, pruned_loss=0.1457, over 21866.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3495, pruned_loss=0.1161, over 4280804.05 frames. 
], batch size: 371, lr: 1.45e-02, grad_scale: 32.0 2023-06-19 02:18:53,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=359970.0, ans=0.125 2023-06-19 02:19:34,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=360090.0, ans=10.0 2023-06-19 02:19:48,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=360150.0, ans=0.0 2023-06-19 02:19:53,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=360150.0, ans=0.125 2023-06-19 02:20:03,461 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-19 02:20:07,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=360210.0, ans=0.125 2023-06-19 02:20:14,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=360210.0, ans=0.125 2023-06-19 02:20:17,243 INFO [train.py:996] (2/4) Epoch 2, batch 29550, loss[loss=0.282, simple_loss=0.3373, pruned_loss=0.1134, over 21322.00 frames. ], tot_loss[loss=0.2916, simple_loss=0.3483, pruned_loss=0.1174, over 4290534.81 frames. ], batch size: 176, lr: 1.45e-02, grad_scale: 64.0 2023-06-19 02:20:49,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 3.148e+02 3.536e+02 4.853e+02 9.360e+02, threshold=7.072e+02, percent-clipped=2.0 2023-06-19 02:21:18,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=360390.0, ans=0.0 2023-06-19 02:22:10,023 INFO [train.py:996] (2/4) Epoch 2, batch 29600, loss[loss=0.3387, simple_loss=0.4065, pruned_loss=0.1354, over 21637.00 frames. ], tot_loss[loss=0.2998, simple_loss=0.3567, pruned_loss=0.1214, over 4286870.30 frames. ], batch size: 389, lr: 1.44e-02, grad_scale: 64.0 2023-06-19 02:22:12,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.96 vs. limit=15.0 2023-06-19 02:22:12,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-06-19 02:22:27,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=360570.0, ans=0.0 2023-06-19 02:22:39,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=360630.0, ans=0.0 2023-06-19 02:23:01,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=360750.0, ans=0.0 2023-06-19 02:23:17,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=360810.0, ans=0.0 2023-06-19 02:23:34,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=360810.0, ans=0.125 2023-06-19 02:23:43,640 INFO [train.py:996] (2/4) Epoch 2, batch 29650, loss[loss=0.2628, simple_loss=0.3264, pruned_loss=0.09962, over 21839.00 frames. 
], tot_loss[loss=0.2931, simple_loss=0.3527, pruned_loss=0.1168, over 4284282.44 frames. ], batch size: 371, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:23:47,409 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:24:07,297 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 3.035e+02 3.587e+02 4.924e+02 8.544e+02, threshold=7.175e+02, percent-clipped=8.0 2023-06-19 02:25:23,760 INFO [train.py:996] (2/4) Epoch 2, batch 29700, loss[loss=0.4483, simple_loss=0.4867, pruned_loss=0.205, over 21565.00 frames. ], tot_loss[loss=0.2959, simple_loss=0.3558, pruned_loss=0.118, over 4290750.22 frames. ], batch size: 507, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:25:24,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=361170.0, ans=0.125 2023-06-19 02:25:40,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=361170.0, ans=0.0 2023-06-19 02:26:13,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=361290.0, ans=0.0 2023-06-19 02:26:14,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-19 02:26:18,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=361350.0, ans=0.0 2023-06-19 02:26:26,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.75 vs. limit=22.5 2023-06-19 02:26:37,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-19 02:26:39,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=361410.0, ans=0.0 2023-06-19 02:26:49,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=361410.0, ans=0.125 2023-06-19 02:27:03,542 INFO [train.py:996] (2/4) Epoch 2, batch 29750, loss[loss=0.3082, simple_loss=0.3826, pruned_loss=0.1169, over 21714.00 frames. ], tot_loss[loss=0.2984, simple_loss=0.3613, pruned_loss=0.1177, over 4291078.00 frames. ], batch size: 298, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:27:07,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=361470.0, ans=0.015 2023-06-19 02:27:07,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=361470.0, ans=0.125 2023-06-19 02:27:27,926 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.988e+02 3.188e+02 3.972e+02 5.342e+02 1.059e+03, threshold=7.944e+02, percent-clipped=5.0 2023-06-19 02:27:43,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-19 02:27:54,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs. 
limit=15.0 2023-06-19 02:28:35,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=361710.0, ans=0.125 2023-06-19 02:28:42,380 INFO [train.py:996] (2/4) Epoch 2, batch 29800, loss[loss=0.3481, simple_loss=0.3832, pruned_loss=0.1565, over 21763.00 frames. ], tot_loss[loss=0.3012, simple_loss=0.3633, pruned_loss=0.1195, over 4293572.63 frames. ], batch size: 508, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:29:03,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=361830.0, ans=0.125 2023-06-19 02:30:17,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=362010.0, ans=0.0 2023-06-19 02:30:22,323 INFO [train.py:996] (2/4) Epoch 2, batch 29850, loss[loss=0.2554, simple_loss=0.3186, pruned_loss=0.09611, over 21796.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3581, pruned_loss=0.1167, over 4289170.49 frames. ], batch size: 247, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:30:38,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=362070.0, ans=0.2 2023-06-19 02:30:43,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=362130.0, ans=0.0 2023-06-19 02:30:46,106 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 2.912e+02 3.664e+02 4.469e+02 7.842e+02, threshold=7.327e+02, percent-clipped=0.0 2023-06-19 02:31:03,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.43 vs. limit=15.0 2023-06-19 02:31:23,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=362250.0, ans=0.125 2023-06-19 02:31:31,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=362250.0, ans=0.0 2023-06-19 02:32:06,354 INFO [train.py:996] (2/4) Epoch 2, batch 29900, loss[loss=0.3406, simple_loss=0.3941, pruned_loss=0.1436, over 21792.00 frames. ], tot_loss[loss=0.2958, simple_loss=0.3559, pruned_loss=0.1179, over 4298037.46 frames. ], batch size: 118, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:32:27,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=362430.0, ans=0.125 2023-06-19 02:32:27,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=362430.0, ans=0.2 2023-06-19 02:33:13,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=362550.0, ans=0.125 2023-06-19 02:33:41,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=362610.0, ans=0.2 2023-06-19 02:33:49,227 INFO [train.py:996] (2/4) Epoch 2, batch 29950, loss[loss=0.3721, simple_loss=0.4033, pruned_loss=0.1704, over 21460.00 frames. ], tot_loss[loss=0.3027, simple_loss=0.36, pruned_loss=0.1227, over 4296226.93 frames. 
], batch size: 510, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:33:54,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=362670.0, ans=0.2 2023-06-19 02:33:56,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=362670.0, ans=0.125 2023-06-19 02:34:07,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-19 02:34:09,822 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.188e+02 4.013e+02 5.057e+02 1.029e+03, threshold=8.025e+02, percent-clipped=2.0 2023-06-19 02:34:48,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=362850.0, ans=0.04949747468305833 2023-06-19 02:35:16,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-19 02:35:29,997 INFO [train.py:996] (2/4) Epoch 2, batch 30000, loss[loss=0.2793, simple_loss=0.3441, pruned_loss=0.1072, over 21141.00 frames. ], tot_loss[loss=0.3022, simple_loss=0.3613, pruned_loss=0.1215, over 4290588.25 frames. ], batch size: 143, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:35:29,998 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 02:35:47,467 INFO [train.py:1028] (2/4) Epoch 2, validation: loss=0.2693, simple_loss=0.3684, pruned_loss=0.08513, over 1796401.00 frames. 2023-06-19 02:35:47,468 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 02:36:31,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=363030.0, ans=0.2 2023-06-19 02:37:00,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363150.0, ans=0.1 2023-06-19 02:37:02,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=363150.0, ans=0.125 2023-06-19 02:37:04,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=363150.0, ans=0.125 2023-06-19 02:37:17,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=363210.0, ans=0.0 2023-06-19 02:37:28,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=363210.0, ans=0.125 2023-06-19 02:37:36,938 INFO [train.py:996] (2/4) Epoch 2, batch 30050, loss[loss=0.3462, simple_loss=0.4538, pruned_loss=0.1193, over 20818.00 frames. ], tot_loss[loss=0.3027, simple_loss=0.3671, pruned_loss=0.1192, over 4285375.75 frames. ], batch size: 607, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:37:48,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363270.0, ans=0.1 2023-06-19 02:38:06,069 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.812e+02 3.422e+02 4.683e+02 8.613e+02, threshold=6.845e+02, percent-clipped=2.0 2023-06-19 02:38:10,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.36 vs. 
limit=12.0 2023-06-19 02:38:25,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-19 02:38:50,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=363450.0, ans=0.015 2023-06-19 02:39:15,586 INFO [train.py:996] (2/4) Epoch 2, batch 30100, loss[loss=0.2813, simple_loss=0.3246, pruned_loss=0.119, over 21334.00 frames. ], tot_loss[loss=0.3008, simple_loss=0.3644, pruned_loss=0.1186, over 4281779.17 frames. ], batch size: 160, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:39:16,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=363570.0, ans=0.125 2023-06-19 02:39:40,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=363630.0, ans=0.125 2023-06-19 02:39:48,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=363630.0, ans=0.1 2023-06-19 02:40:29,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0 2023-06-19 02:40:46,246 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:40:54,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=363810.0, ans=0.125 2023-06-19 02:41:02,511 INFO [train.py:996] (2/4) Epoch 2, batch 30150, loss[loss=0.3593, simple_loss=0.3987, pruned_loss=0.16, over 21747.00 frames. ], tot_loss[loss=0.3024, simple_loss=0.3618, pruned_loss=0.1215, over 4281474.91 frames. ], batch size: 441, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:41:18,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=363870.0, ans=0.125 2023-06-19 02:41:32,558 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.181e+02 3.774e+02 4.610e+02 8.129e+02, threshold=7.548e+02, percent-clipped=2.0 2023-06-19 02:41:55,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=363990.0, ans=0.025 2023-06-19 02:42:39,895 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-19 02:42:47,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=364110.0, ans=0.125 2023-06-19 02:42:56,657 INFO [train.py:996] (2/4) Epoch 2, batch 30200, loss[loss=0.305, simple_loss=0.3596, pruned_loss=0.1252, over 21241.00 frames. ], tot_loss[loss=0.3019, simple_loss=0.3644, pruned_loss=0.1198, over 4276134.02 frames. 
], batch size: 159, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:42:57,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=364170.0, ans=0.125 2023-06-19 02:43:05,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=364170.0, ans=0.125 2023-06-19 02:43:08,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=364170.0, ans=0.125 2023-06-19 02:43:37,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=364290.0, ans=0.125 2023-06-19 02:44:35,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=22.5 2023-06-19 02:44:39,239 INFO [train.py:996] (2/4) Epoch 2, batch 30250, loss[loss=0.3633, simple_loss=0.4602, pruned_loss=0.1332, over 20778.00 frames. ], tot_loss[loss=0.3077, simple_loss=0.3714, pruned_loss=0.122, over 4274501.14 frames. ], batch size: 607, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:44:45,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.28 vs. limit=10.0 2023-06-19 02:44:58,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 3.107e+02 3.710e+02 5.079e+02 9.516e+02, threshold=7.420e+02, percent-clipped=5.0 2023-06-19 02:45:49,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=364650.0, ans=0.2 2023-06-19 02:45:53,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=364650.0, ans=0.125 2023-06-19 02:46:10,288 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 02:46:19,606 INFO [train.py:996] (2/4) Epoch 2, batch 30300, loss[loss=0.2655, simple_loss=0.3191, pruned_loss=0.1059, over 21758.00 frames. ], tot_loss[loss=0.3057, simple_loss=0.3673, pruned_loss=0.122, over 4279686.54 frames. ], batch size: 372, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:46:44,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-19 02:46:52,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=364830.0, ans=0.125 2023-06-19 02:46:52,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=364830.0, ans=0.125 2023-06-19 02:46:54,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=364830.0, ans=0.125 2023-06-19 02:47:22,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.23 vs. limit=6.0 2023-06-19 02:48:03,265 INFO [train.py:996] (2/4) Epoch 2, batch 30350, loss[loss=0.2698, simple_loss=0.3296, pruned_loss=0.105, over 21605.00 frames. ], tot_loss[loss=0.3076, simple_loss=0.3687, pruned_loss=0.1233, over 4279795.96 frames. 
], batch size: 230, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:48:10,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-19 02:48:26,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.273e+02 3.339e+02 3.934e+02 4.976e+02 9.196e+02, threshold=7.868e+02, percent-clipped=1.0 2023-06-19 02:48:50,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-19 02:49:31,192 INFO [train.py:996] (2/4) Epoch 2, batch 30400, loss[loss=0.2749, simple_loss=0.313, pruned_loss=0.1184, over 20216.00 frames. ], tot_loss[loss=0.2995, simple_loss=0.3594, pruned_loss=0.1198, over 4267868.66 frames. ], batch size: 703, lr: 1.44e-02, grad_scale: 32.0 2023-06-19 02:49:41,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-19 02:50:00,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=365430.0, ans=0.125 2023-06-19 02:50:38,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=365610.0, ans=10.0 2023-06-19 02:50:53,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=365610.0, ans=0.07 2023-06-19 02:50:56,427 INFO [train.py:996] (2/4) Epoch 2, batch 30450, loss[loss=0.3604, simple_loss=0.4696, pruned_loss=0.1256, over 19874.00 frames. ], tot_loss[loss=0.3032, simple_loss=0.3632, pruned_loss=0.1216, over 4206871.24 frames. ], batch size: 702, lr: 1.43e-02, grad_scale: 32.0 2023-06-19 02:51:07,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=365670.0, ans=0.125 2023-06-19 02:51:10,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=365670.0, ans=10.0 2023-06-19 02:51:15,882 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.508e+02 4.343e+02 5.750e+02 8.532e+02 2.294e+03, threshold=1.150e+03, percent-clipped=29.0 2023-06-19 02:51:24,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-19 02:51:36,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=365790.0, ans=0.125 2023-06-19 02:51:58,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=365910.0, ans=0.95 2023-06-19 02:53:39,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=365934.0, ans=0.125 2023-06-19 02:53:41,467 INFO [train.py:996] (2/4) Epoch 3, batch 0, loss[loss=0.2483, simple_loss=0.2946, pruned_loss=0.101, over 20801.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.2946, pruned_loss=0.101, over 20801.00 frames. 
], batch size: 609, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:53:41,467 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 02:53:57,720 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2735, simple_loss=0.3782, pruned_loss=0.08435, over 1796401.00 frames. 2023-06-19 02:53:57,721 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 02:54:41,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=366054.0, ans=0.5 2023-06-19 02:55:36,579 INFO [train.py:996] (2/4) Epoch 3, batch 50, loss[loss=0.3412, simple_loss=0.4131, pruned_loss=0.1347, over 21632.00 frames. ], tot_loss[loss=0.3059, simple_loss=0.3686, pruned_loss=0.1216, over 956367.11 frames. ], batch size: 389, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:55:40,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=366234.0, ans=0.0 2023-06-19 02:55:50,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0 2023-06-19 02:56:09,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=366294.0, ans=0.125 2023-06-19 02:56:10,887 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 3.611e+02 4.559e+02 6.599e+02 1.492e+03, threshold=9.117e+02, percent-clipped=9.0 2023-06-19 02:56:12,042 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.93 vs. limit=10.0 2023-06-19 02:57:06,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=366474.0, ans=0.125 2023-06-19 02:57:15,649 INFO [train.py:996] (2/4) Epoch 3, batch 100, loss[loss=0.3165, simple_loss=0.3929, pruned_loss=0.12, over 21814.00 frames. ], tot_loss[loss=0.3143, simple_loss=0.3819, pruned_loss=0.1233, over 1694182.46 frames. ], batch size: 316, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:58:28,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=366774.0, ans=0.2 2023-06-19 02:58:51,807 INFO [train.py:996] (2/4) Epoch 3, batch 150, loss[loss=0.2986, simple_loss=0.3672, pruned_loss=0.115, over 21373.00 frames. ], tot_loss[loss=0.313, simple_loss=0.3818, pruned_loss=0.1221, over 2257609.42 frames. ], batch size: 194, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 02:59:25,649 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.063e+02 3.532e+02 4.732e+02 9.517e+02, threshold=7.065e+02, percent-clipped=1.0 2023-06-19 02:59:29,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=366954.0, ans=0.125 2023-06-19 03:00:18,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=367074.0, ans=0.125 2023-06-19 03:00:30,817 INFO [train.py:996] (2/4) Epoch 3, batch 200, loss[loss=0.412, simple_loss=0.4762, pruned_loss=0.1739, over 21500.00 frames. ], tot_loss[loss=0.3107, simple_loss=0.3778, pruned_loss=0.1219, over 2691670.42 frames. 
], batch size: 471, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 03:00:52,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=367194.0, ans=0.0 2023-06-19 03:00:52,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=367194.0, ans=0.0 2023-06-19 03:01:01,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=367194.0, ans=0.1 2023-06-19 03:01:27,769 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=15.0 2023-06-19 03:01:28,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=367314.0, ans=0.125 2023-06-19 03:02:09,313 INFO [train.py:996] (2/4) Epoch 3, batch 250, loss[loss=0.3028, simple_loss=0.3493, pruned_loss=0.1281, over 21895.00 frames. ], tot_loss[loss=0.3056, simple_loss=0.3724, pruned_loss=0.1194, over 3046497.89 frames. ], batch size: 298, lr: 1.22e-02, grad_scale: 32.0 2023-06-19 03:02:42,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.832e+02 3.615e+02 5.126e+02 8.493e+02, threshold=7.230e+02, percent-clipped=8.0 2023-06-19 03:02:42,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=367494.0, ans=0.125 2023-06-19 03:02:57,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=367554.0, ans=0.0 2023-06-19 03:03:49,711 INFO [train.py:996] (2/4) Epoch 3, batch 300, loss[loss=0.2816, simple_loss=0.3315, pruned_loss=0.1158, over 21484.00 frames. ], tot_loss[loss=0.3004, simple_loss=0.3658, pruned_loss=0.1175, over 3304558.59 frames. ], batch size: 194, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:05:00,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=367914.0, ans=0.125 2023-06-19 03:05:31,311 INFO [train.py:996] (2/4) Epoch 3, batch 350, loss[loss=0.3104, simple_loss=0.3916, pruned_loss=0.1146, over 21664.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3595, pruned_loss=0.1166, over 3520187.00 frames. ], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:06:05,956 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.954e+02 3.445e+02 4.197e+02 6.448e+02, threshold=6.891e+02, percent-clipped=0.0 2023-06-19 03:06:06,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=368094.0, ans=0.0 2023-06-19 03:06:20,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=368154.0, ans=0.0 2023-06-19 03:06:20,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=368154.0, ans=0.125 2023-06-19 03:06:25,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-19 03:07:12,374 INFO [train.py:996] (2/4) Epoch 3, batch 400, loss[loss=0.2916, simple_loss=0.3541, pruned_loss=0.1145, over 21642.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3499, pruned_loss=0.1135, over 3690559.06 frames. 
], batch size: 415, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:07:12,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=368334.0, ans=0.0 2023-06-19 03:07:27,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=368394.0, ans=0.125 2023-06-19 03:07:45,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=368394.0, ans=0.125 2023-06-19 03:07:50,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=368454.0, ans=0.04949747468305833 2023-06-19 03:07:51,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=368454.0, ans=0.1 2023-06-19 03:07:55,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=22.5 2023-06-19 03:08:34,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=368574.0, ans=0.07 2023-06-19 03:08:53,101 INFO [train.py:996] (2/4) Epoch 3, batch 450, loss[loss=0.3089, simple_loss=0.3715, pruned_loss=0.1231, over 21918.00 frames. ], tot_loss[loss=0.2845, simple_loss=0.3448, pruned_loss=0.1121, over 3823103.54 frames. ], batch size: 316, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:09:14,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=368694.0, ans=0.125 2023-06-19 03:09:27,055 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.891e+02 3.614e+02 4.402e+02 7.378e+02, threshold=7.228e+02, percent-clipped=3.0 2023-06-19 03:09:51,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=368814.0, ans=0.0 2023-06-19 03:10:28,828 INFO [train.py:996] (2/4) Epoch 3, batch 500, loss[loss=0.2264, simple_loss=0.2885, pruned_loss=0.08218, over 21969.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3453, pruned_loss=0.1087, over 3924097.83 frames. ], batch size: 119, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:10:46,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=15.0 2023-06-19 03:11:06,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=369054.0, ans=0.0 2023-06-19 03:11:21,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=369054.0, ans=0.125 2023-06-19 03:11:37,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=369114.0, ans=0.5 2023-06-19 03:12:08,141 INFO [train.py:996] (2/4) Epoch 3, batch 550, loss[loss=0.2462, simple_loss=0.3021, pruned_loss=0.0952, over 21835.00 frames. ], tot_loss[loss=0.2837, simple_loss=0.3496, pruned_loss=0.1089, over 3999962.83 frames. ], batch size: 98, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:12:22,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. 
limit=15.0 2023-06-19 03:12:46,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.117e+02 3.637e+02 4.984e+02 1.103e+03, threshold=7.274e+02, percent-clipped=1.0 2023-06-19 03:13:47,837 INFO [train.py:996] (2/4) Epoch 3, batch 600, loss[loss=0.3016, simple_loss=0.3419, pruned_loss=0.1307, over 21731.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3518, pruned_loss=0.1098, over 4059551.73 frames. ], batch size: 112, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:14:22,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=369594.0, ans=0.0 2023-06-19 03:14:47,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=369714.0, ans=0.0 2023-06-19 03:14:54,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=369714.0, ans=0.125 2023-06-19 03:15:22,974 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.11 vs. limit=15.0 2023-06-19 03:15:28,828 INFO [train.py:996] (2/4) Epoch 3, batch 650, loss[loss=0.3394, simple_loss=0.377, pruned_loss=0.1509, over 21825.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3521, pruned_loss=0.1093, over 4113541.11 frames. ], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:16:02,423 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.260e+02 4.172e+02 5.495e+02 8.347e+02, threshold=8.343e+02, percent-clipped=4.0 2023-06-19 03:16:35,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=22.5 2023-06-19 03:17:08,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=370134.0, ans=0.0 2023-06-19 03:17:09,764 INFO [train.py:996] (2/4) Epoch 3, batch 700, loss[loss=0.3952, simple_loss=0.4525, pruned_loss=0.1689, over 21695.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3543, pruned_loss=0.111, over 4155117.03 frames. ], batch size: 441, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:17:33,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=370194.0, ans=0.1 2023-06-19 03:17:37,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=370194.0, ans=0.0 2023-06-19 03:18:04,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=370254.0, ans=0.125 2023-06-19 03:18:41,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=370374.0, ans=0.125 2023-06-19 03:18:49,228 INFO [train.py:996] (2/4) Epoch 3, batch 750, loss[loss=0.2873, simple_loss=0.3465, pruned_loss=0.114, over 21841.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3534, pruned_loss=0.1124, over 4190860.95 frames. 
], batch size: 124, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:19:28,423 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.022e+02 3.507e+02 4.070e+02 7.167e+02, threshold=7.014e+02, percent-clipped=0.0 2023-06-19 03:19:38,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=370554.0, ans=0.0 2023-06-19 03:20:08,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=370674.0, ans=0.0 2023-06-19 03:20:31,107 INFO [train.py:996] (2/4) Epoch 3, batch 800, loss[loss=0.2607, simple_loss=0.3083, pruned_loss=0.1065, over 21588.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3499, pruned_loss=0.1127, over 4204892.78 frames. ], batch size: 247, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:20:36,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=370734.0, ans=0.125 2023-06-19 03:20:43,620 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-19 03:21:03,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=370794.0, ans=0.125 2023-06-19 03:21:19,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=370854.0, ans=0.2 2023-06-19 03:21:20,245 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-19 03:21:37,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=370914.0, ans=0.125 2023-06-19 03:21:41,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=370914.0, ans=0.025 2023-06-19 03:22:06,196 INFO [train.py:996] (2/4) Epoch 3, batch 850, loss[loss=0.2978, simple_loss=0.3386, pruned_loss=0.1285, over 21812.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3469, pruned_loss=0.1119, over 4216608.97 frames. ], batch size: 298, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:22:46,300 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.110e+02 3.682e+02 5.059e+02 8.553e+02, threshold=7.364e+02, percent-clipped=4.0 2023-06-19 03:23:09,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=371214.0, ans=0.125 2023-06-19 03:23:11,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=371214.0, ans=0.0 2023-06-19 03:23:43,054 INFO [train.py:996] (2/4) Epoch 3, batch 900, loss[loss=0.3001, simple_loss=0.3678, pruned_loss=0.1161, over 21091.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3448, pruned_loss=0.1129, over 4227602.11 frames. ], batch size: 608, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:23:44,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.03 vs. 
limit=6.0 2023-06-19 03:24:02,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=371334.0, ans=0.125 2023-06-19 03:24:14,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-19 03:24:20,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.37 vs. limit=22.5 2023-06-19 03:24:59,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=371514.0, ans=0.2 2023-06-19 03:25:24,170 INFO [train.py:996] (2/4) Epoch 3, batch 950, loss[loss=0.2587, simple_loss=0.3235, pruned_loss=0.09691, over 21826.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3446, pruned_loss=0.1134, over 4243158.28 frames. ], batch size: 282, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:25:26,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=371634.0, ans=0.125 2023-06-19 03:25:37,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=371634.0, ans=0.125 2023-06-19 03:25:58,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=371694.0, ans=0.015 2023-06-19 03:25:59,364 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.854e+02 3.566e+02 4.630e+02 9.213e+02, threshold=7.133e+02, percent-clipped=4.0 2023-06-19 03:26:03,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=371754.0, ans=0.2 2023-06-19 03:27:03,896 INFO [train.py:996] (2/4) Epoch 3, batch 1000, loss[loss=0.2538, simple_loss=0.3068, pruned_loss=0.1004, over 21561.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3446, pruned_loss=0.113, over 4258555.77 frames. ], batch size: 263, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:27:15,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=371934.0, ans=0.125 2023-06-19 03:27:27,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=371994.0, ans=0.0 2023-06-19 03:27:30,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.33 vs. limit=10.0 2023-06-19 03:27:50,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=372054.0, ans=0.2 2023-06-19 03:27:50,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=372054.0, ans=0.125 2023-06-19 03:28:45,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=372174.0, ans=0.1 2023-06-19 03:28:48,345 INFO [train.py:996] (2/4) Epoch 3, batch 1050, loss[loss=0.3185, simple_loss=0.3676, pruned_loss=0.1347, over 21860.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3432, pruned_loss=0.1124, over 4271412.79 frames. 
], batch size: 332, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:28:55,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=372234.0, ans=0.125 2023-06-19 03:29:24,722 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.077e+02 3.761e+02 4.435e+02 8.515e+02, threshold=7.523e+02, percent-clipped=2.0 2023-06-19 03:29:36,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-19 03:29:54,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=372414.0, ans=0.0 2023-06-19 03:30:06,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-19 03:30:27,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.85 vs. limit=10.0 2023-06-19 03:30:31,860 INFO [train.py:996] (2/4) Epoch 3, batch 1100, loss[loss=0.3108, simple_loss=0.3613, pruned_loss=0.1302, over 21737.00 frames. ], tot_loss[loss=0.2848, simple_loss=0.3452, pruned_loss=0.1123, over 4277749.31 frames. ], batch size: 414, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:31:07,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=372594.0, ans=0.125 2023-06-19 03:32:17,156 INFO [train.py:996] (2/4) Epoch 3, batch 1150, loss[loss=0.3489, simple_loss=0.3965, pruned_loss=0.1506, over 21449.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3458, pruned_loss=0.1128, over 4283288.95 frames. ], batch size: 471, lr: 1.21e-02, grad_scale: 16.0 2023-06-19 03:32:30,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=372834.0, ans=0.125 2023-06-19 03:33:03,639 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.941e+02 3.564e+02 4.361e+02 9.852e+02, threshold=7.128e+02, percent-clipped=2.0 2023-06-19 03:33:26,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=22.5 2023-06-19 03:33:34,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5 2023-06-19 03:33:45,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-19 03:33:52,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=373074.0, ans=0.125 2023-06-19 03:34:05,585 INFO [train.py:996] (2/4) Epoch 3, batch 1200, loss[loss=0.2491, simple_loss=0.3323, pruned_loss=0.08292, over 21618.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3468, pruned_loss=0.1123, over 4282673.57 frames. ], batch size: 230, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:35:39,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.99 vs. 
limit=15.0 2023-06-19 03:35:49,159 INFO [train.py:996] (2/4) Epoch 3, batch 1250, loss[loss=0.3366, simple_loss=0.3848, pruned_loss=0.1442, over 21819.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3504, pruned_loss=0.1136, over 4282259.14 frames. ], batch size: 112, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:36:30,658 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.020e+02 3.067e+02 3.657e+02 4.609e+02 8.051e+02, threshold=7.314e+02, percent-clipped=2.0 2023-06-19 03:36:48,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=373554.0, ans=0.125 2023-06-19 03:36:52,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.82 vs. limit=10.0 2023-06-19 03:37:25,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=373674.0, ans=0.1 2023-06-19 03:37:29,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=373674.0, ans=0.0 2023-06-19 03:37:33,719 INFO [train.py:996] (2/4) Epoch 3, batch 1300, loss[loss=0.3044, simple_loss=0.3826, pruned_loss=0.1131, over 21813.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.349, pruned_loss=0.1122, over 4275365.64 frames. ], batch size: 316, lr: 1.21e-02, grad_scale: 32.0 2023-06-19 03:37:41,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=373734.0, ans=0.125 2023-06-19 03:37:54,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=373794.0, ans=0.2 2023-06-19 03:38:58,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=373974.0, ans=0.5 2023-06-19 03:39:17,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=374034.0, ans=0.07 2023-06-19 03:39:18,831 INFO [train.py:996] (2/4) Epoch 3, batch 1350, loss[loss=0.3456, simple_loss=0.3962, pruned_loss=0.1475, over 21452.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3507, pruned_loss=0.1145, over 4277232.36 frames. ], batch size: 471, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:40:01,532 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.516e+02 4.679e+02 5.899e+02 9.616e+02, threshold=9.359e+02, percent-clipped=8.0 2023-06-19 03:40:28,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=374214.0, ans=0.125 2023-06-19 03:40:41,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=374214.0, ans=0.04949747468305833 2023-06-19 03:40:52,555 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.69 vs. limit=15.0 2023-06-19 03:41:03,063 INFO [train.py:996] (2/4) Epoch 3, batch 1400, loss[loss=0.2595, simple_loss=0.3376, pruned_loss=0.09073, over 21750.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3509, pruned_loss=0.1143, over 4279515.02 frames. 
], batch size: 298, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:41:42,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=374394.0, ans=0.125 2023-06-19 03:42:08,345 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5 2023-06-19 03:42:47,129 INFO [train.py:996] (2/4) Epoch 3, batch 1450, loss[loss=0.333, simple_loss=0.3791, pruned_loss=0.1435, over 21219.00 frames. ], tot_loss[loss=0.2899, simple_loss=0.3498, pruned_loss=0.115, over 4274167.14 frames. ], batch size: 143, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:42:54,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=374634.0, ans=0.0 2023-06-19 03:43:28,914 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 3.105e+02 3.604e+02 4.454e+02 7.120e+02, threshold=7.209e+02, percent-clipped=0.0 2023-06-19 03:43:39,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=374754.0, ans=0.2 2023-06-19 03:43:55,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=374814.0, ans=15.0 2023-06-19 03:44:12,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-19 03:44:18,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=374874.0, ans=0.125 2023-06-19 03:44:32,097 INFO [train.py:996] (2/4) Epoch 3, batch 1500, loss[loss=0.2528, simple_loss=0.3455, pruned_loss=0.08008, over 20959.00 frames. ], tot_loss[loss=0.2926, simple_loss=0.3517, pruned_loss=0.1168, over 4276648.44 frames. ], batch size: 607, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:44:56,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=374934.0, ans=0.125 2023-06-19 03:44:56,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=374934.0, ans=0.125 2023-06-19 03:44:58,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-06-19 03:45:48,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=375114.0, ans=0.2 2023-06-19 03:46:16,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=375234.0, ans=0.0 2023-06-19 03:46:17,201 INFO [train.py:996] (2/4) Epoch 3, batch 1550, loss[loss=0.2443, simple_loss=0.3249, pruned_loss=0.08182, over 21751.00 frames. ], tot_loss[loss=0.2915, simple_loss=0.3505, pruned_loss=0.1162, over 4280575.64 frames. ], batch size: 351, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:46:46,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.74 vs. 
limit=15.0 2023-06-19 03:47:05,571 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.746e+02 3.313e+02 3.948e+02 6.762e+02, threshold=6.626e+02, percent-clipped=0.0 2023-06-19 03:47:11,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=375354.0, ans=0.125 2023-06-19 03:47:19,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=375354.0, ans=0.1 2023-06-19 03:47:41,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=375414.0, ans=0.125 2023-06-19 03:47:49,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-19 03:48:11,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=375474.0, ans=0.1 2023-06-19 03:48:14,572 INFO [train.py:996] (2/4) Epoch 3, batch 1600, loss[loss=0.2715, simple_loss=0.3432, pruned_loss=0.09992, over 21709.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3465, pruned_loss=0.1129, over 4276538.43 frames. ], batch size: 351, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:48:35,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=375594.0, ans=0.07 2023-06-19 03:48:40,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=375594.0, ans=0.125 2023-06-19 03:48:48,366 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-19 03:49:20,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=375714.0, ans=0.035 2023-06-19 03:49:44,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=375774.0, ans=0.09899494936611666 2023-06-19 03:50:00,867 INFO [train.py:996] (2/4) Epoch 3, batch 1650, loss[loss=0.3023, simple_loss=0.3558, pruned_loss=0.1244, over 21615.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3467, pruned_loss=0.1118, over 4279799.14 frames. ], batch size: 471, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:50:28,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=375894.0, ans=0.0 2023-06-19 03:50:38,342 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.775e+02 3.357e+02 4.211e+02 7.088e+02, threshold=6.714e+02, percent-clipped=2.0 2023-06-19 03:51:39,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=376074.0, ans=0.125 2023-06-19 03:51:43,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=376074.0, ans=0.1 2023-06-19 03:51:49,346 INFO [train.py:996] (2/4) Epoch 3, batch 1700, loss[loss=0.3266, simple_loss=0.4054, pruned_loss=0.1239, over 21752.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.353, pruned_loss=0.1141, over 4277612.17 frames. 
], batch size: 351, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 03:52:04,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-19 03:52:07,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=376134.0, ans=0.125 2023-06-19 03:52:21,915 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 03:52:40,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-19 03:53:20,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=376374.0, ans=0.09899494936611666 2023-06-19 03:53:43,177 INFO [train.py:996] (2/4) Epoch 3, batch 1750, loss[loss=0.2481, simple_loss=0.3308, pruned_loss=0.08274, over 21720.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3529, pruned_loss=0.1121, over 4270619.45 frames. ], batch size: 351, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:53:52,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=376434.0, ans=0.125 2023-06-19 03:54:02,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=376434.0, ans=0.1 2023-06-19 03:54:23,388 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 3.144e+02 4.448e+02 5.330e+02 9.147e+02, threshold=8.897e+02, percent-clipped=12.0 2023-06-19 03:54:48,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-19 03:55:02,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=376614.0, ans=0.125 2023-06-19 03:55:31,676 INFO [train.py:996] (2/4) Epoch 3, batch 1800, loss[loss=0.2824, simple_loss=0.3502, pruned_loss=0.1073, over 21627.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3476, pruned_loss=0.108, over 4267745.46 frames. ], batch size: 263, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:55:42,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=376734.0, ans=0.95 2023-06-19 03:56:11,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=376794.0, ans=0.0 2023-06-19 03:56:55,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=376974.0, ans=0.125 2023-06-19 03:57:11,741 INFO [train.py:996] (2/4) Epoch 3, batch 1850, loss[loss=0.3003, simple_loss=0.3605, pruned_loss=0.12, over 21799.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.3486, pruned_loss=0.1059, over 4274576.21 frames. 
], batch size: 298, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:58:00,692 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.940e+02 3.521e+02 4.849e+02 8.658e+02, threshold=7.043e+02, percent-clipped=0.0 2023-06-19 03:58:06,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=377154.0, ans=0.0 2023-06-19 03:58:06,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=377154.0, ans=0.0 2023-06-19 03:58:27,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=377214.0, ans=0.1 2023-06-19 03:58:29,528 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-19 03:59:02,460 INFO [train.py:996] (2/4) Epoch 3, batch 1900, loss[loss=0.2563, simple_loss=0.3328, pruned_loss=0.08986, over 21232.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3474, pruned_loss=0.1059, over 4276540.49 frames. ], batch size: 176, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 03:59:15,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=377334.0, ans=0.1 2023-06-19 03:59:30,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=377394.0, ans=0.125 2023-06-19 03:59:40,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=12.0 2023-06-19 04:00:31,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=377574.0, ans=0.0 2023-06-19 04:00:46,525 INFO [train.py:996] (2/4) Epoch 3, batch 1950, loss[loss=0.3321, simple_loss=0.4015, pruned_loss=0.1313, over 21594.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3419, pruned_loss=0.1055, over 4277324.68 frames. ], batch size: 441, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 04:00:51,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-19 04:00:58,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=377634.0, ans=0.125 2023-06-19 04:01:30,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 3.084e+02 3.765e+02 4.629e+02 7.601e+02, threshold=7.530e+02, percent-clipped=2.0 2023-06-19 04:01:32,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.20 vs. limit=10.0 2023-06-19 04:01:40,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=377754.0, ans=0.05 2023-06-19 04:02:21,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=377874.0, ans=0.125 2023-06-19 04:02:32,679 INFO [train.py:996] (2/4) Epoch 3, batch 2000, loss[loss=0.3143, simple_loss=0.3851, pruned_loss=0.1217, over 21552.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3407, pruned_loss=0.1054, over 4270447.86 frames. 
], batch size: 441, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:02:35,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.82 vs. limit=10.0 2023-06-19 04:03:17,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-19 04:03:33,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-19 04:03:34,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=378114.0, ans=0.1 2023-06-19 04:04:11,042 INFO [train.py:996] (2/4) Epoch 3, batch 2050, loss[loss=0.3533, simple_loss=0.445, pruned_loss=0.1308, over 21266.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3431, pruned_loss=0.1062, over 4272055.20 frames. ], batch size: 548, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:04:14,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=378234.0, ans=0.0 2023-06-19 04:04:18,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-19 04:04:54,239 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 2.993e+02 3.653e+02 4.561e+02 8.702e+02, threshold=7.306e+02, percent-clipped=1.0 2023-06-19 04:05:09,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=378354.0, ans=0.0 2023-06-19 04:05:39,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=378474.0, ans=0.0 2023-06-19 04:05:54,049 INFO [train.py:996] (2/4) Epoch 3, batch 2100, loss[loss=0.3889, simple_loss=0.4395, pruned_loss=0.1692, over 21654.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3469, pruned_loss=0.1099, over 4277816.19 frames. ], batch size: 414, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:05:59,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=378534.0, ans=0.0 2023-06-19 04:06:12,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=378534.0, ans=0.0 2023-06-19 04:06:57,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=378654.0, ans=0.125 2023-06-19 04:06:59,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=378714.0, ans=0.0 2023-06-19 04:07:02,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=378714.0, ans=0.125 2023-06-19 04:07:39,944 INFO [train.py:996] (2/4) Epoch 3, batch 2150, loss[loss=0.2648, simple_loss=0.3146, pruned_loss=0.1075, over 21603.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3448, pruned_loss=0.1098, over 4274006.18 frames. 
], batch size: 298, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:07:54,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=378834.0, ans=0.0 2023-06-19 04:08:30,017 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.207e+02 3.919e+02 5.012e+02 8.780e+02, threshold=7.837e+02, percent-clipped=4.0 2023-06-19 04:08:36,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.45 vs. limit=15.0 2023-06-19 04:09:24,818 INFO [train.py:996] (2/4) Epoch 3, batch 2200, loss[loss=0.3804, simple_loss=0.4278, pruned_loss=0.1665, over 21485.00 frames. ], tot_loss[loss=0.2851, simple_loss=0.3489, pruned_loss=0.1106, over 4268934.98 frames. ], batch size: 471, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:11:09,114 INFO [train.py:996] (2/4) Epoch 3, batch 2250, loss[loss=0.2623, simple_loss=0.3343, pruned_loss=0.09513, over 21746.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3474, pruned_loss=0.109, over 4268931.09 frames. ], batch size: 298, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:11:36,397 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:11:39,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=379494.0, ans=0.125 2023-06-19 04:11:56,998 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.971e+02 3.785e+02 4.786e+02 8.748e+02, threshold=7.570e+02, percent-clipped=4.0 2023-06-19 04:11:57,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=379554.0, ans=0.0 2023-06-19 04:12:52,025 INFO [train.py:996] (2/4) Epoch 3, batch 2300, loss[loss=0.2894, simple_loss=0.331, pruned_loss=0.1239, over 21842.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3406, pruned_loss=0.108, over 4277259.06 frames. ], batch size: 118, lr: 1.20e-02, grad_scale: 32.0 2023-06-19 04:12:54,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-19 04:13:01,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=379734.0, ans=0.125 2023-06-19 04:13:42,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=379854.0, ans=0.0 2023-06-19 04:14:24,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-19 04:14:42,077 INFO [train.py:996] (2/4) Epoch 3, batch 2350, loss[loss=0.2521, simple_loss=0.2994, pruned_loss=0.1024, over 21231.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3388, pruned_loss=0.1084, over 4267034.59 frames. ], batch size: 176, lr: 1.20e-02, grad_scale: 16.0 2023-06-19 04:15:27,983 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.593e+02 3.227e+02 3.663e+02 5.028e+02 9.666e+02, threshold=7.327e+02, percent-clipped=5.0 2023-06-19 04:15:41,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=27.31 vs. 
limit=15.0 2023-06-19 04:15:44,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=380214.0, ans=0.2 2023-06-19 04:15:46,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=380214.0, ans=0.07 2023-06-19 04:16:35,293 INFO [train.py:996] (2/4) Epoch 3, batch 2400, loss[loss=0.2888, simple_loss=0.3488, pruned_loss=0.1144, over 21828.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.342, pruned_loss=0.1107, over 4261366.17 frames. ], batch size: 282, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:17:50,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=380514.0, ans=0.2 2023-06-19 04:18:21,186 INFO [train.py:996] (2/4) Epoch 3, batch 2450, loss[loss=0.2756, simple_loss=0.3199, pruned_loss=0.1156, over 21836.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3491, pruned_loss=0.1132, over 4262471.91 frames. ], batch size: 98, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:18:41,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=380694.0, ans=0.04949747468305833 2023-06-19 04:18:48,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.98 vs. limit=6.0 2023-06-19 04:19:00,876 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 3.007e+02 3.779e+02 4.461e+02 8.893e+02, threshold=7.558e+02, percent-clipped=3.0 2023-06-19 04:19:27,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-06-19 04:19:28,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.60 vs. limit=15.0 2023-06-19 04:19:49,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=380874.0, ans=0.125 2023-06-19 04:20:04,642 INFO [train.py:996] (2/4) Epoch 3, batch 2500, loss[loss=0.2457, simple_loss=0.2969, pruned_loss=0.09724, over 21561.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3477, pruned_loss=0.1127, over 4256705.49 frames. 
], batch size: 263, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:20:30,467 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:20:39,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=380994.0, ans=0.1 2023-06-19 04:20:48,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=381054.0, ans=0.1 2023-06-19 04:20:57,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=381054.0, ans=0.0 2023-06-19 04:21:08,938 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:21:14,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=381114.0, ans=10.0 2023-06-19 04:21:22,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=381174.0, ans=0.125 2023-06-19 04:21:37,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-19 04:21:50,580 INFO [train.py:996] (2/4) Epoch 3, batch 2550, loss[loss=0.2941, simple_loss=0.3429, pruned_loss=0.1226, over 21833.00 frames. ], tot_loss[loss=0.2859, simple_loss=0.3461, pruned_loss=0.1128, over 4252323.12 frames. ], batch size: 98, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:21:55,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.37 vs. limit=10.0 2023-06-19 04:22:18,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=381294.0, ans=0.1 2023-06-19 04:22:31,562 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.967e+02 3.517e+02 4.789e+02 7.584e+02, threshold=7.035e+02, percent-clipped=1.0 2023-06-19 04:23:18,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=381474.0, ans=0.125 2023-06-19 04:23:36,599 INFO [train.py:996] (2/4) Epoch 3, batch 2600, loss[loss=0.315, simple_loss=0.3605, pruned_loss=0.1347, over 21818.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3477, pruned_loss=0.1153, over 4259181.12 frames. ], batch size: 247, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:23:37,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=381534.0, ans=0.0 2023-06-19 04:23:49,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=381534.0, ans=0.0 2023-06-19 04:24:16,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=381654.0, ans=0.0 2023-06-19 04:25:24,318 INFO [train.py:996] (2/4) Epoch 3, batch 2650, loss[loss=0.2777, simple_loss=0.3366, pruned_loss=0.1094, over 21851.00 frames. ], tot_loss[loss=0.2913, simple_loss=0.349, pruned_loss=0.1168, over 4272178.87 frames. 
], batch size: 124, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:25:51,297 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.287e-02 2023-06-19 04:26:05,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 3.151e+02 3.898e+02 4.845e+02 8.708e+02, threshold=7.796e+02, percent-clipped=4.0 2023-06-19 04:26:43,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=382014.0, ans=0.04949747468305833 2023-06-19 04:26:59,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=382074.0, ans=0.0 2023-06-19 04:27:09,898 INFO [train.py:996] (2/4) Epoch 3, batch 2700, loss[loss=0.2576, simple_loss=0.3002, pruned_loss=0.1075, over 21193.00 frames. ], tot_loss[loss=0.2874, simple_loss=0.3473, pruned_loss=0.1137, over 4269551.37 frames. ], batch size: 143, lr: 1.19e-02, grad_scale: 16.0 2023-06-19 04:27:10,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=382134.0, ans=0.2 2023-06-19 04:28:08,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=382254.0, ans=0.1 2023-06-19 04:28:55,303 INFO [train.py:996] (2/4) Epoch 3, batch 2750, loss[loss=0.2855, simple_loss=0.3619, pruned_loss=0.1045, over 20827.00 frames. ], tot_loss[loss=0.2866, simple_loss=0.3462, pruned_loss=0.1135, over 4279293.01 frames. ], batch size: 607, lr: 1.19e-02, grad_scale: 16.0 2023-06-19 04:28:56,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.65 vs. limit=22.5 2023-06-19 04:29:29,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=382494.0, ans=0.04949747468305833 2023-06-19 04:29:36,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=382554.0, ans=0.0 2023-06-19 04:29:37,651 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.554e+02 3.438e+02 4.314e+02 5.827e+02 1.229e+03, threshold=8.627e+02, percent-clipped=3.0 2023-06-19 04:29:55,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=382614.0, ans=0.0 2023-06-19 04:30:05,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=382614.0, ans=0.125 2023-06-19 04:30:45,985 INFO [train.py:996] (2/4) Epoch 3, batch 2800, loss[loss=0.3285, simple_loss=0.3798, pruned_loss=0.1386, over 21319.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3517, pruned_loss=0.1162, over 4275152.75 frames. 
], batch size: 549, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:31:07,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=382794.0, ans=0.2 2023-06-19 04:31:46,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=382854.0, ans=0.1 2023-06-19 04:32:02,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=382914.0, ans=0.125 2023-06-19 04:32:31,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=383034.0, ans=0.0 2023-06-19 04:32:32,065 INFO [train.py:996] (2/4) Epoch 3, batch 2850, loss[loss=0.2582, simple_loss=0.312, pruned_loss=0.1022, over 21760.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3508, pruned_loss=0.1157, over 4266058.88 frames. ], batch size: 282, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:32:40,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=383034.0, ans=0.125 2023-06-19 04:32:49,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=383094.0, ans=0.125 2023-06-19 04:33:19,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.317e+02 3.934e+02 4.710e+02 8.134e+02, threshold=7.867e+02, percent-clipped=0.0 2023-06-19 04:33:35,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=383154.0, ans=0.05 2023-06-19 04:34:08,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383274.0, ans=0.1 2023-06-19 04:34:15,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=383334.0, ans=0.125 2023-06-19 04:34:17,110 INFO [train.py:996] (2/4) Epoch 3, batch 2900, loss[loss=0.2615, simple_loss=0.3173, pruned_loss=0.1028, over 21339.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3484, pruned_loss=0.1148, over 4274692.02 frames. ], batch size: 176, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:35:05,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.74 vs. limit=15.0 2023-06-19 04:35:33,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=383514.0, ans=0.125 2023-06-19 04:35:51,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=383574.0, ans=0.1 2023-06-19 04:36:02,571 INFO [train.py:996] (2/4) Epoch 3, batch 2950, loss[loss=0.2818, simple_loss=0.3577, pruned_loss=0.1029, over 21416.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.349, pruned_loss=0.1158, over 4272929.78 frames. 
], batch size: 194, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:36:18,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=383694.0, ans=0.0 2023-06-19 04:36:36,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=383694.0, ans=0.05 2023-06-19 04:36:50,312 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 2.976e+02 3.392e+02 4.326e+02 8.351e+02, threshold=6.785e+02, percent-clipped=1.0 2023-06-19 04:37:10,606 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.34 vs. limit=6.0 2023-06-19 04:37:44,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=383874.0, ans=0.125 2023-06-19 04:37:45,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=383874.0, ans=0.0 2023-06-19 04:37:48,115 INFO [train.py:996] (2/4) Epoch 3, batch 3000, loss[loss=0.3261, simple_loss=0.3821, pruned_loss=0.135, over 21270.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3523, pruned_loss=0.116, over 4270666.83 frames. ], batch size: 143, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:37:48,116 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 04:38:01,334 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.5585, 4.1004, 3.9925, 2.6960], device='cuda:2') 2023-06-19 04:38:05,911 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2668, simple_loss=0.3633, pruned_loss=0.08521, over 1796401.00 frames. 2023-06-19 04:38:05,912 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 04:39:28,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=384114.0, ans=0.0 2023-06-19 04:39:33,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=384114.0, ans=0.125 2023-06-19 04:39:33,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=384114.0, ans=0.125 2023-06-19 04:39:35,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.22 vs. limit=12.0 2023-06-19 04:39:52,631 INFO [train.py:996] (2/4) Epoch 3, batch 3050, loss[loss=0.3166, simple_loss=0.3984, pruned_loss=0.1174, over 21525.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3524, pruned_loss=0.114, over 4266680.97 frames. ], batch size: 471, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:40:44,361 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 3.108e+02 3.737e+02 4.686e+02 8.351e+02, threshold=7.474e+02, percent-clipped=4.0 2023-06-19 04:41:23,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=384474.0, ans=0.0 2023-06-19 04:41:28,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.94 vs. limit=15.0 2023-06-19 04:41:41,711 INFO [train.py:996] (2/4) Epoch 3, batch 3100, loss[loss=0.3651, simple_loss=0.3991, pruned_loss=0.1655, over 21773.00 frames. 
], tot_loss[loss=0.2882, simple_loss=0.3515, pruned_loss=0.1125, over 4273205.76 frames. ], batch size: 441, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:42:09,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=384594.0, ans=0.0 2023-06-19 04:42:15,053 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=22.5 2023-06-19 04:42:17,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=17.97 vs. limit=22.5 2023-06-19 04:42:53,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=384714.0, ans=0.1 2023-06-19 04:43:32,016 INFO [train.py:996] (2/4) Epoch 3, batch 3150, loss[loss=0.2873, simple_loss=0.3547, pruned_loss=0.11, over 21490.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3529, pruned_loss=0.1131, over 4274376.67 frames. ], batch size: 194, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:43:51,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=384834.0, ans=0.09899494936611666 2023-06-19 04:44:15,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=384954.0, ans=0.125 2023-06-19 04:44:19,354 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.177e+02 3.933e+02 4.816e+02 8.908e+02, threshold=7.865e+02, percent-clipped=2.0 2023-06-19 04:45:23,739 INFO [train.py:996] (2/4) Epoch 3, batch 3200, loss[loss=0.3535, simple_loss=0.4085, pruned_loss=0.1492, over 21607.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3545, pruned_loss=0.1141, over 4275374.96 frames. ], batch size: 414, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:45:42,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=385134.0, ans=0.1 2023-06-19 04:45:51,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=385194.0, ans=0.0 2023-06-19 04:46:04,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2023-06-19 04:46:08,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=385254.0, ans=0.125 2023-06-19 04:47:08,035 INFO [train.py:996] (2/4) Epoch 3, batch 3250, loss[loss=0.3191, simple_loss=0.3634, pruned_loss=0.1374, over 21812.00 frames. ], tot_loss[loss=0.294, simple_loss=0.3549, pruned_loss=0.1165, over 4279218.08 frames. 
], batch size: 441, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:47:23,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=385434.0, ans=0.125 2023-06-19 04:47:28,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=385494.0, ans=0.2 2023-06-19 04:47:42,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=385494.0, ans=0.125 2023-06-19 04:47:50,335 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.360e+02 3.327e+02 4.160e+02 5.584e+02 8.725e+02, threshold=8.319e+02, percent-clipped=2.0 2023-06-19 04:48:47,924 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 04:48:55,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-19 04:48:59,527 INFO [train.py:996] (2/4) Epoch 3, batch 3300, loss[loss=0.2633, simple_loss=0.3148, pruned_loss=0.1059, over 21435.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3519, pruned_loss=0.1166, over 4280525.06 frames. ], batch size: 389, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:49:17,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=385794.0, ans=0.125 2023-06-19 04:50:44,363 INFO [train.py:996] (2/4) Epoch 3, batch 3350, loss[loss=0.2998, simple_loss=0.3835, pruned_loss=0.1081, over 21313.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3555, pruned_loss=0.1179, over 4276717.73 frames. ], batch size: 548, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:50:46,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=386034.0, ans=0.125 2023-06-19 04:51:20,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.290e+02 3.790e+02 4.247e+02 7.031e+02, threshold=7.579e+02, percent-clipped=0.0 2023-06-19 04:52:20,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=386274.0, ans=0.125 2023-06-19 04:52:27,717 INFO [train.py:996] (2/4) Epoch 3, batch 3400, loss[loss=0.2796, simple_loss=0.3312, pruned_loss=0.114, over 21835.00 frames. ], tot_loss[loss=0.295, simple_loss=0.355, pruned_loss=0.1175, over 4281619.23 frames. ], batch size: 372, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:52:29,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386334.0, ans=0.1 2023-06-19 04:54:13,036 INFO [train.py:996] (2/4) Epoch 3, batch 3450, loss[loss=0.2676, simple_loss=0.3176, pruned_loss=0.1088, over 21845.00 frames. ], tot_loss[loss=0.293, simple_loss=0.3516, pruned_loss=0.1172, over 4278670.37 frames. 
], batch size: 107, lr: 1.19e-02, grad_scale: 32.0 2023-06-19 04:54:31,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=386694.0, ans=10.0 2023-06-19 04:54:57,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=386754.0, ans=0.1 2023-06-19 04:55:06,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.192e+02 3.930e+02 4.779e+02 8.558e+02, threshold=7.861e+02, percent-clipped=2.0 2023-06-19 04:55:11,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=386754.0, ans=0.025 2023-06-19 04:55:12,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-19 04:55:57,927 INFO [train.py:996] (2/4) Epoch 3, batch 3500, loss[loss=0.3077, simple_loss=0.3666, pruned_loss=0.1244, over 21968.00 frames. ], tot_loss[loss=0.3044, simple_loss=0.3633, pruned_loss=0.1227, over 4284431.49 frames. ], batch size: 317, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 04:56:23,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386994.0, ans=0.1 2023-06-19 04:56:36,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=386994.0, ans=0.125 2023-06-19 04:56:37,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=386994.0, ans=0.1 2023-06-19 04:56:55,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=387054.0, ans=10.0 2023-06-19 04:57:00,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=387054.0, ans=0.125 2023-06-19 04:57:18,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=387114.0, ans=0.125 2023-06-19 04:57:29,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=387174.0, ans=0.07 2023-06-19 04:57:44,030 INFO [train.py:996] (2/4) Epoch 3, batch 3550, loss[loss=0.2592, simple_loss=0.3112, pruned_loss=0.1036, over 21303.00 frames. ], tot_loss[loss=0.305, simple_loss=0.3643, pruned_loss=0.1229, over 4285736.63 frames. ], batch size: 160, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 04:58:38,190 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.279e+02 3.936e+02 4.776e+02 8.299e+02, threshold=7.873e+02, percent-clipped=2.0 2023-06-19 04:59:31,917 INFO [train.py:996] (2/4) Epoch 3, batch 3600, loss[loss=0.3469, simple_loss=0.403, pruned_loss=0.1454, over 21822.00 frames. ], tot_loss[loss=0.3016, simple_loss=0.3594, pruned_loss=0.1219, over 4279727.26 frames. ], batch size: 124, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 04:59:34,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=387534.0, ans=0.125 2023-06-19 05:00:46,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.74 vs. 
limit=22.5 2023-06-19 05:00:55,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-19 05:01:16,076 INFO [train.py:996] (2/4) Epoch 3, batch 3650, loss[loss=0.2424, simple_loss=0.322, pruned_loss=0.08137, over 21715.00 frames. ], tot_loss[loss=0.301, simple_loss=0.359, pruned_loss=0.1215, over 4271736.05 frames. ], batch size: 298, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:01:56,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-19 05:02:01,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=387894.0, ans=0.0 2023-06-19 05:02:08,766 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.313e+02 3.848e+02 4.708e+02 1.033e+03, threshold=7.696e+02, percent-clipped=4.0 2023-06-19 05:02:40,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=388014.0, ans=0.125 2023-06-19 05:02:59,732 INFO [train.py:996] (2/4) Epoch 3, batch 3700, loss[loss=0.3183, simple_loss=0.4011, pruned_loss=0.1177, over 20916.00 frames. ], tot_loss[loss=0.2973, simple_loss=0.3562, pruned_loss=0.1191, over 4280396.93 frames. ], batch size: 608, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:03:47,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=388194.0, ans=0.0 2023-06-19 05:04:34,219 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-19 05:04:53,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=388374.0, ans=0.1 2023-06-19 05:04:56,371 INFO [train.py:996] (2/4) Epoch 3, batch 3750, loss[loss=0.258, simple_loss=0.3187, pruned_loss=0.09867, over 21751.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3546, pruned_loss=0.1181, over 4284012.15 frames. 
], batch size: 247, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:05:08,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=388434.0, ans=0.2 2023-06-19 05:05:22,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=388494.0, ans=0.1 2023-06-19 05:05:43,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 3.137e+02 4.357e+02 5.330e+02 7.776e+02, threshold=8.713e+02, percent-clipped=1.0 2023-06-19 05:06:04,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=388614.0, ans=0.1 2023-06-19 05:06:08,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=388614.0, ans=0.02 2023-06-19 05:06:12,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=388674.0, ans=0.125 2023-06-19 05:06:44,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=388674.0, ans=0.025 2023-06-19 05:06:46,821 INFO [train.py:996] (2/4) Epoch 3, batch 3800, loss[loss=0.2985, simple_loss=0.361, pruned_loss=0.118, over 21995.00 frames. ], tot_loss[loss=0.2936, simple_loss=0.3538, pruned_loss=0.1167, over 4289116.42 frames. ], batch size: 317, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:06:47,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=388734.0, ans=0.2 2023-06-19 05:06:55,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=388734.0, ans=0.125 2023-06-19 05:07:26,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=388854.0, ans=0.125 2023-06-19 05:08:23,806 INFO [train.py:996] (2/4) Epoch 3, batch 3850, loss[loss=0.2564, simple_loss=0.2986, pruned_loss=0.1071, over 21628.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3493, pruned_loss=0.1161, over 4279570.72 frames. ], batch size: 247, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:09:06,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=389154.0, ans=0.0 2023-06-19 05:09:10,451 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 3.065e+02 3.544e+02 4.567e+02 7.617e+02, threshold=7.087e+02, percent-clipped=0.0 2023-06-19 05:09:16,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=389154.0, ans=0.125 2023-06-19 05:09:23,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-19 05:10:07,271 INFO [train.py:996] (2/4) Epoch 3, batch 3900, loss[loss=0.287, simple_loss=0.3471, pruned_loss=0.1134, over 21727.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3452, pruned_loss=0.1151, over 4279753.79 frames. ], batch size: 389, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:11:51,742 INFO [train.py:996] (2/4) Epoch 3, batch 3950, loss[loss=0.2818, simple_loss=0.3274, pruned_loss=0.118, over 21641.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3466, pruned_loss=0.1143, over 4288585.57 frames. 
], batch size: 263, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:12:09,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=389634.0, ans=0.0 2023-06-19 05:12:24,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-19 05:12:28,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=389694.0, ans=0.125 2023-06-19 05:12:38,264 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.040e+02 3.554e+02 4.206e+02 5.675e+02, threshold=7.109e+02, percent-clipped=0.0 2023-06-19 05:12:49,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0 2023-06-19 05:12:56,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=389814.0, ans=0.2 2023-06-19 05:13:12,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=389874.0, ans=0.2 2023-06-19 05:13:31,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=389874.0, ans=0.2 2023-06-19 05:13:36,466 INFO [train.py:996] (2/4) Epoch 3, batch 4000, loss[loss=0.2682, simple_loss=0.3126, pruned_loss=0.1119, over 21632.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3385, pruned_loss=0.1104, over 4288162.49 frames. ], batch size: 282, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:15:17,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=390174.0, ans=0.1 2023-06-19 05:15:21,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-19 05:15:23,382 INFO [train.py:996] (2/4) Epoch 3, batch 4050, loss[loss=0.3113, simple_loss=0.3573, pruned_loss=0.1326, over 21837.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3383, pruned_loss=0.1082, over 4285599.36 frames. ], batch size: 124, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:15:56,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=390294.0, ans=0.125 2023-06-19 05:16:10,709 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.094e+02 3.976e+02 4.759e+02 9.787e+02, threshold=7.952e+02, percent-clipped=5.0 2023-06-19 05:16:12,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=390354.0, ans=0.125 2023-06-19 05:17:02,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=390474.0, ans=0.125 2023-06-19 05:17:13,613 INFO [train.py:996] (2/4) Epoch 3, batch 4100, loss[loss=0.2872, simple_loss=0.3467, pruned_loss=0.1139, over 21325.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3411, pruned_loss=0.1101, over 4293601.79 frames. 
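The optim.py:471 lines are periodic gradient-norm diagnostics: the five increasing numbers after "grad-norm quartiles" summarize the distribution of recently observed gradient norms (reading naturally as minimum, lower quartile, median, upper quartile and maximum), followed by the clipping threshold in effect and percent-clipped, the share of recent updates that ran into that threshold. The sketch below shows one way such a summary line can be produced from a window of recent norms; deriving the threshold as Clipping_scale times the median, and the function name itself, are assumptions made only for illustration.

import torch

def grad_norm_summary(recent_norms, clipping_scale: float = 2.0) -> str:
    """Summarize a window of recent gradient norms in the same shape as the
    optim.py "grad-norm quartiles ... threshold=..., percent-clipped=..." lines.

    Assumption (illustration only): the clipping threshold is taken to be
    clipping_scale times the median of the recent norms.
    """
    t = torch.tensor(recent_norms, dtype=torch.float32)
    quartiles = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
    threshold = clipping_scale * quartiles[2].item()          # scale * median
    pct_clipped = 100.0 * (t > threshold).float().mean().item()
    qs = " ".join(f"{v:.3e}" for v in quartiles.tolist())
    return (f"Clipping_scale={clipping_scale}, grad-norm quartiles {qs}, "
            f"threshold={threshold:.3e}, percent-clipped={pct_clipped:.1f}")

# Example with made-up norms; during training these would come from each optimizer step.
print(grad_norm_summary([210.0, 305.0, 355.0, 420.0, 560.0, 810.0]))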
], batch size: 176, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:17:36,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=390594.0, ans=0.2 2023-06-19 05:17:36,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-19 05:17:53,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=390654.0, ans=0.05 2023-06-19 05:18:58,809 INFO [train.py:996] (2/4) Epoch 3, batch 4150, loss[loss=0.2376, simple_loss=0.3067, pruned_loss=0.08422, over 21312.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3409, pruned_loss=0.1072, over 4284051.41 frames. ], batch size: 131, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:19:41,656 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 3.003e+02 3.732e+02 5.110e+02 9.922e+02, threshold=7.464e+02, percent-clipped=2.0 2023-06-19 05:19:44,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=390954.0, ans=0.05 2023-06-19 05:20:22,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=391074.0, ans=0.0 2023-06-19 05:20:45,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=391074.0, ans=0.125 2023-06-19 05:20:51,305 INFO [train.py:996] (2/4) Epoch 3, batch 4200, loss[loss=0.2322, simple_loss=0.3006, pruned_loss=0.08186, over 21457.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3409, pruned_loss=0.1068, over 4277072.68 frames. ], batch size: 195, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:21:48,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=391254.0, ans=0.0 2023-06-19 05:22:40,229 INFO [train.py:996] (2/4) Epoch 3, batch 4250, loss[loss=0.2638, simple_loss=0.3589, pruned_loss=0.08437, over 20737.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3476, pruned_loss=0.109, over 4274411.83 frames. ], batch size: 608, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:22:42,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=391434.0, ans=0.125 2023-06-19 05:23:30,825 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.309e+02 4.046e+02 4.889e+02 9.500e+02, threshold=8.092e+02, percent-clipped=4.0 2023-06-19 05:24:25,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=391674.0, ans=0.2 2023-06-19 05:24:25,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=391674.0, ans=0.2 2023-06-19 05:24:27,981 INFO [train.py:996] (2/4) Epoch 3, batch 4300, loss[loss=0.2939, simple_loss=0.3873, pruned_loss=0.1002, over 21657.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.3513, pruned_loss=0.1097, over 4277674.04 frames. 
], batch size: 414, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:24:50,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=391794.0, ans=0.0 2023-06-19 05:25:09,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=391794.0, ans=0.0 2023-06-19 05:25:27,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=391854.0, ans=0.07 2023-06-19 05:25:43,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=391914.0, ans=0.035 2023-06-19 05:25:55,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. limit=10.0 2023-06-19 05:25:56,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=391974.0, ans=0.1 2023-06-19 05:26:13,023 INFO [train.py:996] (2/4) Epoch 3, batch 4350, loss[loss=0.2605, simple_loss=0.3135, pruned_loss=0.1038, over 21611.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3513, pruned_loss=0.1087, over 4275264.90 frames. ], batch size: 298, lr: 1.18e-02, grad_scale: 16.0 2023-06-19 05:26:50,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=15.0 2023-06-19 05:27:08,358 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 3.136e+02 3.673e+02 4.293e+02 1.094e+03, threshold=7.346e+02, percent-clipped=4.0 2023-06-19 05:27:29,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392214.0, ans=0.1 2023-06-19 05:27:43,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.29 vs. limit=12.0 2023-06-19 05:27:59,258 INFO [train.py:996] (2/4) Epoch 3, batch 4400, loss[loss=0.2607, simple_loss=0.3484, pruned_loss=0.0865, over 21770.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3483, pruned_loss=0.1086, over 4261750.40 frames. ], batch size: 352, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:28:46,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=392394.0, ans=0.1 2023-06-19 05:28:50,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=12.0 2023-06-19 05:29:33,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-19 05:29:39,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=392574.0, ans=0.2 2023-06-19 05:29:49,787 INFO [train.py:996] (2/4) Epoch 3, batch 4450, loss[loss=0.3123, simple_loss=0.3826, pruned_loss=0.121, over 21407.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3573, pruned_loss=0.1114, over 4267453.73 frames. 
], batch size: 211, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:29:50,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=392634.0, ans=0.5 2023-06-19 05:29:51,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-19 05:30:00,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=392634.0, ans=0.1 2023-06-19 05:30:40,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.075e+02 3.016e+02 3.680e+02 4.427e+02 7.679e+02, threshold=7.360e+02, percent-clipped=2.0 2023-06-19 05:30:40,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392754.0, ans=0.1 2023-06-19 05:30:45,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=392754.0, ans=0.07 2023-06-19 05:31:00,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.76 vs. limit=15.0 2023-06-19 05:31:22,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=392874.0, ans=0.125 2023-06-19 05:31:32,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=392874.0, ans=0.125 2023-06-19 05:31:36,784 INFO [train.py:996] (2/4) Epoch 3, batch 4500, loss[loss=0.2873, simple_loss=0.3633, pruned_loss=0.1056, over 21733.00 frames. ], tot_loss[loss=0.2937, simple_loss=0.3592, pruned_loss=0.114, over 4276192.48 frames. ], batch size: 247, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:32:06,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=392994.0, ans=0.1 2023-06-19 05:32:43,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=393114.0, ans=0.125 2023-06-19 05:33:28,843 INFO [train.py:996] (2/4) Epoch 3, batch 4550, loss[loss=0.3415, simple_loss=0.3967, pruned_loss=0.1432, over 21866.00 frames. ], tot_loss[loss=0.2972, simple_loss=0.3633, pruned_loss=0.1156, over 4278321.65 frames. ], batch size: 282, lr: 1.18e-02, grad_scale: 32.0 2023-06-19 05:33:52,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-19 05:34:13,201 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.526e+02 4.619e+02 6.028e+02 1.155e+03, threshold=9.238e+02, percent-clipped=14.0 2023-06-19 05:34:17,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=393354.0, ans=0.09899494936611666 2023-06-19 05:35:14,629 INFO [train.py:996] (2/4) Epoch 3, batch 4600, loss[loss=0.2684, simple_loss=0.3317, pruned_loss=0.1025, over 21730.00 frames. ], tot_loss[loss=0.3002, simple_loss=0.3654, pruned_loss=0.1174, over 4279893.87 frames. 
], batch size: 247, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:35:37,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=393594.0, ans=0.05 2023-06-19 05:35:42,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=393594.0, ans=0.125 2023-06-19 05:36:27,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=393714.0, ans=0.0 2023-06-19 05:36:42,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=393774.0, ans=0.125 2023-06-19 05:37:01,743 INFO [train.py:996] (2/4) Epoch 3, batch 4650, loss[loss=0.2209, simple_loss=0.2903, pruned_loss=0.07569, over 21898.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3592, pruned_loss=0.1152, over 4280843.28 frames. ], batch size: 118, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:37:29,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=393894.0, ans=0.0 2023-06-19 05:37:44,149 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.775e+02 3.310e+02 3.723e+02 7.638e+02, threshold=6.620e+02, percent-clipped=0.0 2023-06-19 05:37:52,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=393954.0, ans=0.2 2023-06-19 05:38:14,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=394014.0, ans=0.125 2023-06-19 05:38:37,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.40 vs. limit=10.0 2023-06-19 05:38:39,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=394074.0, ans=0.125 2023-06-19 05:38:45,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=394074.0, ans=0.0 2023-06-19 05:38:53,960 INFO [train.py:996] (2/4) Epoch 3, batch 4700, loss[loss=0.2771, simple_loss=0.3198, pruned_loss=0.1172, over 21705.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3478, pruned_loss=0.1117, over 4285979.93 frames. ], batch size: 283, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:39:01,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=394134.0, ans=0.0 2023-06-19 05:39:06,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=394134.0, ans=0.125 2023-06-19 05:39:22,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=394194.0, ans=0.125 2023-06-19 05:39:52,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=394254.0, ans=0.0 2023-06-19 05:39:55,282 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:40:32,953 INFO [train.py:996] (2/4) Epoch 3, batch 4750, loss[loss=0.2649, simple_loss=0.3181, pruned_loss=0.1059, over 21810.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3449, pruned_loss=0.1124, over 4286686.39 frames. 
], batch size: 282, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:41:21,974 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 3.052e+02 3.874e+02 5.001e+02 1.083e+03, threshold=7.748e+02, percent-clipped=9.0 2023-06-19 05:41:24,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=394554.0, ans=0.125 2023-06-19 05:41:38,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=15.0 2023-06-19 05:41:43,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-19 05:42:25,170 INFO [train.py:996] (2/4) Epoch 3, batch 4800, loss[loss=0.2824, simple_loss=0.328, pruned_loss=0.1184, over 20291.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3448, pruned_loss=0.1131, over 4278759.66 frames. ], batch size: 703, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:42:39,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.88 vs. limit=10.0 2023-06-19 05:42:40,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=394794.0, ans=0.0 2023-06-19 05:42:40,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=394794.0, ans=0.05 2023-06-19 05:44:10,699 INFO [train.py:996] (2/4) Epoch 3, batch 4850, loss[loss=0.2638, simple_loss=0.3285, pruned_loss=0.09959, over 21884.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3442, pruned_loss=0.1109, over 4281095.35 frames. ], batch size: 118, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:44:54,937 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.516e+02 4.493e+02 6.112e+02 1.101e+03, threshold=8.986e+02, percent-clipped=11.0 2023-06-19 05:45:20,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=395214.0, ans=0.1 2023-06-19 05:45:28,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=395274.0, ans=0.125 2023-06-19 05:45:52,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=395274.0, ans=0.125 2023-06-19 05:45:55,770 INFO [train.py:996] (2/4) Epoch 3, batch 4900, loss[loss=0.3165, simple_loss=0.3933, pruned_loss=0.1198, over 21648.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3452, pruned_loss=0.1117, over 4285910.24 frames. ], batch size: 389, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:46:16,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=395394.0, ans=10.0 2023-06-19 05:46:23,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=395394.0, ans=0.1 2023-06-19 05:46:50,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.01 vs. 
limit=15.0 2023-06-19 05:46:56,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=395514.0, ans=22.5 2023-06-19 05:46:59,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2023-06-19 05:47:10,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=395514.0, ans=0.125 2023-06-19 05:47:27,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=395574.0, ans=0.125 2023-06-19 05:47:34,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-19 05:47:37,536 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-19 05:47:41,178 INFO [train.py:996] (2/4) Epoch 3, batch 4950, loss[loss=0.2508, simple_loss=0.2894, pruned_loss=0.1061, over 20038.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3464, pruned_loss=0.1087, over 4285426.80 frames. ], batch size: 704, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:47:51,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=395634.0, ans=0.2 2023-06-19 05:47:54,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=395634.0, ans=0.125 2023-06-19 05:48:16,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=395694.0, ans=0.2 2023-06-19 05:48:31,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.805e+02 3.354e+02 4.068e+02 9.306e+02, threshold=6.708e+02, percent-clipped=1.0 2023-06-19 05:48:32,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=395754.0, ans=0.125 2023-06-19 05:49:02,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=395814.0, ans=0.1 2023-06-19 05:49:27,322 INFO [train.py:996] (2/4) Epoch 3, batch 5000, loss[loss=0.2309, simple_loss=0.3161, pruned_loss=0.07281, over 21604.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3439, pruned_loss=0.1045, over 4279486.21 frames. ], batch size: 230, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:50:27,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=396114.0, ans=0.0 2023-06-19 05:50:42,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=396114.0, ans=0.1 2023-06-19 05:51:12,149 INFO [train.py:996] (2/4) Epoch 3, batch 5050, loss[loss=0.3139, simple_loss=0.3816, pruned_loss=0.1231, over 21442.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3452, pruned_loss=0.1059, over 4271482.45 frames. 
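The scaling.py:962 Whitening lines each report a statistic measured on some module's activations (metric) alongside the limit currently scheduled for that module. The exact statistic is not reproduced in this log, so the sketch below only illustrates the general idea with a common whiteness measure: how unevenly the eigenvalues of the activation covariance are spread. The function name and the test data are assumptions, not the recipe's actual formula.

import torch

def whiteness_metric(activations: torch.Tensor) -> float:
    """A simple whiteness measure for (num_frames, num_channels) activations:
    mean squared eigenvalue of the covariance divided by the squared mean
    eigenvalue.  It equals 1.0 when all eigenvalues are equal (perfectly
    "white" features) and grows as the covariance becomes dominated by a few
    directions.  Illustrative only.
    """
    x = activations - activations.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)          # real eigenvalues, ascending
    return float((eigs ** 2).mean() / eigs.mean() ** 2)

white = torch.randn(10000, 256)
# Every channel a copy of channel 0 plus a little noise: strongly correlated features.
correlated = white[:, :1].repeat(1, 256) + 0.1 * torch.randn(10000, 256)
print(whiteness_metric(white))        # close to 1
print(whiteness_metric(correlated))   # far above 1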
], batch size: 548, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:51:56,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.428e+02 4.062e+02 4.972e+02 8.550e+02, threshold=8.125e+02, percent-clipped=7.0 2023-06-19 05:52:48,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=396474.0, ans=0.125 2023-06-19 05:52:52,348 INFO [train.py:996] (2/4) Epoch 3, batch 5100, loss[loss=0.298, simple_loss=0.3497, pruned_loss=0.1231, over 21787.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3422, pruned_loss=0.1062, over 4274437.38 frames. ], batch size: 441, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:53:04,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=396534.0, ans=0.05 2023-06-19 05:53:04,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=396534.0, ans=0.125 2023-06-19 05:53:26,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-19 05:53:57,212 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:54:37,016 INFO [train.py:996] (2/4) Epoch 3, batch 5150, loss[loss=0.3353, simple_loss=0.3923, pruned_loss=0.1392, over 21554.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3421, pruned_loss=0.1074, over 4283776.33 frames. ], batch size: 471, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:54:52,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=396894.0, ans=0.0 2023-06-19 05:55:11,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-19 05:55:27,279 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.247e+02 3.957e+02 4.711e+02 9.896e+02, threshold=7.915e+02, percent-clipped=1.0 2023-06-19 05:55:27,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=396954.0, ans=0.0 2023-06-19 05:55:43,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=397014.0, ans=0.125 2023-06-19 05:55:48,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=397014.0, ans=0.0 2023-06-19 05:55:56,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=397014.0, ans=0.2 2023-06-19 05:56:17,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=397074.0, ans=0.2 2023-06-19 05:56:18,902 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:56:19,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-19 05:56:21,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. 
limit=6.0 2023-06-19 05:56:23,676 INFO [train.py:996] (2/4) Epoch 3, batch 5200, loss[loss=0.2957, simple_loss=0.3759, pruned_loss=0.1078, over 21492.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3433, pruned_loss=0.1087, over 4288332.15 frames. ], batch size: 211, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:58:09,854 INFO [train.py:996] (2/4) Epoch 3, batch 5250, loss[loss=0.2328, simple_loss=0.312, pruned_loss=0.07684, over 21368.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.3504, pruned_loss=0.1092, over 4285827.48 frames. ], batch size: 176, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 05:58:11,848 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 05:58:59,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 3.349e+02 3.883e+02 5.144e+02 8.715e+02, threshold=7.765e+02, percent-clipped=1.0 2023-06-19 05:59:53,213 INFO [train.py:996] (2/4) Epoch 3, batch 5300, loss[loss=0.3085, simple_loss=0.3586, pruned_loss=0.1292, over 21596.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3499, pruned_loss=0.1094, over 4286965.61 frames. ], batch size: 195, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:00:00,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=397734.0, ans=0.125 2023-06-19 06:00:00,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2023-06-19 06:00:18,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=397794.0, ans=0.0 2023-06-19 06:00:50,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=397914.0, ans=0.125 2023-06-19 06:00:57,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=397914.0, ans=0.1 2023-06-19 06:01:33,547 INFO [train.py:996] (2/4) Epoch 3, batch 5350, loss[loss=0.2743, simple_loss=0.332, pruned_loss=0.1083, over 21557.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3492, pruned_loss=0.1111, over 4294737.33 frames. ], batch size: 194, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:01:42,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398034.0, ans=0.1 2023-06-19 06:02:23,869 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.150e+02 3.600e+02 4.539e+02 9.021e+02, threshold=7.200e+02, percent-clipped=2.0 2023-06-19 06:02:24,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=398154.0, ans=0.0 2023-06-19 06:02:53,211 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:03:18,181 INFO [train.py:996] (2/4) Epoch 3, batch 5400, loss[loss=0.2669, simple_loss=0.3272, pruned_loss=0.1033, over 21904.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3477, pruned_loss=0.1123, over 4294582.53 frames. 
], batch size: 351, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:03:27,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=398334.0, ans=0.0 2023-06-19 06:03:48,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=398394.0, ans=0.1 2023-06-19 06:04:55,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=398574.0, ans=0.09899494936611666 2023-06-19 06:05:02,984 INFO [train.py:996] (2/4) Epoch 3, batch 5450, loss[loss=0.4148, simple_loss=0.4785, pruned_loss=0.1755, over 21517.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3474, pruned_loss=0.1097, over 4290179.42 frames. ], batch size: 507, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:05:05,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=398634.0, ans=0.2 2023-06-19 06:05:06,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=398634.0, ans=0.0 2023-06-19 06:05:35,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=398694.0, ans=0.125 2023-06-19 06:06:00,071 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.869e+02 3.395e+02 4.566e+02 8.866e+02, threshold=6.789e+02, percent-clipped=3.0 2023-06-19 06:06:19,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=398814.0, ans=0.125 2023-06-19 06:06:32,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=398814.0, ans=0.125 2023-06-19 06:06:57,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=398874.0, ans=0.1 2023-06-19 06:07:02,456 INFO [train.py:996] (2/4) Epoch 3, batch 5500, loss[loss=0.3124, simple_loss=0.3595, pruned_loss=0.1327, over 21604.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3504, pruned_loss=0.1055, over 4282953.35 frames. ], batch size: 548, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:08:20,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=399114.0, ans=0.125 2023-06-19 06:08:46,545 INFO [train.py:996] (2/4) Epoch 3, batch 5550, loss[loss=0.1585, simple_loss=0.2053, pruned_loss=0.0558, over 16022.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3466, pruned_loss=0.1015, over 4271537.76 frames. ], batch size: 60, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:08:51,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=399234.0, ans=0.125 2023-06-19 06:09:12,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=399294.0, ans=0.1 2023-06-19 06:09:38,649 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.077e+02 2.755e+02 3.259e+02 4.197e+02 7.319e+02, threshold=6.518e+02, percent-clipped=2.0 2023-06-19 06:10:34,019 INFO [train.py:996] (2/4) Epoch 3, batch 5600, loss[loss=0.2866, simple_loss=0.365, pruned_loss=0.1041, over 20788.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3423, pruned_loss=0.09784, over 4277137.58 frames. 
], batch size: 607, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:10:36,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=399534.0, ans=0.125 2023-06-19 06:11:11,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=399594.0, ans=0.125 2023-06-19 06:11:34,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=399654.0, ans=0.0 2023-06-19 06:12:17,385 INFO [train.py:996] (2/4) Epoch 3, batch 5650, loss[loss=0.3148, simple_loss=0.3623, pruned_loss=0.1336, over 21789.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3498, pruned_loss=0.1021, over 4271382.19 frames. ], batch size: 112, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:12:53,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2023-06-19 06:13:13,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 3.102e+02 3.958e+02 5.147e+02 8.863e+02, threshold=7.916e+02, percent-clipped=12.0 2023-06-19 06:14:16,687 INFO [train.py:996] (2/4) Epoch 3, batch 5700, loss[loss=0.2649, simple_loss=0.3185, pruned_loss=0.1056, over 21257.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3493, pruned_loss=0.1038, over 4273189.11 frames. ], batch size: 608, lr: 1.17e-02, grad_scale: 32.0 2023-06-19 06:14:35,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=400194.0, ans=15.0 2023-06-19 06:14:49,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=400194.0, ans=0.0 2023-06-19 06:15:20,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=400314.0, ans=0.125 2023-06-19 06:15:35,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=400314.0, ans=0.1 2023-06-19 06:15:36,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=400314.0, ans=0.125 2023-06-19 06:15:59,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=400374.0, ans=0.0 2023-06-19 06:16:03,941 INFO [train.py:996] (2/4) Epoch 3, batch 5750, loss[loss=0.2797, simple_loss=0.3608, pruned_loss=0.09924, over 21606.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.346, pruned_loss=0.1006, over 4278311.29 frames. ], batch size: 389, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:16:06,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.02 vs. 
limit=15.0 2023-06-19 06:16:19,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=400434.0, ans=0.0 2023-06-19 06:16:36,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=400494.0, ans=0.0 2023-06-19 06:16:54,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.901e+02 3.353e+02 4.192e+02 8.562e+02, threshold=6.706e+02, percent-clipped=1.0 2023-06-19 06:16:58,887 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.86 vs. limit=15.0 2023-06-19 06:17:16,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=400614.0, ans=0.1 2023-06-19 06:17:18,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.91 vs. limit=15.0 2023-06-19 06:17:48,978 INFO [train.py:996] (2/4) Epoch 3, batch 5800, loss[loss=0.3106, simple_loss=0.3998, pruned_loss=0.1107, over 21668.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3455, pruned_loss=0.09928, over 4275259.43 frames. ], batch size: 441, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:18:22,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. limit=10.0 2023-06-19 06:19:26,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=400974.0, ans=0.125 2023-06-19 06:19:41,066 INFO [train.py:996] (2/4) Epoch 3, batch 5850, loss[loss=0.2029, simple_loss=0.3003, pruned_loss=0.05275, over 21693.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3398, pruned_loss=0.09362, over 4271743.09 frames. ], batch size: 247, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:20:24,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=401154.0, ans=0.125 2023-06-19 06:20:32,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.620e+02 2.443e+02 2.875e+02 3.533e+02 5.012e+02, threshold=5.751e+02, percent-clipped=0.0 2023-06-19 06:21:31,765 INFO [train.py:996] (2/4) Epoch 3, batch 5900, loss[loss=0.295, simple_loss=0.3846, pruned_loss=0.1027, over 21153.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3328, pruned_loss=0.08761, over 4272128.20 frames. ], batch size: 548, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:21:58,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=401394.0, ans=0.1 2023-06-19 06:22:43,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.47 vs. limit=22.5 2023-06-19 06:22:58,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=401574.0, ans=0.1 2023-06-19 06:23:13,921 INFO [train.py:996] (2/4) Epoch 3, batch 5950, loss[loss=0.2762, simple_loss=0.3204, pruned_loss=0.116, over 21812.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3349, pruned_loss=0.09417, over 4275652.18 frames. 
], batch size: 112, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:23:47,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=401754.0, ans=0.125 2023-06-19 06:23:58,575 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.593e+02 2.851e+02 3.351e+02 4.142e+02 6.067e+02, threshold=6.702e+02, percent-clipped=3.0 2023-06-19 06:24:50,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-19 06:25:00,260 INFO [train.py:996] (2/4) Epoch 3, batch 6000, loss[loss=0.254, simple_loss=0.3323, pruned_loss=0.08787, over 20020.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3321, pruned_loss=0.09862, over 4276282.61 frames. ], batch size: 702, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:25:00,261 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 06:25:17,476 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2818, simple_loss=0.374, pruned_loss=0.0948, over 1796401.00 frames. 2023-06-19 06:25:17,478 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 06:25:24,735 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:25:53,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=401994.0, ans=0.125 2023-06-19 06:25:53,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=401994.0, ans=0.035 2023-06-19 06:26:15,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-19 06:26:16,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=402054.0, ans=0.125 2023-06-19 06:26:32,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=402114.0, ans=0.04949747468305833 2023-06-19 06:26:59,528 INFO [train.py:996] (2/4) Epoch 3, batch 6050, loss[loss=0.2682, simple_loss=0.3107, pruned_loss=0.1128, over 21514.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3274, pruned_loss=0.1001, over 4270526.88 frames. 
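At Epoch 3, batch 6000 the loop pauses to run validation: the train.py:1019/1028/1029 lines report a loss accumulated over the dev set (1796401 frames here) and then the peak GPU memory seen so far. A minimal sketch of that pattern is below; valid_interval, the compute_loss callable and the frame accounting are placeholders for illustration, not the recipe's actual code.

import torch

def maybe_validate(model, valid_loader, compute_loss, batch_idx: int,
                   valid_interval: int) -> None:
    """Every valid_interval training batches, evaluate on the dev loader and
    report the frame-weighted average loss plus peak GPU memory, mirroring the
    "Computing validation loss" entries in the log.  compute_loss(model, batch)
    is a placeholder expected to return (loss, num_frames).
    """
    if batch_idx % valid_interval != 0:
        return
    print("Computing validation loss")
    model.eval()
    total, frames = 0.0, 0
    with torch.no_grad():
        for batch in valid_loader:
            loss, num_frames = compute_loss(model, batch)
            total += float(loss) * num_frames
            frames += num_frames
    model.train()
    print(f"validation: loss={total / frames:.4f}, over {frames:.2f} frames.")
    if torch.cuda.is_available():
        peak_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
        print(f"Maximum memory allocated so far is {peak_mb}MB")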
], batch size: 391, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:27:04,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=402234.0, ans=0.0 2023-06-19 06:27:07,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=402234.0, ans=0.125 2023-06-19 06:27:38,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=402294.0, ans=0.0 2023-06-19 06:27:44,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=402354.0, ans=0.0 2023-06-19 06:27:49,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.758e+02 2.918e+02 3.547e+02 4.372e+02 9.416e+02, threshold=7.093e+02, percent-clipped=6.0 2023-06-19 06:28:23,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=402414.0, ans=0.1 2023-06-19 06:28:43,525 INFO [train.py:996] (2/4) Epoch 3, batch 6100, loss[loss=0.3288, simple_loss=0.3736, pruned_loss=0.142, over 17163.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3237, pruned_loss=0.09739, over 4266694.48 frames. ], batch size: 60, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:28:51,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-19 06:29:08,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=402594.0, ans=0.0 2023-06-19 06:30:00,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=402714.0, ans=0.125 2023-06-19 06:30:02,606 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-19 06:30:13,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=402774.0, ans=0.0 2023-06-19 06:30:15,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=402774.0, ans=0.0 2023-06-19 06:30:30,048 INFO [train.py:996] (2/4) Epoch 3, batch 6150, loss[loss=0.2582, simple_loss=0.324, pruned_loss=0.09619, over 21777.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3282, pruned_loss=0.1005, over 4270895.61 frames. ], batch size: 333, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:31:09,318 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.84 vs. limit=6.0 2023-06-19 06:31:21,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.787e+02 3.076e+02 3.570e+02 4.379e+02 8.300e+02, threshold=7.140e+02, percent-clipped=3.0 2023-06-19 06:31:21,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=402954.0, ans=0.125 2023-06-19 06:31:54,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.40 vs. limit=15.0 2023-06-19 06:32:15,748 INFO [train.py:996] (2/4) Epoch 3, batch 6200, loss[loss=0.3196, simple_loss=0.3739, pruned_loss=0.1327, over 21767.00 frames. 
], tot_loss[loss=0.264, simple_loss=0.3293, pruned_loss=0.09932, over 4264521.51 frames. ], batch size: 247, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:32:39,239 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:32:43,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.06 vs. limit=10.0 2023-06-19 06:32:50,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-19 06:33:00,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=403254.0, ans=0.035 2023-06-19 06:33:22,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=403254.0, ans=0.2 2023-06-19 06:33:44,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=403374.0, ans=0.2 2023-06-19 06:33:49,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=403374.0, ans=0.125 2023-06-19 06:34:08,034 INFO [train.py:996] (2/4) Epoch 3, batch 6250, loss[loss=0.2277, simple_loss=0.3192, pruned_loss=0.06814, over 21374.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3344, pruned_loss=0.09895, over 4270151.23 frames. ], batch size: 211, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:35:04,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=403554.0, ans=0.125 2023-06-19 06:35:08,840 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 3.015e+02 3.767e+02 4.898e+02 1.129e+03, threshold=7.534e+02, percent-clipped=8.0 2023-06-19 06:35:51,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=403674.0, ans=0.5 2023-06-19 06:36:01,215 INFO [train.py:996] (2/4) Epoch 3, batch 6300, loss[loss=0.2699, simple_loss=0.316, pruned_loss=0.1119, over 20272.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3382, pruned_loss=0.09858, over 4273705.15 frames. ], batch size: 703, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:36:08,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=403734.0, ans=0.0 2023-06-19 06:36:43,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=403854.0, ans=0.0 2023-06-19 06:37:36,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=403974.0, ans=0.1 2023-06-19 06:37:45,962 INFO [train.py:996] (2/4) Epoch 3, batch 6350, loss[loss=0.3167, simple_loss=0.4156, pruned_loss=0.1089, over 21295.00 frames. ], tot_loss[loss=0.2752, simple_loss=0.3432, pruned_loss=0.1036, over 4284119.00 frames. 
], batch size: 548, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:37:55,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=404034.0, ans=0.09899494936611666 2023-06-19 06:38:38,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.497e+02 3.096e+02 3.646e+02 4.304e+02 8.936e+02, threshold=7.293e+02, percent-clipped=1.0 2023-06-19 06:39:16,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=404274.0, ans=0.0 2023-06-19 06:39:26,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=404274.0, ans=0.0 2023-06-19 06:39:31,465 INFO [train.py:996] (2/4) Epoch 3, batch 6400, loss[loss=0.3567, simple_loss=0.407, pruned_loss=0.1532, over 21832.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3511, pruned_loss=0.1079, over 4276799.76 frames. ], batch size: 441, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:39:45,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404334.0, ans=0.1 2023-06-19 06:40:56,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=404514.0, ans=0.2 2023-06-19 06:41:05,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-19 06:41:07,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-19 06:41:08,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=404574.0, ans=0.1 2023-06-19 06:41:08,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=404574.0, ans=0.2 2023-06-19 06:41:09,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=404574.0, ans=0.2 2023-06-19 06:41:17,049 INFO [train.py:996] (2/4) Epoch 3, batch 6450, loss[loss=0.2376, simple_loss=0.3308, pruned_loss=0.07222, over 21854.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3556, pruned_loss=0.1084, over 4278682.49 frames. ], batch size: 371, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:41:54,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.16 vs. limit=10.0 2023-06-19 06:42:09,382 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 3.087e+02 4.205e+02 5.976e+02 1.329e+03, threshold=8.410e+02, percent-clipped=11.0 2023-06-19 06:42:10,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.90 vs. limit=15.0 2023-06-19 06:43:02,066 INFO [train.py:996] (2/4) Epoch 3, batch 6500, loss[loss=0.2346, simple_loss=0.308, pruned_loss=0.08062, over 21571.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3471, pruned_loss=0.1059, over 4275204.55 frames. 
], batch size: 230, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:44:13,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=405114.0, ans=0.125 2023-06-19 06:44:20,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=405114.0, ans=0.125 2023-06-19 06:44:22,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=405114.0, ans=0.0 2023-06-19 06:44:48,612 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.72 vs. limit=15.0 2023-06-19 06:44:54,419 INFO [train.py:996] (2/4) Epoch 3, batch 6550, loss[loss=0.3542, simple_loss=0.3928, pruned_loss=0.1578, over 21631.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3466, pruned_loss=0.1054, over 4281727.74 frames. ], batch size: 507, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:45:03,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=405234.0, ans=0.2 2023-06-19 06:45:16,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=405294.0, ans=15.0 2023-06-19 06:45:41,976 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.916e+02 3.526e+02 4.372e+02 9.339e+02, threshold=7.052e+02, percent-clipped=1.0 2023-06-19 06:45:55,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=405414.0, ans=10.0 2023-06-19 06:46:03,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=405414.0, ans=10.0 2023-06-19 06:46:27,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=405474.0, ans=0.125 2023-06-19 06:46:38,343 INFO [train.py:996] (2/4) Epoch 3, batch 6600, loss[loss=0.2624, simple_loss=0.3082, pruned_loss=0.1083, over 21528.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3413, pruned_loss=0.106, over 4284602.57 frames. ], batch size: 414, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:46:39,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=405534.0, ans=0.1 2023-06-19 06:47:17,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=405654.0, ans=0.0 2023-06-19 06:47:25,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=15.08 vs. limit=15.0 2023-06-19 06:48:23,554 INFO [train.py:996] (2/4) Epoch 3, batch 6650, loss[loss=0.2713, simple_loss=0.3228, pruned_loss=0.1099, over 21578.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3315, pruned_loss=0.1023, over 4274599.91 frames. ], batch size: 391, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:48:54,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. 
limit=15.0 2023-06-19 06:49:17,960 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.970e+02 3.451e+02 4.323e+02 7.420e+02, threshold=6.902e+02, percent-clipped=1.0 2023-06-19 06:49:23,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=406014.0, ans=0.1 2023-06-19 06:49:37,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=406014.0, ans=0.125 2023-06-19 06:50:07,760 INFO [train.py:996] (2/4) Epoch 3, batch 6700, loss[loss=0.2525, simple_loss=0.3105, pruned_loss=0.09724, over 21462.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3256, pruned_loss=0.102, over 4279725.17 frames. ], batch size: 212, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:50:14,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=406134.0, ans=0.125 2023-06-19 06:50:28,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=406194.0, ans=0.0 2023-06-19 06:51:09,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=406314.0, ans=0.125 2023-06-19 06:51:52,468 INFO [train.py:996] (2/4) Epoch 3, batch 6750, loss[loss=0.2816, simple_loss=0.3247, pruned_loss=0.1192, over 21270.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3247, pruned_loss=0.1032, over 4270818.84 frames. ], batch size: 176, lr: 1.16e-02, grad_scale: 16.0 2023-06-19 06:51:55,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-19 06:52:11,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=406494.0, ans=0.125 2023-06-19 06:52:22,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=406494.0, ans=0.0 2023-06-19 06:52:40,106 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.875e+02 3.278e+02 4.228e+02 8.254e+02, threshold=6.556e+02, percent-clipped=2.0 2023-06-19 06:53:07,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=406674.0, ans=0.125 2023-06-19 06:53:08,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=406674.0, ans=0.0 2023-06-19 06:53:08,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=406674.0, ans=0.0 2023-06-19 06:53:35,008 INFO [train.py:996] (2/4) Epoch 3, batch 6800, loss[loss=0.2935, simple_loss=0.3344, pruned_loss=0.1263, over 21574.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3273, pruned_loss=0.1063, over 4280593.19 frames. 
], batch size: 473, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:54:22,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=406854.0, ans=0.125 2023-06-19 06:54:46,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=406914.0, ans=0.1 2023-06-19 06:54:55,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=406974.0, ans=0.125 2023-06-19 06:55:19,145 INFO [train.py:996] (2/4) Epoch 3, batch 6850, loss[loss=0.2799, simple_loss=0.3281, pruned_loss=0.1158, over 21348.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3282, pruned_loss=0.1073, over 4278075.27 frames. ], batch size: 176, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:55:52,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-19 06:55:53,571 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 06:56:08,182 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.177e+02 3.620e+02 4.749e+02 9.271e+02, threshold=7.240e+02, percent-clipped=3.0 2023-06-19 06:57:00,807 INFO [train.py:996] (2/4) Epoch 3, batch 6900, loss[loss=0.2426, simple_loss=0.3094, pruned_loss=0.08792, over 21407.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.331, pruned_loss=0.1076, over 4280984.19 frames. ], batch size: 211, lr: 1.16e-02, grad_scale: 32.0 2023-06-19 06:57:10,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=407334.0, ans=0.125 2023-06-19 06:57:18,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=407394.0, ans=0.125 2023-06-19 06:57:57,726 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.73 vs. limit=12.0 2023-06-19 06:58:45,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=407634.0, ans=0.125 2023-06-19 06:58:46,755 INFO [train.py:996] (2/4) Epoch 3, batch 6950, loss[loss=0.2884, simple_loss=0.3551, pruned_loss=0.1109, over 21359.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3318, pruned_loss=0.1039, over 4277814.39 frames. ], batch size: 159, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 06:58:54,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=407634.0, ans=0.0 2023-06-19 06:59:21,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-19 06:59:42,599 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.993e+02 3.659e+02 4.526e+02 7.412e+02, threshold=7.319e+02, percent-clipped=1.0 2023-06-19 07:00:08,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=407814.0, ans=0.125 2023-06-19 07:00:32,095 INFO [train.py:996] (2/4) Epoch 3, batch 7000, loss[loss=0.2932, simple_loss=0.3302, pruned_loss=0.1281, over 21803.00 frames. ], tot_loss[loss=0.2761, simple_loss=0.3363, pruned_loss=0.1079, over 4272903.32 frames. 
], batch size: 352, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:00:34,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=407934.0, ans=0.04949747468305833 2023-06-19 07:01:00,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=407994.0, ans=0.125 2023-06-19 07:01:06,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=407994.0, ans=0.125 2023-06-19 07:01:41,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=408114.0, ans=0.125 2023-06-19 07:01:54,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=408114.0, ans=0.125 2023-06-19 07:02:00,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.67 vs. limit=10.0 2023-06-19 07:02:15,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-19 07:02:16,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.61 vs. limit=10.0 2023-06-19 07:02:19,518 INFO [train.py:996] (2/4) Epoch 3, batch 7050, loss[loss=0.3411, simple_loss=0.412, pruned_loss=0.1351, over 19973.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.334, pruned_loss=0.1059, over 4276967.60 frames. ], batch size: 702, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:02:43,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=408294.0, ans=0.0 2023-06-19 07:03:07,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.58 vs. limit=15.0 2023-06-19 07:03:19,727 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.081e+02 3.762e+02 4.566e+02 1.137e+03, threshold=7.524e+02, percent-clipped=2.0 2023-06-19 07:04:10,042 INFO [train.py:996] (2/4) Epoch 3, batch 7100, loss[loss=0.3929, simple_loss=0.4274, pruned_loss=0.1792, over 21357.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3383, pruned_loss=0.1081, over 4274333.03 frames. 
], batch size: 507, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:04:22,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=408534.0, ans=0.2 2023-06-19 07:04:24,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=408534.0, ans=0.0 2023-06-19 07:05:10,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=408654.0, ans=0.05 2023-06-19 07:05:25,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=408714.0, ans=0.125 2023-06-19 07:05:31,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=408714.0, ans=0.125 2023-06-19 07:05:38,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2023-06-19 07:05:55,535 INFO [train.py:996] (2/4) Epoch 3, batch 7150, loss[loss=0.2938, simple_loss=0.3578, pruned_loss=0.1149, over 21597.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3347, pruned_loss=0.105, over 4278749.69 frames. ], batch size: 389, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:06:01,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=408834.0, ans=0.125 2023-06-19 07:06:15,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=408834.0, ans=0.0 2023-06-19 07:06:16,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=408894.0, ans=0.5 2023-06-19 07:06:30,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.16 vs. limit=22.5 2023-06-19 07:06:57,326 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 3.041e+02 3.413e+02 3.887e+02 5.883e+02, threshold=6.826e+02, percent-clipped=0.0 2023-06-19 07:07:39,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=409134.0, ans=0.125 2023-06-19 07:07:40,895 INFO [train.py:996] (2/4) Epoch 3, batch 7200, loss[loss=0.2842, simple_loss=0.3309, pruned_loss=0.1188, over 21695.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3379, pruned_loss=0.1077, over 4276593.29 frames. ], batch size: 282, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:07:53,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=409134.0, ans=0.025 2023-06-19 07:07:55,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409134.0, ans=0.1 2023-06-19 07:08:12,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=409194.0, ans=0.1 2023-06-19 07:09:32,329 INFO [train.py:996] (2/4) Epoch 3, batch 7250, loss[loss=0.2836, simple_loss=0.3233, pruned_loss=0.122, over 21311.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3337, pruned_loss=0.1084, over 4275941.20 frames. 
], batch size: 473, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:09:49,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=409494.0, ans=0.125 2023-06-19 07:10:00,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-19 07:10:24,989 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.049e+02 3.903e+02 5.201e+02 1.242e+03, threshold=7.806e+02, percent-clipped=6.0 2023-06-19 07:10:32,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=409614.0, ans=0.125 2023-06-19 07:11:07,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=409674.0, ans=0.1 2023-06-19 07:11:13,546 INFO [train.py:996] (2/4) Epoch 3, batch 7300, loss[loss=0.2786, simple_loss=0.3186, pruned_loss=0.1193, over 21517.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3276, pruned_loss=0.1072, over 4276220.03 frames. ], batch size: 442, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:11:14,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.98 vs. limit=12.0 2023-06-19 07:11:23,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-19 07:11:54,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-19 07:12:21,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=409914.0, ans=0.0 2023-06-19 07:12:54,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=409974.0, ans=0.125 2023-06-19 07:12:59,472 INFO [train.py:996] (2/4) Epoch 3, batch 7350, loss[loss=0.2802, simple_loss=0.3326, pruned_loss=0.1139, over 21546.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3262, pruned_loss=0.1076, over 4272352.82 frames. ], batch size: 230, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:13:05,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=410034.0, ans=0.1 2023-06-19 07:13:31,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=410094.0, ans=0.0 2023-06-19 07:13:59,111 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 3.138e+02 3.671e+02 4.789e+02 1.075e+03, threshold=7.343e+02, percent-clipped=3.0 2023-06-19 07:14:36,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=410274.0, ans=0.0 2023-06-19 07:14:55,712 INFO [train.py:996] (2/4) Epoch 3, batch 7400, loss[loss=0.2427, simple_loss=0.3246, pruned_loss=0.08037, over 21675.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3342, pruned_loss=0.111, over 4278154.61 frames. 
], batch size: 298, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:16:13,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=410514.0, ans=0.0 2023-06-19 07:16:48,349 INFO [train.py:996] (2/4) Epoch 3, batch 7450, loss[loss=0.2614, simple_loss=0.321, pruned_loss=0.1009, over 21811.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3324, pruned_loss=0.1089, over 4282307.01 frames. ], batch size: 352, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:16:55,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=410634.0, ans=0.0 2023-06-19 07:16:57,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.16 vs. limit=15.0 2023-06-19 07:17:02,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=410634.0, ans=0.2 2023-06-19 07:17:25,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=410754.0, ans=0.0 2023-06-19 07:17:27,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=410754.0, ans=0.125 2023-06-19 07:17:41,922 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.990e+02 3.597e+02 4.537e+02 7.540e+02, threshold=7.195e+02, percent-clipped=1.0 2023-06-19 07:18:27,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=410874.0, ans=0.125 2023-06-19 07:18:35,626 INFO [train.py:996] (2/4) Epoch 3, batch 7500, loss[loss=0.2861, simple_loss=0.3797, pruned_loss=0.09624, over 21901.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3373, pruned_loss=0.1104, over 4279523.12 frames. ], batch size: 317, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:20:24,687 INFO [train.py:996] (2/4) Epoch 3, batch 7550, loss[loss=0.2535, simple_loss=0.3274, pruned_loss=0.08978, over 21801.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3432, pruned_loss=0.1078, over 4281455.41 frames. ], batch size: 118, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:20:26,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=411234.0, ans=0.125 2023-06-19 07:21:22,339 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 3.181e+02 3.735e+02 4.565e+02 8.412e+02, threshold=7.470e+02, percent-clipped=4.0 2023-06-19 07:21:22,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=411354.0, ans=0.035 2023-06-19 07:21:28,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2023-06-19 07:21:31,565 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=12.74 vs. limit=15.0 2023-06-19 07:22:09,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-19 07:22:09,832 INFO [train.py:996] (2/4) Epoch 3, batch 7600, loss[loss=0.2544, simple_loss=0.3237, pruned_loss=0.09258, over 21447.00 frames. 
], tot_loss[loss=0.2772, simple_loss=0.3416, pruned_loss=0.1064, over 4280713.83 frames. ], batch size: 211, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:22:11,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=15.0 2023-06-19 07:22:12,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=411534.0, ans=0.0 2023-06-19 07:22:16,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-19 07:23:06,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=411654.0, ans=0.2 2023-06-19 07:23:10,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=8.0 2023-06-19 07:23:17,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.87 vs. limit=15.0 2023-06-19 07:23:47,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=411774.0, ans=0.025 2023-06-19 07:23:55,486 INFO [train.py:996] (2/4) Epoch 3, batch 7650, loss[loss=0.2642, simple_loss=0.3198, pruned_loss=0.1043, over 21783.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3411, pruned_loss=0.1083, over 4286963.40 frames. ], batch size: 247, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:24:02,753 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:24:54,783 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.811e+02 3.352e+02 3.855e+02 5.541e+02, threshold=6.704e+02, percent-clipped=0.0 2023-06-19 07:25:15,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412014.0, ans=0.1 2023-06-19 07:25:43,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=412134.0, ans=0.05 2023-06-19 07:25:44,618 INFO [train.py:996] (2/4) Epoch 3, batch 7700, loss[loss=0.3501, simple_loss=0.4102, pruned_loss=0.145, over 21806.00 frames. ], tot_loss[loss=0.2849, simple_loss=0.3457, pruned_loss=0.1121, over 4285176.38 frames. 
], batch size: 118, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:25:57,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=412134.0, ans=0.125 2023-06-19 07:26:02,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=412194.0, ans=0.1 2023-06-19 07:26:02,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=412194.0, ans=0.0 2023-06-19 07:26:29,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=412254.0, ans=0.2 2023-06-19 07:26:48,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=412254.0, ans=0.2 2023-06-19 07:27:31,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.53 vs. limit=12.0 2023-06-19 07:27:32,032 INFO [train.py:996] (2/4) Epoch 3, batch 7750, loss[loss=0.3185, simple_loss=0.4087, pruned_loss=0.1142, over 21790.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.352, pruned_loss=0.1128, over 4271807.75 frames. ], batch size: 282, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:27:36,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=412434.0, ans=0.2 2023-06-19 07:28:45,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=412554.0, ans=0.2 2023-06-19 07:28:46,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 3.582e+02 4.541e+02 5.903e+02 1.038e+03, threshold=9.082e+02, percent-clipped=9.0 2023-06-19 07:29:03,153 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:29:11,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=412674.0, ans=0.125 2023-06-19 07:29:11,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=412674.0, ans=0.2 2023-06-19 07:29:24,313 INFO [train.py:996] (2/4) Epoch 3, batch 7800, loss[loss=0.2954, simple_loss=0.3649, pruned_loss=0.113, over 21563.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3506, pruned_loss=0.1125, over 4267043.93 frames. ], batch size: 441, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:29:37,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=412734.0, ans=0.0 2023-06-19 07:29:42,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=412734.0, ans=0.2 2023-06-19 07:29:43,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=412734.0, ans=0.125 2023-06-19 07:31:13,620 INFO [train.py:996] (2/4) Epoch 3, batch 7850, loss[loss=0.2775, simple_loss=0.3276, pruned_loss=0.1137, over 19974.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3445, pruned_loss=0.1128, over 4257150.89 frames. ], batch size: 703, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:31:16,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.59 vs. 
limit=15.0 2023-06-19 07:31:22,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=413034.0, ans=0.125 2023-06-19 07:31:55,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0 2023-06-19 07:32:23,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 3.181e+02 3.685e+02 4.397e+02 7.326e+02, threshold=7.370e+02, percent-clipped=0.0 2023-06-19 07:33:08,242 INFO [train.py:996] (2/4) Epoch 3, batch 7900, loss[loss=0.2741, simple_loss=0.3341, pruned_loss=0.107, over 21601.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3398, pruned_loss=0.1114, over 4258711.43 frames. ], batch size: 230, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:33:26,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=413334.0, ans=0.95 2023-06-19 07:33:57,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=413454.0, ans=0.07 2023-06-19 07:34:56,195 INFO [train.py:996] (2/4) Epoch 3, batch 7950, loss[loss=0.2679, simple_loss=0.3585, pruned_loss=0.08859, over 21665.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3455, pruned_loss=0.1104, over 4260683.17 frames. ], batch size: 389, lr: 1.15e-02, grad_scale: 16.0 2023-06-19 07:35:01,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=413634.0, ans=0.125 2023-06-19 07:35:17,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=413694.0, ans=0.0 2023-06-19 07:35:47,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=413754.0, ans=12.0 2023-06-19 07:35:55,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=413754.0, ans=0.1 2023-06-19 07:35:56,714 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.292e+02 2.859e+02 3.738e+02 4.773e+02 1.037e+03, threshold=7.477e+02, percent-clipped=3.0 2023-06-19 07:36:40,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=413874.0, ans=0.0 2023-06-19 07:36:44,352 INFO [train.py:996] (2/4) Epoch 3, batch 8000, loss[loss=0.346, simple_loss=0.4007, pruned_loss=0.1456, over 21435.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3507, pruned_loss=0.113, over 4258174.16 frames. 
], batch size: 471, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:37:17,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=413994.0, ans=0.04949747468305833 2023-06-19 07:38:00,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=414114.0, ans=0.125 2023-06-19 07:38:00,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=414114.0, ans=0.125 2023-06-19 07:38:16,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=414114.0, ans=0.95 2023-06-19 07:38:40,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=414174.0, ans=0.1 2023-06-19 07:38:46,684 INFO [train.py:996] (2/4) Epoch 3, batch 8050, loss[loss=0.2166, simple_loss=0.2614, pruned_loss=0.08591, over 21869.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3536, pruned_loss=0.1122, over 4256845.67 frames. ], batch size: 107, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:38:57,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=414234.0, ans=0.125 2023-06-19 07:39:01,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=414234.0, ans=0.2 2023-06-19 07:39:47,878 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 3.438e+02 3.985e+02 5.129e+02 7.856e+02, threshold=7.969e+02, percent-clipped=2.0 2023-06-19 07:40:14,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=414474.0, ans=10.0 2023-06-19 07:40:35,190 INFO [train.py:996] (2/4) Epoch 3, batch 8100, loss[loss=0.2722, simple_loss=0.333, pruned_loss=0.1057, over 21880.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3526, pruned_loss=0.1133, over 4265679.58 frames. ], batch size: 124, lr: 1.15e-02, grad_scale: 32.0 2023-06-19 07:40:44,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=414534.0, ans=0.0 2023-06-19 07:40:50,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=414534.0, ans=0.125 2023-06-19 07:42:08,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=414774.0, ans=0.125 2023-06-19 07:42:24,530 INFO [train.py:996] (2/4) Epoch 3, batch 8150, loss[loss=0.3212, simple_loss=0.4174, pruned_loss=0.1124, over 21555.00 frames. ], tot_loss[loss=0.2953, simple_loss=0.3612, pruned_loss=0.1147, over 4266113.80 frames. 
], batch size: 441, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:42:38,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=414834.0, ans=0.125 2023-06-19 07:43:14,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=414954.0, ans=0.0 2023-06-19 07:43:38,474 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.875e+02 3.410e+02 4.043e+02 9.100e+02, threshold=6.821e+02, percent-clipped=2.0 2023-06-19 07:44:13,637 INFO [train.py:996] (2/4) Epoch 3, batch 8200, loss[loss=0.2215, simple_loss=0.2637, pruned_loss=0.08965, over 16038.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.3517, pruned_loss=0.111, over 4259107.37 frames. ], batch size: 63, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:44:14,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=415134.0, ans=10.0 2023-06-19 07:44:29,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=415134.0, ans=0.5 2023-06-19 07:44:30,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415134.0, ans=0.1 2023-06-19 07:44:40,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=415194.0, ans=0.0 2023-06-19 07:45:27,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415314.0, ans=0.1 2023-06-19 07:45:58,717 INFO [train.py:996] (2/4) Epoch 3, batch 8250, loss[loss=0.2575, simple_loss=0.3346, pruned_loss=0.09021, over 21441.00 frames. ], tot_loss[loss=0.285, simple_loss=0.3499, pruned_loss=0.1101, over 4256934.90 frames. ], batch size: 194, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:46:08,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.58 vs. limit=6.0 2023-06-19 07:46:57,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=415554.0, ans=0.2 2023-06-19 07:47:10,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=415554.0, ans=0.0 2023-06-19 07:47:11,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=415614.0, ans=0.2 2023-06-19 07:47:12,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. 
limit=10.0 2023-06-19 07:47:12,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.026e+02 3.791e+02 5.480e+02 8.265e+02, threshold=7.583e+02, percent-clipped=10.0 2023-06-19 07:47:14,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=415614.0, ans=0.1 2023-06-19 07:47:14,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=415614.0, ans=0.1 2023-06-19 07:47:20,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=415614.0, ans=0.0 2023-06-19 07:47:52,042 INFO [train.py:996] (2/4) Epoch 3, batch 8300, loss[loss=0.2894, simple_loss=0.3653, pruned_loss=0.1067, over 21728.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.3482, pruned_loss=0.1066, over 4266866.25 frames. ], batch size: 332, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:48:49,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=415854.0, ans=0.0 2023-06-19 07:48:51,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=415854.0, ans=0.125 2023-06-19 07:49:23,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=415974.0, ans=0.0 2023-06-19 07:49:38,486 INFO [train.py:996] (2/4) Epoch 3, batch 8350, loss[loss=0.2585, simple_loss=0.3399, pruned_loss=0.08852, over 21401.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3479, pruned_loss=0.1045, over 4268516.08 frames. ], batch size: 211, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:50:05,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. limit=6.0 2023-06-19 07:50:38,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=416154.0, ans=0.0 2023-06-19 07:50:45,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=416214.0, ans=0.1 2023-06-19 07:50:46,337 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.962e+02 3.800e+02 4.795e+02 8.641e+02, threshold=7.601e+02, percent-clipped=3.0 2023-06-19 07:51:16,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=416274.0, ans=0.2 2023-06-19 07:51:23,701 INFO [train.py:996] (2/4) Epoch 3, batch 8400, loss[loss=0.2546, simple_loss=0.3452, pruned_loss=0.08201, over 21216.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3439, pruned_loss=0.1015, over 4272352.07 frames. ], batch size: 548, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 07:51:45,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.79 vs. 
limit=12.0 2023-06-19 07:51:58,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=416394.0, ans=0.125 2023-06-19 07:52:39,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=416514.0, ans=0.125 2023-06-19 07:53:07,350 INFO [train.py:996] (2/4) Epoch 3, batch 8450, loss[loss=0.2834, simple_loss=0.3348, pruned_loss=0.116, over 21805.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3413, pruned_loss=0.1017, over 4278398.41 frames. ], batch size: 298, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 07:53:27,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=416634.0, ans=0.125 2023-06-19 07:53:29,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=416694.0, ans=0.1 2023-06-19 07:53:36,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-19 07:54:07,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=416754.0, ans=15.0 2023-06-19 07:54:11,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-19 07:54:12,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=416754.0, ans=0.0 2023-06-19 07:54:15,146 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.539e+02 3.236e+02 3.912e+02 6.365e+02, threshold=6.471e+02, percent-clipped=0.0 2023-06-19 07:54:20,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=416814.0, ans=0.125 2023-06-19 07:54:59,980 INFO [train.py:996] (2/4) Epoch 3, batch 8500, loss[loss=0.3702, simple_loss=0.4689, pruned_loss=0.1358, over 20845.00 frames. ], tot_loss[loss=0.2735, simple_loss=0.3389, pruned_loss=0.1041, over 4279932.13 frames. ], batch size: 607, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 07:55:05,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=416934.0, ans=0.125 2023-06-19 07:55:28,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-19 07:55:29,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.40 vs. limit=22.5 2023-06-19 07:55:55,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=417054.0, ans=0.1 2023-06-19 07:56:09,451 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 07:56:14,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=417114.0, ans=0.95 2023-06-19 07:56:35,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. 
limit=15.0 2023-06-19 07:56:48,462 INFO [train.py:996] (2/4) Epoch 3, batch 8550, loss[loss=0.3158, simple_loss=0.3953, pruned_loss=0.1182, over 21243.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3459, pruned_loss=0.1091, over 4274551.11 frames. ], batch size: 548, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:56:55,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=417234.0, ans=0.0 2023-06-19 07:57:04,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=417294.0, ans=0.125 2023-06-19 07:57:31,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=417354.0, ans=0.125 2023-06-19 07:57:37,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=417354.0, ans=0.1 2023-06-19 07:57:46,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=417414.0, ans=0.1 2023-06-19 07:57:47,747 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.294e+02 4.189e+02 5.052e+02 1.014e+03, threshold=8.378e+02, percent-clipped=9.0 2023-06-19 07:58:22,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=417474.0, ans=10.0 2023-06-19 07:58:32,111 INFO [train.py:996] (2/4) Epoch 3, batch 8600, loss[loss=0.2543, simple_loss=0.3307, pruned_loss=0.089, over 21396.00 frames. ], tot_loss[loss=0.287, simple_loss=0.3528, pruned_loss=0.1106, over 4277877.93 frames. ], batch size: 211, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 07:58:52,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=417534.0, ans=0.2 2023-06-19 08:00:07,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=417774.0, ans=0.125 2023-06-19 08:00:20,159 INFO [train.py:996] (2/4) Epoch 3, batch 8650, loss[loss=0.1992, simple_loss=0.2749, pruned_loss=0.06173, over 21289.00 frames. ], tot_loss[loss=0.29, simple_loss=0.3565, pruned_loss=0.1118, over 4271267.93 frames. ], batch size: 194, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 08:00:47,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=417894.0, ans=0.125 2023-06-19 08:00:47,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.93 vs. limit=10.0 2023-06-19 08:00:57,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=417894.0, ans=0.125 2023-06-19 08:01:28,278 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.885e+02 3.395e+02 4.112e+02 7.467e+02, threshold=6.789e+02, percent-clipped=0.0 2023-06-19 08:01:30,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=418014.0, ans=0.125 2023-06-19 08:02:05,250 INFO [train.py:996] (2/4) Epoch 3, batch 8700, loss[loss=0.2813, simple_loss=0.3321, pruned_loss=0.1152, over 21443.00 frames. ], tot_loss[loss=0.2807, simple_loss=0.346, pruned_loss=0.1077, over 4261351.72 frames. 
], batch size: 389, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 08:02:23,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=418134.0, ans=0.125 2023-06-19 08:02:52,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-19 08:03:16,674 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:03:32,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=418314.0, ans=0.125 2023-06-19 08:03:33,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-19 08:03:36,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=418374.0, ans=0.0 2023-06-19 08:03:57,817 INFO [train.py:996] (2/4) Epoch 3, batch 8750, loss[loss=0.312, simple_loss=0.3606, pruned_loss=0.1317, over 21896.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3429, pruned_loss=0.1084, over 4264047.02 frames. ], batch size: 333, lr: 1.14e-02, grad_scale: 16.0 2023-06-19 08:03:59,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2023-06-19 08:04:03,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=418434.0, ans=0.125 2023-06-19 08:04:35,527 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-19 08:04:51,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=418554.0, ans=0.125 2023-06-19 08:05:06,663 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.171e+02 3.018e+02 3.630e+02 4.545e+02 8.299e+02, threshold=7.260e+02, percent-clipped=2.0 2023-06-19 08:05:12,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=418614.0, ans=0.2 2023-06-19 08:05:19,280 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:05:44,852 INFO [train.py:996] (2/4) Epoch 3, batch 8800, loss[loss=0.3331, simple_loss=0.3952, pruned_loss=0.1355, over 21524.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3515, pruned_loss=0.1119, over 4272240.81 frames. ], batch size: 194, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:07:12,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=418974.0, ans=0.0 2023-06-19 08:07:27,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=418974.0, ans=0.0 2023-06-19 08:07:42,001 INFO [train.py:996] (2/4) Epoch 3, batch 8850, loss[loss=0.27, simple_loss=0.3436, pruned_loss=0.09819, over 21574.00 frames. ], tot_loss[loss=0.2947, simple_loss=0.3597, pruned_loss=0.1149, over 4274269.82 frames. 
], batch size: 263, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:07:49,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=419034.0, ans=0.125 2023-06-19 08:08:02,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=419094.0, ans=0.0 2023-06-19 08:08:42,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=419214.0, ans=0.125 2023-06-19 08:08:46,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.554e+02 4.291e+02 5.667e+02 9.091e+02, threshold=8.581e+02, percent-clipped=5.0 2023-06-19 08:08:46,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=419214.0, ans=0.0 2023-06-19 08:09:20,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=419274.0, ans=0.0 2023-06-19 08:09:29,173 INFO [train.py:996] (2/4) Epoch 3, batch 8900, loss[loss=0.2403, simple_loss=0.2993, pruned_loss=0.09068, over 21406.00 frames. ], tot_loss[loss=0.291, simple_loss=0.3537, pruned_loss=0.1141, over 4279058.12 frames. ], batch size: 131, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:09:48,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=419394.0, ans=0.0 2023-06-19 08:09:56,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=419394.0, ans=0.0 2023-06-19 08:10:12,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=419394.0, ans=0.5 2023-06-19 08:10:48,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-19 08:10:54,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=419574.0, ans=0.125 2023-06-19 08:11:05,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=419574.0, ans=0.125 2023-06-19 08:11:18,166 INFO [train.py:996] (2/4) Epoch 3, batch 8950, loss[loss=0.2719, simple_loss=0.3277, pruned_loss=0.1081, over 21459.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3487, pruned_loss=0.1104, over 4267699.59 frames. ], batch size: 194, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:11:32,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.23 vs. limit=6.0 2023-06-19 08:11:38,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=12.0 2023-06-19 08:12:02,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=419754.0, ans=0.1 2023-06-19 08:12:27,666 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.199e+02 3.236e+02 4.208e+02 5.168e+02 9.134e+02, threshold=8.417e+02, percent-clipped=1.0 2023-06-19 08:12:30,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. 
limit=15.0 2023-06-19 08:13:05,028 INFO [train.py:996] (2/4) Epoch 3, batch 9000, loss[loss=0.2762, simple_loss=0.3255, pruned_loss=0.1135, over 21757.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3424, pruned_loss=0.1093, over 4266248.10 frames. ], batch size: 351, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:13:05,029 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 08:13:24,328 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2787, simple_loss=0.3793, pruned_loss=0.08906, over 1796401.00 frames. 2023-06-19 08:13:24,329 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 08:13:38,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419934.0, ans=0.1 2023-06-19 08:13:48,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=419994.0, ans=0.125 2023-06-19 08:13:48,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=419994.0, ans=0.95 2023-06-19 08:14:06,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=419994.0, ans=0.1 2023-06-19 08:14:33,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.42 vs. limit=10.0 2023-06-19 08:14:41,936 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:15:19,203 INFO [train.py:996] (2/4) Epoch 3, batch 9050, loss[loss=0.3961, simple_loss=0.4265, pruned_loss=0.1829, over 21334.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3393, pruned_loss=0.1061, over 4267355.56 frames. ], batch size: 507, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:15:52,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=420294.0, ans=0.05 2023-06-19 08:16:23,152 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 3.182e+02 3.849e+02 4.740e+02 7.257e+02, threshold=7.697e+02, percent-clipped=0.0 2023-06-19 08:16:54,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=420474.0, ans=0.125 2023-06-19 08:17:05,686 INFO [train.py:996] (2/4) Epoch 3, batch 9100, loss[loss=0.3071, simple_loss=0.3712, pruned_loss=0.1215, over 21205.00 frames. ], tot_loss[loss=0.2828, simple_loss=0.3465, pruned_loss=0.1096, over 4264211.21 frames. ], batch size: 143, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:17:06,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=420534.0, ans=0.2 2023-06-19 08:18:01,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=420654.0, ans=0.0 2023-06-19 08:18:51,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=420834.0, ans=0.125 2023-06-19 08:18:52,708 INFO [train.py:996] (2/4) Epoch 3, batch 9150, loss[loss=0.2598, simple_loss=0.3312, pruned_loss=0.09423, over 21442.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3498, pruned_loss=0.1068, over 4264462.58 frames. 
], batch size: 131, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:19:03,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=420834.0, ans=0.0 2023-06-19 08:19:21,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=420894.0, ans=0.2 2023-06-19 08:19:30,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=420894.0, ans=0.0 2023-06-19 08:19:39,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=420954.0, ans=0.125 2023-06-19 08:19:57,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=421014.0, ans=0.1 2023-06-19 08:19:58,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=15.0 2023-06-19 08:20:00,326 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 2.911e+02 3.357e+02 4.018e+02 6.144e+02, threshold=6.715e+02, percent-clipped=0.0 2023-06-19 08:20:14,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=421014.0, ans=0.1 2023-06-19 08:20:45,086 INFO [train.py:996] (2/4) Epoch 3, batch 9200, loss[loss=0.3106, simple_loss=0.3789, pruned_loss=0.1211, over 21873.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.3546, pruned_loss=0.1073, over 4264876.25 frames. ], batch size: 371, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:21:20,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=12.0 2023-06-19 08:21:40,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=421254.0, ans=0.05 2023-06-19 08:21:43,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=421254.0, ans=0.125 2023-06-19 08:21:45,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=421314.0, ans=0.125 2023-06-19 08:22:14,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=421374.0, ans=0.0 2023-06-19 08:22:31,113 INFO [train.py:996] (2/4) Epoch 3, batch 9250, loss[loss=0.3063, simple_loss=0.3707, pruned_loss=0.121, over 21428.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.358, pruned_loss=0.1112, over 4266364.36 frames. ], batch size: 131, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:23:16,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=421554.0, ans=0.0 2023-06-19 08:23:39,390 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 3.086e+02 3.800e+02 4.447e+02 7.339e+02, threshold=7.599e+02, percent-clipped=2.0 2023-06-19 08:23:40,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=421614.0, ans=0.0 2023-06-19 08:24:17,022 INFO [train.py:996] (2/4) Epoch 3, batch 9300, loss[loss=0.3501, simple_loss=0.4139, pruned_loss=0.1431, over 21575.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3526, pruned_loss=0.1109, over 4266336.59 frames. 
], batch size: 414, lr: 1.14e-02, grad_scale: 32.0 2023-06-19 08:24:21,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=22.5 2023-06-19 08:24:47,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=421794.0, ans=0.0 2023-06-19 08:24:52,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=421794.0, ans=0.125 2023-06-19 08:24:55,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=421854.0, ans=0.125 2023-06-19 08:25:54,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=421974.0, ans=0.125 2023-06-19 08:26:11,701 INFO [train.py:996] (2/4) Epoch 3, batch 9350, loss[loss=0.3038, simple_loss=0.3684, pruned_loss=0.1196, over 21639.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3576, pruned_loss=0.1107, over 4267244.77 frames. ], batch size: 263, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:26:22,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=422034.0, ans=0.125 2023-06-19 08:26:24,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=422034.0, ans=0.0 2023-06-19 08:27:19,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=422214.0, ans=0.125 2023-06-19 08:27:20,734 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 3.071e+02 3.690e+02 4.644e+02 6.944e+02, threshold=7.381e+02, percent-clipped=0.0 2023-06-19 08:27:44,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=422274.0, ans=0.0 2023-06-19 08:27:59,156 INFO [train.py:996] (2/4) Epoch 3, batch 9400, loss[loss=0.243, simple_loss=0.3025, pruned_loss=0.09177, over 21686.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3581, pruned_loss=0.1111, over 4262333.56 frames. ], batch size: 282, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:29:25,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5 2023-06-19 08:29:44,004 INFO [train.py:996] (2/4) Epoch 3, batch 9450, loss[loss=0.2471, simple_loss=0.3042, pruned_loss=0.09499, over 21991.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3483, pruned_loss=0.1091, over 4264308.00 frames. ], batch size: 103, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:29:44,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=422634.0, ans=0.0 2023-06-19 08:30:43,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. limit=12.0 2023-06-19 08:30:51,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 3.147e+02 3.732e+02 4.957e+02 8.626e+02, threshold=7.464e+02, percent-clipped=5.0 2023-06-19 08:31:28,562 INFO [train.py:996] (2/4) Epoch 3, batch 9500, loss[loss=0.2397, simple_loss=0.2958, pruned_loss=0.09177, over 21332.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3406, pruned_loss=0.1071, over 4272099.18 frames. 
], batch size: 211, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:31:35,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-19 08:31:54,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-19 08:32:11,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=422994.0, ans=0.125 2023-06-19 08:32:30,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.47 vs. limit=15.0 2023-06-19 08:33:14,264 INFO [train.py:996] (2/4) Epoch 3, batch 9550, loss[loss=0.3306, simple_loss=0.3918, pruned_loss=0.1347, over 21471.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3469, pruned_loss=0.1099, over 4263885.04 frames. ], batch size: 159, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:33:19,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=423234.0, ans=0.2 2023-06-19 08:33:29,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.16 vs. limit=12.0 2023-06-19 08:33:32,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=423234.0, ans=0.125 2023-06-19 08:33:55,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=423294.0, ans=0.125 2023-06-19 08:34:20,858 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.824e+02 3.274e+02 3.853e+02 7.090e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-19 08:34:45,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=423474.0, ans=0.0 2023-06-19 08:34:50,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=423474.0, ans=0.125 2023-06-19 08:34:55,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=423474.0, ans=0.125 2023-06-19 08:34:58,280 INFO [train.py:996] (2/4) Epoch 3, batch 9600, loss[loss=0.2922, simple_loss=0.3413, pruned_loss=0.1216, over 21846.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3506, pruned_loss=0.1123, over 4270869.50 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:35:26,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=423594.0, ans=0.125 2023-06-19 08:35:28,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=22.5 2023-06-19 08:35:41,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=423654.0, ans=0.2 2023-06-19 08:36:18,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=423714.0, ans=0.125 2023-06-19 08:36:45,584 INFO [train.py:996] (2/4) Epoch 3, batch 9650, loss[loss=0.3325, simple_loss=0.3873, pruned_loss=0.1389, over 21227.00 frames. 
], tot_loss[loss=0.2889, simple_loss=0.3526, pruned_loss=0.1126, over 4273067.76 frames. ], batch size: 143, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:37:03,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=423834.0, ans=0.2 2023-06-19 08:37:11,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=423894.0, ans=0.07 2023-06-19 08:37:44,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=423954.0, ans=0.125 2023-06-19 08:37:58,447 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.232e+02 3.864e+02 5.587e+02 9.927e+02, threshold=7.728e+02, percent-clipped=9.0 2023-06-19 08:38:00,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=424014.0, ans=0.1 2023-06-19 08:38:08,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=424014.0, ans=0.1 2023-06-19 08:38:13,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=424074.0, ans=0.0 2023-06-19 08:38:40,974 INFO [train.py:996] (2/4) Epoch 3, batch 9700, loss[loss=0.359, simple_loss=0.4047, pruned_loss=0.1567, over 21384.00 frames. ], tot_loss[loss=0.292, simple_loss=0.3561, pruned_loss=0.1139, over 4267367.57 frames. ], batch size: 507, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:38:51,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=424134.0, ans=0.125 2023-06-19 08:39:40,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.83 vs. limit=22.5 2023-06-19 08:39:44,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=424314.0, ans=0.125 2023-06-19 08:40:18,524 INFO [train.py:996] (2/4) Epoch 3, batch 9750, loss[loss=0.2643, simple_loss=0.3179, pruned_loss=0.1053, over 21839.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.3502, pruned_loss=0.1125, over 4262888.44 frames. ], batch size: 98, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:40:55,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.46 vs. limit=15.0 2023-06-19 08:41:00,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.50 vs. limit=6.0 2023-06-19 08:41:18,623 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 3.037e+02 3.853e+02 4.411e+02 7.266e+02, threshold=7.707e+02, percent-clipped=0.0 2023-06-19 08:41:51,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=424674.0, ans=0.015 2023-06-19 08:41:55,561 INFO [train.py:996] (2/4) Epoch 3, batch 9800, loss[loss=0.3213, simple_loss=0.3611, pruned_loss=0.1407, over 21800.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.348, pruned_loss=0.1115, over 4247232.18 frames. 
], batch size: 510, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:42:21,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=424794.0, ans=0.07 2023-06-19 08:42:24,866 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:42:33,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=424794.0, ans=0.125 2023-06-19 08:43:25,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.59 vs. limit=15.0 2023-06-19 08:43:37,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=425034.0, ans=10.0 2023-06-19 08:43:39,056 INFO [train.py:996] (2/4) Epoch 3, batch 9850, loss[loss=0.2566, simple_loss=0.3064, pruned_loss=0.1034, over 21813.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3449, pruned_loss=0.1115, over 4254662.02 frames. ], batch size: 98, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:44:10,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=425094.0, ans=0.2 2023-06-19 08:44:50,383 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.794e+02 3.273e+02 4.007e+02 7.022e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-19 08:45:21,559 INFO [train.py:996] (2/4) Epoch 3, batch 9900, loss[loss=0.2464, simple_loss=0.3128, pruned_loss=0.09, over 21671.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3383, pruned_loss=0.1097, over 4261853.55 frames. ], batch size: 298, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:45:22,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=425334.0, ans=0.0 2023-06-19 08:46:28,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=425454.0, ans=10.0 2023-06-19 08:46:29,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=425454.0, ans=0.125 2023-06-19 08:46:46,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=425514.0, ans=0.0 2023-06-19 08:46:46,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=425514.0, ans=0.0 2023-06-19 08:46:51,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=425574.0, ans=0.125 2023-06-19 08:46:53,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-19 08:47:10,948 INFO [train.py:996] (2/4) Epoch 3, batch 9950, loss[loss=0.3573, simple_loss=0.4011, pruned_loss=0.1567, over 21407.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3408, pruned_loss=0.1121, over 4260792.49 frames. 
], batch size: 471, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:48:18,527 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.167e+02 3.027e+02 3.494e+02 4.278e+02 9.586e+02, threshold=6.989e+02, percent-clipped=3.0 2023-06-19 08:48:26,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=425814.0, ans=10.0 2023-06-19 08:49:03,430 INFO [train.py:996] (2/4) Epoch 3, batch 10000, loss[loss=0.3058, simple_loss=0.3616, pruned_loss=0.1251, over 21924.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3354, pruned_loss=0.1103, over 4272687.78 frames. ], batch size: 372, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 08:50:45,790 INFO [train.py:996] (2/4) Epoch 3, batch 10050, loss[loss=0.2718, simple_loss=0.3415, pruned_loss=0.1011, over 21596.00 frames. ], tot_loss[loss=0.2824, simple_loss=0.3406, pruned_loss=0.1121, over 4275038.40 frames. ], batch size: 441, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:51:09,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=426294.0, ans=0.125 2023-06-19 08:51:44,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=426354.0, ans=0.025 2023-06-19 08:51:50,945 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.885e+02 3.512e+02 4.083e+02 6.660e+02, threshold=7.024e+02, percent-clipped=0.0 2023-06-19 08:52:19,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=426474.0, ans=0.1 2023-06-19 08:52:37,746 INFO [train.py:996] (2/4) Epoch 3, batch 10100, loss[loss=0.296, simple_loss=0.3496, pruned_loss=0.1212, over 20048.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.337, pruned_loss=0.1096, over 4278061.84 frames. ], batch size: 702, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:52:48,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=426534.0, ans=0.2 2023-06-19 08:53:02,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=426594.0, ans=0.0 2023-06-19 08:53:14,514 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 08:54:21,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. limit=15.0 2023-06-19 08:54:23,750 INFO [train.py:996] (2/4) Epoch 3, batch 10150, loss[loss=0.3074, simple_loss=0.3697, pruned_loss=0.1226, over 21636.00 frames. ], tot_loss[loss=0.2877, simple_loss=0.3469, pruned_loss=0.1142, over 4280286.59 frames. 
], batch size: 441, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:55:35,213 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.064e+02 3.671e+02 4.442e+02 6.348e+02, threshold=7.343e+02, percent-clipped=0.0 2023-06-19 08:55:42,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=427014.0, ans=0.1 2023-06-19 08:55:55,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=427074.0, ans=0.0 2023-06-19 08:56:09,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=427134.0, ans=0.1 2023-06-19 08:56:10,029 INFO [train.py:996] (2/4) Epoch 3, batch 10200, loss[loss=0.2745, simple_loss=0.3493, pruned_loss=0.09985, over 21700.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3427, pruned_loss=0.1096, over 4277373.58 frames. ], batch size: 247, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:56:12,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=427134.0, ans=10.0 2023-06-19 08:56:38,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.19 vs. limit=15.0 2023-06-19 08:57:57,564 INFO [train.py:996] (2/4) Epoch 3, batch 10250, loss[loss=0.2525, simple_loss=0.3323, pruned_loss=0.08636, over 21892.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3369, pruned_loss=0.1018, over 4281522.09 frames. ], batch size: 317, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 08:57:58,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.52 vs. limit=6.0 2023-06-19 08:59:07,977 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.673e+02 2.442e+02 2.730e+02 3.285e+02 6.537e+02, threshold=5.460e+02, percent-clipped=0.0 2023-06-19 08:59:39,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=427674.0, ans=0.0 2023-06-19 08:59:43,927 INFO [train.py:996] (2/4) Epoch 3, batch 10300, loss[loss=0.2938, simple_loss=0.3674, pruned_loss=0.1101, over 21329.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3399, pruned_loss=0.1031, over 4285595.55 frames. ], batch size: 549, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 09:00:00,960 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-19 09:00:20,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=427794.0, ans=0.125 2023-06-19 09:00:50,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.27 vs. 
limit=15.0 2023-06-19 09:00:53,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=427914.0, ans=0.0 2023-06-19 09:00:53,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=427914.0, ans=0.1 2023-06-19 09:01:06,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=427914.0, ans=0.125 2023-06-19 09:01:11,841 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:01:19,634 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:01:38,598 INFO [train.py:996] (2/4) Epoch 3, batch 10350, loss[loss=0.2756, simple_loss=0.3535, pruned_loss=0.09881, over 21639.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3412, pruned_loss=0.1033, over 4284061.23 frames. ], batch size: 414, lr: 1.13e-02, grad_scale: 16.0 2023-06-19 09:01:39,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-06-19 09:01:47,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=428034.0, ans=0.2 2023-06-19 09:02:20,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=428094.0, ans=0.0 2023-06-19 09:02:34,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=428154.0, ans=0.2 2023-06-19 09:02:50,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.191e+02 3.696e+02 4.662e+02 9.387e+02, threshold=7.392e+02, percent-clipped=8.0 2023-06-19 09:03:07,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.92 vs. limit=10.0 2023-06-19 09:03:26,634 INFO [train.py:996] (2/4) Epoch 3, batch 10400, loss[loss=0.1904, simple_loss=0.2364, pruned_loss=0.07222, over 21118.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3345, pruned_loss=0.1015, over 4278765.52 frames. ], batch size: 143, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:03:57,763 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:04:03,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=428394.0, ans=0.0 2023-06-19 09:04:52,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=428514.0, ans=0.125 2023-06-19 09:05:19,084 INFO [train.py:996] (2/4) Epoch 3, batch 10450, loss[loss=0.3284, simple_loss=0.3953, pruned_loss=0.1307, over 21837.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.341, pruned_loss=0.1062, over 4275450.56 frames. 
], batch size: 371, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:05:25,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=428634.0, ans=0.2 2023-06-19 09:06:28,887 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.522e+02 4.182e+02 5.744e+02 1.036e+03, threshold=8.363e+02, percent-clipped=11.0 2023-06-19 09:06:31,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-19 09:06:46,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=6.0 2023-06-19 09:06:46,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=20.65 vs. limit=22.5 2023-06-19 09:07:01,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=428874.0, ans=0.125 2023-06-19 09:07:03,336 INFO [train.py:996] (2/4) Epoch 3, batch 10500, loss[loss=0.2804, simple_loss=0.3516, pruned_loss=0.1046, over 20040.00 frames. ], tot_loss[loss=0.2745, simple_loss=0.3399, pruned_loss=0.1046, over 4265892.01 frames. ], batch size: 704, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:07:26,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=428994.0, ans=0.035 2023-06-19 09:07:29,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-19 09:08:17,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=429114.0, ans=0.0 2023-06-19 09:08:28,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=429174.0, ans=0.2 2023-06-19 09:08:33,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=429174.0, ans=0.95 2023-06-19 09:08:49,604 INFO [train.py:996] (2/4) Epoch 3, batch 10550, loss[loss=0.2502, simple_loss=0.318, pruned_loss=0.09121, over 21839.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3338, pruned_loss=0.1047, over 4262477.29 frames. ], batch size: 98, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:09:00,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=429234.0, ans=0.0 2023-06-19 09:09:02,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=429234.0, ans=0.125 2023-06-19 09:09:13,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=429294.0, ans=0.1 2023-06-19 09:09:36,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.24 vs. 
limit=10.0 2023-06-19 09:09:54,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=429354.0, ans=0.125 2023-06-19 09:10:00,979 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.021e+02 2.936e+02 3.551e+02 4.371e+02 5.985e+02, threshold=7.102e+02, percent-clipped=0.0 2023-06-19 09:10:06,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=429414.0, ans=0.125 2023-06-19 09:10:36,948 INFO [train.py:996] (2/4) Epoch 3, batch 10600, loss[loss=0.2554, simple_loss=0.3315, pruned_loss=0.08959, over 21678.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3302, pruned_loss=0.1029, over 4263893.93 frames. ], batch size: 298, lr: 1.13e-02, grad_scale: 32.0 2023-06-19 09:10:48,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=429534.0, ans=0.2 2023-06-19 09:10:52,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=429534.0, ans=0.1 2023-06-19 09:11:07,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=429594.0, ans=0.0 2023-06-19 09:12:27,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=22.5 2023-06-19 09:12:31,308 INFO [train.py:996] (2/4) Epoch 3, batch 10650, loss[loss=0.1816, simple_loss=0.2365, pruned_loss=0.06337, over 21237.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3312, pruned_loss=0.1001, over 4262351.95 frames. ], batch size: 143, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:13:21,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=429954.0, ans=0.2 2023-06-19 09:13:29,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=429954.0, ans=0.015 2023-06-19 09:13:41,874 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.491e+02 4.438e+02 6.074e+02 1.034e+03, threshold=8.876e+02, percent-clipped=13.0 2023-06-19 09:14:12,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-19 09:14:17,998 INFO [train.py:996] (2/4) Epoch 3, batch 10700, loss[loss=0.281, simple_loss=0.3392, pruned_loss=0.1114, over 21469.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3294, pruned_loss=0.09965, over 4261535.56 frames. ], batch size: 194, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:15:12,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=430254.0, ans=0.125 2023-06-19 09:15:25,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=430314.0, ans=0.04949747468305833 2023-06-19 09:15:33,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=430314.0, ans=0.125 2023-06-19 09:15:50,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. 
limit=15.0 2023-06-19 09:16:09,167 INFO [train.py:996] (2/4) Epoch 3, batch 10750, loss[loss=0.3283, simple_loss=0.4231, pruned_loss=0.1167, over 19854.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.343, pruned_loss=0.106, over 4266846.11 frames. ], batch size: 702, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:16:47,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=430554.0, ans=0.125 2023-06-19 09:16:50,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=430554.0, ans=0.125 2023-06-19 09:17:02,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=430554.0, ans=0.125 2023-06-19 09:17:14,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=430614.0, ans=0.125 2023-06-19 09:17:16,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=430614.0, ans=0.0 2023-06-19 09:17:19,512 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.011e+02 3.555e+02 4.505e+02 9.587e+02, threshold=7.110e+02, percent-clipped=1.0 2023-06-19 09:17:26,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=430614.0, ans=0.07 2023-06-19 09:17:38,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=430674.0, ans=0.125 2023-06-19 09:17:40,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=430674.0, ans=0.0 2023-06-19 09:18:01,344 INFO [train.py:996] (2/4) Epoch 3, batch 10800, loss[loss=0.2933, simple_loss=0.3712, pruned_loss=0.1077, over 19901.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3481, pruned_loss=0.1073, over 4265625.41 frames. ], batch size: 702, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:18:43,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.91 vs. limit=10.0 2023-06-19 09:18:48,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.26 vs. limit=15.0 2023-06-19 09:19:42,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=430974.0, ans=0.125 2023-06-19 09:19:48,728 INFO [train.py:996] (2/4) Epoch 3, batch 10850, loss[loss=0.2596, simple_loss=0.3152, pruned_loss=0.1019, over 21766.00 frames. ], tot_loss[loss=0.2817, simple_loss=0.3484, pruned_loss=0.1075, over 4268806.96 frames. ], batch size: 102, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:20:33,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. 
limit=22.5 2023-06-19 09:20:56,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=431214.0, ans=0.125 2023-06-19 09:20:59,264 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.993e+02 3.612e+02 4.329e+02 8.050e+02, threshold=7.223e+02, percent-clipped=3.0 2023-06-19 09:21:01,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=431214.0, ans=0.2 2023-06-19 09:21:05,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-19 09:21:37,059 INFO [train.py:996] (2/4) Epoch 3, batch 10900, loss[loss=0.3043, simple_loss=0.3858, pruned_loss=0.1114, over 21406.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3408, pruned_loss=0.105, over 4276695.40 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:22:00,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=431394.0, ans=0.2 2023-06-19 09:22:28,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=15.0 2023-06-19 09:22:52,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=431514.0, ans=0.0 2023-06-19 09:23:05,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=431574.0, ans=0.125 2023-06-19 09:23:08,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=431574.0, ans=0.0 2023-06-19 09:23:22,707 INFO [train.py:996] (2/4) Epoch 3, batch 10950, loss[loss=0.2641, simple_loss=0.3112, pruned_loss=0.1085, over 21450.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3336, pruned_loss=0.1029, over 4269698.49 frames. ], batch size: 441, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:23:47,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=431694.0, ans=0.1 2023-06-19 09:24:04,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=431754.0, ans=0.0 2023-06-19 09:24:30,160 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.913e+02 3.689e+02 4.516e+02 9.090e+02, threshold=7.379e+02, percent-clipped=2.0 2023-06-19 09:24:30,894 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 09:24:39,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=431814.0, ans=0.125 2023-06-19 09:25:07,368 INFO [train.py:996] (2/4) Epoch 3, batch 11000, loss[loss=0.334, simple_loss=0.3702, pruned_loss=0.1489, over 21751.00 frames. ], tot_loss[loss=0.271, simple_loss=0.334, pruned_loss=0.104, over 4277983.87 frames. 
], batch size: 441, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:25:37,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=431994.0, ans=0.0 2023-06-19 09:25:41,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=431994.0, ans=0.1 2023-06-19 09:25:43,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=431994.0, ans=0.125 2023-06-19 09:26:02,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432054.0, ans=0.1 2023-06-19 09:26:05,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-19 09:26:33,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=432174.0, ans=0.0 2023-06-19 09:26:53,514 INFO [train.py:996] (2/4) Epoch 3, batch 11050, loss[loss=0.2533, simple_loss=0.3046, pruned_loss=0.101, over 21594.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3324, pruned_loss=0.1054, over 4283515.25 frames. ], batch size: 298, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:27:46,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-19 09:27:55,636 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.002e+02 3.534e+02 4.546e+02 1.059e+03, threshold=7.067e+02, percent-clipped=5.0 2023-06-19 09:27:55,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=432414.0, ans=0.0 2023-06-19 09:28:06,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=432414.0, ans=0.125 2023-06-19 09:28:16,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=432474.0, ans=0.125 2023-06-19 09:28:29,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=432474.0, ans=0.125 2023-06-19 09:28:36,742 INFO [train.py:996] (2/4) Epoch 3, batch 11100, loss[loss=0.3271, simple_loss=0.3933, pruned_loss=0.1305, over 21407.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3312, pruned_loss=0.1062, over 4286253.04 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:28:49,723 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-19 09:28:52,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=432594.0, ans=0.125 2023-06-19 09:28:55,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=432594.0, ans=0.125 2023-06-19 09:29:24,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432654.0, ans=0.1 2023-06-19 09:30:23,551 INFO [train.py:996] (2/4) Epoch 3, batch 11150, loss[loss=0.2501, simple_loss=0.3201, pruned_loss=0.09007, over 21512.00 frames. 
], tot_loss[loss=0.2697, simple_loss=0.3286, pruned_loss=0.1054, over 4271238.96 frames. ], batch size: 230, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:30:25,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=432834.0, ans=0.2 2023-06-19 09:30:39,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=432834.0, ans=0.1 2023-06-19 09:30:48,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-19 09:31:10,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=432954.0, ans=0.09899494936611666 2023-06-19 09:31:34,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.954e+02 3.953e+02 5.206e+02 1.006e+03, threshold=7.907e+02, percent-clipped=9.0 2023-06-19 09:32:10,327 INFO [train.py:996] (2/4) Epoch 3, batch 11200, loss[loss=0.2685, simple_loss=0.3106, pruned_loss=0.1132, over 22020.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3287, pruned_loss=0.104, over 4267922.71 frames. ], batch size: 103, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:33:54,852 INFO [train.py:996] (2/4) Epoch 3, batch 11250, loss[loss=0.2205, simple_loss=0.2859, pruned_loss=0.07761, over 21189.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3272, pruned_loss=0.1036, over 4265740.69 frames. ], batch size: 176, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:33:57,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=433434.0, ans=0.05 2023-06-19 09:33:58,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=433434.0, ans=0.125 2023-06-19 09:35:00,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.899e+02 3.491e+02 4.112e+02 6.627e+02, threshold=6.983e+02, percent-clipped=0.0 2023-06-19 09:35:11,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=433614.0, ans=0.1 2023-06-19 09:35:13,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=433614.0, ans=0.0 2023-06-19 09:35:30,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=433674.0, ans=0.2 2023-06-19 09:35:34,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=433674.0, ans=0.125 2023-06-19 09:35:41,895 INFO [train.py:996] (2/4) Epoch 3, batch 11300, loss[loss=0.2411, simple_loss=0.3026, pruned_loss=0.08977, over 21539.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3281, pruned_loss=0.1038, over 4276553.23 frames. ], batch size: 212, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:35:49,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=433734.0, ans=0.1 2023-06-19 09:35:54,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. 
limit=15.0 2023-06-19 09:36:09,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=433794.0, ans=0.125 2023-06-19 09:36:35,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=433854.0, ans=0.125 2023-06-19 09:36:41,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=15.0 2023-06-19 09:36:56,988 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.91 vs. limit=10.0 2023-06-19 09:37:19,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=433974.0, ans=0.125 2023-06-19 09:37:28,392 INFO [train.py:996] (2/4) Epoch 3, batch 11350, loss[loss=0.3821, simple_loss=0.4296, pruned_loss=0.1673, over 21420.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.332, pruned_loss=0.1044, over 4278741.33 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:38:12,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-19 09:38:28,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=434154.0, ans=0.0 2023-06-19 09:38:45,272 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.245e+02 4.001e+02 4.934e+02 9.082e+02, threshold=8.002e+02, percent-clipped=8.0 2023-06-19 09:38:45,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=434214.0, ans=0.125 2023-06-19 09:38:48,061 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=22.5 2023-06-19 09:39:10,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=434274.0, ans=0.0 2023-06-19 09:39:13,773 INFO [train.py:996] (2/4) Epoch 3, batch 11400, loss[loss=0.2952, simple_loss=0.3521, pruned_loss=0.1191, over 21715.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3387, pruned_loss=0.1079, over 4278474.43 frames. ], batch size: 124, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:39:49,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=434394.0, ans=0.125 2023-06-19 09:40:11,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=434454.0, ans=0.0 2023-06-19 09:40:36,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=434514.0, ans=0.2 2023-06-19 09:41:06,649 INFO [train.py:996] (2/4) Epoch 3, batch 11450, loss[loss=0.3598, simple_loss=0.4068, pruned_loss=0.1564, over 21409.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3408, pruned_loss=0.1072, over 4274022.69 frames. 
], batch size: 508, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:41:10,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=434634.0, ans=0.015 2023-06-19 09:41:43,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=434694.0, ans=0.1 2023-06-19 09:41:45,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=434694.0, ans=0.05 2023-06-19 09:42:19,368 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.967e+02 3.142e+02 3.588e+02 4.835e+02 7.937e+02, threshold=7.176e+02, percent-clipped=0.0 2023-06-19 09:42:37,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=434874.0, ans=0.07 2023-06-19 09:42:53,675 INFO [train.py:996] (2/4) Epoch 3, batch 11500, loss[loss=0.2928, simple_loss=0.3722, pruned_loss=0.1067, over 21647.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3434, pruned_loss=0.1088, over 4267416.95 frames. ], batch size: 389, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:43:32,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=434994.0, ans=0.125 2023-06-19 09:44:04,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=435114.0, ans=0.0 2023-06-19 09:44:13,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435114.0, ans=0.1 2023-06-19 09:44:25,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=435174.0, ans=0.125 2023-06-19 09:44:28,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=435174.0, ans=0.125 2023-06-19 09:44:32,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435174.0, ans=0.1 2023-06-19 09:44:46,156 INFO [train.py:996] (2/4) Epoch 3, batch 11550, loss[loss=0.2859, simple_loss=0.3593, pruned_loss=0.1062, over 21378.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3481, pruned_loss=0.1077, over 4269230.73 frames. ], batch size: 194, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:45:28,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=15.0 2023-06-19 09:45:53,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=435414.0, ans=0.0 2023-06-19 09:45:54,467 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.996e+02 3.693e+02 5.072e+02 8.592e+02, threshold=7.387e+02, percent-clipped=2.0 2023-06-19 09:46:07,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=435414.0, ans=0.125 2023-06-19 09:46:38,546 INFO [train.py:996] (2/4) Epoch 3, batch 11600, loss[loss=0.2853, simple_loss=0.3697, pruned_loss=0.1005, over 21616.00 frames. ], tot_loss[loss=0.2888, simple_loss=0.3596, pruned_loss=0.109, over 4271020.91 frames. 
], batch size: 230, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:46:41,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-19 09:47:08,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=435594.0, ans=0.025 2023-06-19 09:47:15,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=435594.0, ans=0.025 2023-06-19 09:47:26,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=435654.0, ans=0.0 2023-06-19 09:48:20,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=435774.0, ans=0.1 2023-06-19 09:48:24,152 INFO [train.py:996] (2/4) Epoch 3, batch 11650, loss[loss=0.3913, simple_loss=0.4626, pruned_loss=0.16, over 21519.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3663, pruned_loss=0.1097, over 4261924.77 frames. ], batch size: 471, lr: 1.12e-02, grad_scale: 32.0 2023-06-19 09:48:43,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=435834.0, ans=0.0 2023-06-19 09:49:32,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 3.087e+02 3.546e+02 4.429e+02 8.703e+02, threshold=7.092e+02, percent-clipped=3.0 2023-06-19 09:49:44,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436014.0, ans=0.1 2023-06-19 09:50:10,378 INFO [train.py:996] (2/4) Epoch 3, batch 11700, loss[loss=0.2685, simple_loss=0.3188, pruned_loss=0.1091, over 21313.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3577, pruned_loss=0.1097, over 4264056.03 frames. ], batch size: 160, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:50:53,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=436254.0, ans=0.2 2023-06-19 09:51:33,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=436374.0, ans=0.125 2023-06-19 09:51:45,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=436374.0, ans=0.0 2023-06-19 09:51:52,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=436374.0, ans=0.02 2023-06-19 09:51:56,360 INFO [train.py:996] (2/4) Epoch 3, batch 11750, loss[loss=0.241, simple_loss=0.2964, pruned_loss=0.09285, over 21855.00 frames. ], tot_loss[loss=0.283, simple_loss=0.348, pruned_loss=0.109, over 4261226.26 frames. ], batch size: 250, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:52:33,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=436554.0, ans=0.1 2023-06-19 09:53:03,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.275e+02 3.279e+02 3.653e+02 4.613e+02 8.659e+02, threshold=7.305e+02, percent-clipped=4.0 2023-06-19 09:53:41,264 INFO [train.py:996] (2/4) Epoch 3, batch 11800, loss[loss=0.2631, simple_loss=0.3255, pruned_loss=0.1004, over 21207.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3512, pruned_loss=0.1124, over 4264777.40 frames. 
], batch size: 143, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:53:53,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=22.5 2023-06-19 09:54:00,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=436734.0, ans=0.0 2023-06-19 09:54:36,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.10 vs. limit=10.0 2023-06-19 09:55:16,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=436974.0, ans=0.0 2023-06-19 09:55:34,581 INFO [train.py:996] (2/4) Epoch 3, batch 11850, loss[loss=0.2861, simple_loss=0.3575, pruned_loss=0.1074, over 21676.00 frames. ], tot_loss[loss=0.2882, simple_loss=0.3544, pruned_loss=0.111, over 4263367.20 frames. ], batch size: 230, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:56:30,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=437214.0, ans=0.0 2023-06-19 09:56:44,319 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.974e+02 3.438e+02 3.994e+02 6.906e+02, threshold=6.876e+02, percent-clipped=0.0 2023-06-19 09:57:22,871 INFO [train.py:996] (2/4) Epoch 3, batch 11900, loss[loss=0.2752, simple_loss=0.3504, pruned_loss=0.09997, over 21822.00 frames. ], tot_loss[loss=0.2858, simple_loss=0.3548, pruned_loss=0.1084, over 4260618.20 frames. ], batch size: 371, lr: 1.12e-02, grad_scale: 16.0 2023-06-19 09:57:31,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=437334.0, ans=0.1 2023-06-19 09:58:57,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.78 vs. limit=10.0 2023-06-19 09:59:03,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=437574.0, ans=0.125 2023-06-19 09:59:11,135 INFO [train.py:996] (2/4) Epoch 3, batch 11950, loss[loss=0.1929, simple_loss=0.2765, pruned_loss=0.05463, over 21596.00 frames. ], tot_loss[loss=0.281, simple_loss=0.3537, pruned_loss=0.1042, over 4248871.66 frames. ], batch size: 230, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 09:59:31,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=437694.0, ans=0.2 2023-06-19 09:59:37,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.58 vs. 
limit=15.0 2023-06-19 10:00:12,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=437754.0, ans=0.125 2023-06-19 10:00:29,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.939e+02 2.768e+02 3.443e+02 4.738e+02 7.856e+02, threshold=6.886e+02, percent-clipped=6.0 2023-06-19 10:00:43,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=437874.0, ans=0.125 2023-06-19 10:00:45,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=437874.0, ans=0.125 2023-06-19 10:00:56,225 INFO [train.py:996] (2/4) Epoch 3, batch 12000, loss[loss=0.2673, simple_loss=0.3182, pruned_loss=0.1082, over 21616.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3506, pruned_loss=0.1027, over 4248932.77 frames. ], batch size: 332, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:00:56,225 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 10:01:15,364 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.279, simple_loss=0.3755, pruned_loss=0.09124, over 1796401.00 frames. 2023-06-19 10:01:15,365 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 10:01:41,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=437994.0, ans=0.125 2023-06-19 10:01:41,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=437994.0, ans=0.125 2023-06-19 10:02:33,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=438114.0, ans=0.125 2023-06-19 10:02:37,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-19 10:03:02,158 INFO [train.py:996] (2/4) Epoch 3, batch 12050, loss[loss=0.2372, simple_loss=0.2927, pruned_loss=0.09081, over 21571.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.3462, pruned_loss=0.1051, over 4258765.84 frames. ], batch size: 212, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:03:03,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. limit=10.0 2023-06-19 10:03:04,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=438234.0, ans=0.125 2023-06-19 10:03:14,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=438234.0, ans=0.125 2023-06-19 10:03:15,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.89 vs. limit=22.5 2023-06-19 10:03:16,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=438234.0, ans=0.2 2023-06-19 10:03:44,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=438294.0, ans=0.0 2023-06-19 10:03:47,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.35 vs. 
limit=15.0 2023-06-19 10:04:11,561 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:04:18,347 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.247e+02 3.603e+02 4.114e+02 7.694e+02, threshold=7.207e+02, percent-clipped=1.0 2023-06-19 10:04:27,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=438474.0, ans=0.125 2023-06-19 10:04:27,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=438474.0, ans=0.0 2023-06-19 10:04:38,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=438474.0, ans=0.2 2023-06-19 10:04:42,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-19 10:04:49,667 INFO [train.py:996] (2/4) Epoch 3, batch 12100, loss[loss=0.2221, simple_loss=0.3414, pruned_loss=0.05145, over 19861.00 frames. ], tot_loss[loss=0.2871, simple_loss=0.3521, pruned_loss=0.1111, over 4267015.66 frames. ], batch size: 703, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:04:57,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-19 10:05:00,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=438534.0, ans=0.125 2023-06-19 10:05:05,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=438594.0, ans=0.0 2023-06-19 10:05:07,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-19 10:05:18,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.71 vs. limit=15.0 2023-06-19 10:05:42,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=438654.0, ans=0.2 2023-06-19 10:06:05,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-19 10:06:12,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=438714.0, ans=0.2 2023-06-19 10:06:28,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=438774.0, ans=0.125 2023-06-19 10:06:38,106 INFO [train.py:996] (2/4) Epoch 3, batch 12150, loss[loss=0.3105, simple_loss=0.4025, pruned_loss=0.1092, over 21668.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.354, pruned_loss=0.1102, over 4264177.91 frames. 
], batch size: 414, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:07:00,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=438834.0, ans=0.1 2023-06-19 10:07:01,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=438834.0, ans=0.1 2023-06-19 10:07:35,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=438954.0, ans=0.125 2023-06-19 10:07:52,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=439014.0, ans=0.0 2023-06-19 10:07:55,463 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 3.586e+02 4.536e+02 5.596e+02 8.610e+02, threshold=9.073e+02, percent-clipped=5.0 2023-06-19 10:08:22,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=439074.0, ans=0.125 2023-06-19 10:08:33,686 INFO [train.py:996] (2/4) Epoch 3, batch 12200, loss[loss=0.2704, simple_loss=0.3137, pruned_loss=0.1135, over 21496.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3476, pruned_loss=0.1095, over 4255006.52 frames. ], batch size: 441, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:08:56,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. limit=15.0 2023-06-19 10:09:32,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=439314.0, ans=0.125 2023-06-19 10:09:35,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=439314.0, ans=0.1 2023-06-19 10:09:52,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=439374.0, ans=0.07 2023-06-19 10:10:12,844 INFO [train.py:996] (2/4) Epoch 3, batch 12250, loss[loss=0.2074, simple_loss=0.2922, pruned_loss=0.06132, over 21740.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3373, pruned_loss=0.1044, over 4255603.61 frames. ], batch size: 351, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:10:21,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=439434.0, ans=0.125 2023-06-19 10:11:22,042 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.723e+02 2.671e+02 3.353e+02 4.398e+02 1.093e+03, threshold=6.707e+02, percent-clipped=1.0 2023-06-19 10:11:54,834 INFO [train.py:996] (2/4) Epoch 3, batch 12300, loss[loss=0.2118, simple_loss=0.2873, pruned_loss=0.06815, over 21344.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3259, pruned_loss=0.09602, over 4261164.07 frames. ], batch size: 176, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:12:20,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=439794.0, ans=0.125 2023-06-19 10:13:00,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=439914.0, ans=10.0 2023-06-19 10:13:10,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. 
limit=15.0 2023-06-19 10:13:17,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=439974.0, ans=0.125 2023-06-19 10:13:39,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.93 vs. limit=10.0 2023-06-19 10:13:39,844 INFO [train.py:996] (2/4) Epoch 3, batch 12350, loss[loss=0.2846, simple_loss=0.3469, pruned_loss=0.1111, over 21294.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3304, pruned_loss=0.09547, over 4261805.96 frames. ], batch size: 176, lr: 1.11e-02, grad_scale: 16.0 2023-06-19 10:14:53,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.601e+02 2.837e+02 3.590e+02 4.995e+02 8.694e+02, threshold=7.180e+02, percent-clipped=5.0 2023-06-19 10:14:59,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.65 vs. limit=6.0 2023-06-19 10:15:30,269 INFO [train.py:996] (2/4) Epoch 3, batch 12400, loss[loss=0.3389, simple_loss=0.3728, pruned_loss=0.1525, over 21755.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3339, pruned_loss=0.1011, over 4266121.25 frames. ], batch size: 508, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:15:56,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=440394.0, ans=0.125 2023-06-19 10:16:15,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=440454.0, ans=0.0 2023-06-19 10:17:10,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=440574.0, ans=0.125 2023-06-19 10:17:23,795 INFO [train.py:996] (2/4) Epoch 3, batch 12450, loss[loss=0.2822, simple_loss=0.3425, pruned_loss=0.111, over 20854.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3387, pruned_loss=0.105, over 4274605.69 frames. ], batch size: 608, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:17:28,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=440634.0, ans=0.0 2023-06-19 10:17:30,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.94 vs. 
limit=15.0 2023-06-19 10:17:59,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=440694.0, ans=0.0 2023-06-19 10:18:02,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=440754.0, ans=0.0 2023-06-19 10:18:24,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=440814.0, ans=0.125 2023-06-19 10:18:35,160 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.264e+02 3.825e+02 4.731e+02 7.932e+02, threshold=7.651e+02, percent-clipped=1.0 2023-06-19 10:19:00,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=440874.0, ans=0.125 2023-06-19 10:19:07,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=440874.0, ans=0.125 2023-06-19 10:19:12,385 INFO [train.py:996] (2/4) Epoch 3, batch 12500, loss[loss=0.3366, simple_loss=0.4314, pruned_loss=0.1209, over 21641.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3503, pruned_loss=0.1092, over 4273143.43 frames. ], batch size: 389, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:20:10,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-19 10:20:47,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=441174.0, ans=0.125 2023-06-19 10:21:05,746 INFO [train.py:996] (2/4) Epoch 3, batch 12550, loss[loss=0.2398, simple_loss=0.2936, pruned_loss=0.09297, over 19942.00 frames. ], tot_loss[loss=0.2912, simple_loss=0.3566, pruned_loss=0.1129, over 4278542.35 frames. ], batch size: 703, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:21:06,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=441234.0, ans=0.0 2023-06-19 10:22:22,019 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.217e+02 3.804e+02 4.730e+02 9.875e+02, threshold=7.608e+02, percent-clipped=0.0 2023-06-19 10:22:27,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=441414.0, ans=0.125 2023-06-19 10:22:52,428 INFO [train.py:996] (2/4) Epoch 3, batch 12600, loss[loss=0.2209, simple_loss=0.2971, pruned_loss=0.07234, over 21182.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3552, pruned_loss=0.1102, over 4277153.94 frames. 
], batch size: 159, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:22:56,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=441534.0, ans=0.1 2023-06-19 10:22:59,698 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:23:33,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=441594.0, ans=0.0 2023-06-19 10:23:40,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=441654.0, ans=0.04949747468305833 2023-06-19 10:23:40,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=441654.0, ans=0.2 2023-06-19 10:24:14,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=441774.0, ans=0.125 2023-06-19 10:24:26,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-19 10:24:35,255 INFO [train.py:996] (2/4) Epoch 3, batch 12650, loss[loss=0.2178, simple_loss=0.2843, pruned_loss=0.07568, over 21430.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3466, pruned_loss=0.1051, over 4279006.44 frames. ], batch size: 211, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:24:45,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=441834.0, ans=0.125 2023-06-19 10:25:36,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=441954.0, ans=0.125 2023-06-19 10:25:47,078 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.848e+02 3.430e+02 4.434e+02 6.952e+02, threshold=6.860e+02, percent-clipped=1.0 2023-06-19 10:26:05,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=442074.0, ans=0.125 2023-06-19 10:26:09,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.00 vs. limit=12.0 2023-06-19 10:26:17,194 INFO [train.py:996] (2/4) Epoch 3, batch 12700, loss[loss=0.2876, simple_loss=0.3504, pruned_loss=0.1124, over 21182.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3468, pruned_loss=0.1078, over 4275704.15 frames. ], batch size: 143, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:26:52,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=442194.0, ans=0.125 2023-06-19 10:27:56,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-19 10:28:03,215 INFO [train.py:996] (2/4) Epoch 3, batch 12750, loss[loss=0.2747, simple_loss=0.3444, pruned_loss=0.1025, over 21894.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3499, pruned_loss=0.1092, over 4274097.33 frames. 
], batch size: 316, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:28:17,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=442434.0, ans=0.0 2023-06-19 10:28:22,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=442434.0, ans=0.125 2023-06-19 10:29:06,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-19 10:29:16,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.440e+02 3.021e+02 3.668e+02 4.359e+02 6.708e+02, threshold=7.336e+02, percent-clipped=0.0 2023-06-19 10:29:46,082 INFO [train.py:996] (2/4) Epoch 3, batch 12800, loss[loss=0.2841, simple_loss=0.3478, pruned_loss=0.1102, over 20754.00 frames. ], tot_loss[loss=0.2838, simple_loss=0.3486, pruned_loss=0.1095, over 4283610.88 frames. ], batch size: 607, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:31:18,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=442974.0, ans=0.1 2023-06-19 10:31:40,113 INFO [train.py:996] (2/4) Epoch 3, batch 12850, loss[loss=0.3528, simple_loss=0.4364, pruned_loss=0.1346, over 20747.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3524, pruned_loss=0.1124, over 4278882.52 frames. ], batch size: 607, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:32:48,818 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.315e+02 3.077e+02 3.595e+02 4.462e+02 6.932e+02, threshold=7.189e+02, percent-clipped=0.0 2023-06-19 10:33:11,380 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.25 vs. limit=15.0 2023-06-19 10:33:23,130 INFO [train.py:996] (2/4) Epoch 3, batch 12900, loss[loss=0.2206, simple_loss=0.3027, pruned_loss=0.06927, over 21470.00 frames. ], tot_loss[loss=0.2812, simple_loss=0.3484, pruned_loss=0.107, over 4276573.90 frames. ], batch size: 212, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:33:41,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=22.5 2023-06-19 10:33:53,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=443394.0, ans=0.2 2023-06-19 10:34:02,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=443454.0, ans=0.125 2023-06-19 10:35:09,443 INFO [train.py:996] (2/4) Epoch 3, batch 12950, loss[loss=0.2132, simple_loss=0.2853, pruned_loss=0.07053, over 21269.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.346, pruned_loss=0.1045, over 4272985.00 frames. 
], batch size: 176, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:35:19,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=443634.0, ans=0.125 2023-06-19 10:35:38,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=443694.0, ans=0.125 2023-06-19 10:36:01,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=443754.0, ans=0.125 2023-06-19 10:36:02,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-06-19 10:36:18,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=443814.0, ans=0.2 2023-06-19 10:36:23,009 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.936e+02 3.406e+02 4.166e+02 8.200e+02, threshold=6.811e+02, percent-clipped=3.0 2023-06-19 10:36:52,673 INFO [train.py:996] (2/4) Epoch 3, batch 13000, loss[loss=0.1977, simple_loss=0.2699, pruned_loss=0.06271, over 21283.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3496, pruned_loss=0.1055, over 4269490.76 frames. ], batch size: 176, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:37:26,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=443994.0, ans=0.125 2023-06-19 10:38:11,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=444114.0, ans=0.2 2023-06-19 10:38:35,632 INFO [train.py:996] (2/4) Epoch 3, batch 13050, loss[loss=0.2946, simple_loss=0.3531, pruned_loss=0.118, over 21516.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3429, pruned_loss=0.1019, over 4265201.88 frames. ], batch size: 548, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:39:35,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-19 10:39:45,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=444414.0, ans=0.2 2023-06-19 10:39:49,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.662e+02 3.344e+02 3.864e+02 6.973e+02, threshold=6.687e+02, percent-clipped=1.0 2023-06-19 10:40:04,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=444474.0, ans=0.0 2023-06-19 10:40:08,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=22.5 2023-06-19 10:40:20,638 INFO [train.py:996] (2/4) Epoch 3, batch 13100, loss[loss=0.3056, simple_loss=0.3751, pruned_loss=0.118, over 21592.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3446, pruned_loss=0.103, over 4270075.55 frames. 
], batch size: 507, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:40:22,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=444534.0, ans=0.125 2023-06-19 10:40:36,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=444534.0, ans=0.04949747468305833 2023-06-19 10:40:39,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=444534.0, ans=0.1 2023-06-19 10:41:45,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=444714.0, ans=0.04949747468305833 2023-06-19 10:42:09,733 INFO [train.py:996] (2/4) Epoch 3, batch 13150, loss[loss=0.2267, simple_loss=0.2942, pruned_loss=0.07963, over 21725.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3469, pruned_loss=0.1066, over 4277321.94 frames. ], batch size: 282, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:43:24,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.986e+02 3.199e+02 3.918e+02 5.152e+02 8.520e+02, threshold=7.837e+02, percent-clipped=11.0 2023-06-19 10:43:44,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=445074.0, ans=0.125 2023-06-19 10:43:55,606 INFO [train.py:996] (2/4) Epoch 3, batch 13200, loss[loss=0.2865, simple_loss=0.3475, pruned_loss=0.1127, over 21229.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3453, pruned_loss=0.1069, over 4273247.10 frames. ], batch size: 143, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:45:28,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=445374.0, ans=0.125 2023-06-19 10:45:34,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-19 10:45:38,776 INFO [train.py:996] (2/4) Epoch 3, batch 13250, loss[loss=0.2687, simple_loss=0.3452, pruned_loss=0.09615, over 21566.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3466, pruned_loss=0.1089, over 4278750.46 frames. ], batch size: 263, lr: 1.11e-02, grad_scale: 32.0 2023-06-19 10:45:59,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=445494.0, ans=0.125 2023-06-19 10:46:12,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-19 10:46:18,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=445494.0, ans=0.2 2023-06-19 10:46:40,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=445554.0, ans=0.2 2023-06-19 10:46:59,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.966e+02 3.440e+02 4.161e+02 7.151e+02, threshold=6.880e+02, percent-clipped=0.0 2023-06-19 10:47:12,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.92 vs. limit=15.0 2023-06-19 10:47:29,774 INFO [train.py:996] (2/4) Epoch 3, batch 13300, loss[loss=0.404, simple_loss=0.4465, pruned_loss=0.1807, over 21356.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3508, pruned_loss=0.1082, over 4273095.84 frames. 
], batch size: 507, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:47:49,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=445734.0, ans=0.2 2023-06-19 10:47:50,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.12 vs. limit=5.0 2023-06-19 10:48:06,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=445794.0, ans=0.1 2023-06-19 10:48:07,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.82 vs. limit=10.0 2023-06-19 10:48:38,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=445914.0, ans=0.125 2023-06-19 10:49:13,533 INFO [train.py:996] (2/4) Epoch 3, batch 13350, loss[loss=0.3104, simple_loss=0.3729, pruned_loss=0.124, over 21405.00 frames. ], tot_loss[loss=0.2883, simple_loss=0.355, pruned_loss=0.1108, over 4270904.99 frames. ], batch size: 194, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:49:17,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=446034.0, ans=0.125 2023-06-19 10:50:08,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=446154.0, ans=0.125 2023-06-19 10:50:20,924 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 3.225e+02 3.868e+02 4.526e+02 7.710e+02, threshold=7.735e+02, percent-clipped=4.0 2023-06-19 10:50:55,987 INFO [train.py:996] (2/4) Epoch 3, batch 13400, loss[loss=0.314, simple_loss=0.3647, pruned_loss=0.1317, over 21797.00 frames. ], tot_loss[loss=0.2925, simple_loss=0.3575, pruned_loss=0.1138, over 4280454.17 frames. ], batch size: 112, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:51:30,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=446394.0, ans=0.2 2023-06-19 10:52:43,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=446574.0, ans=0.125 2023-06-19 10:52:46,486 INFO [train.py:996] (2/4) Epoch 3, batch 13450, loss[loss=0.3038, simple_loss=0.3714, pruned_loss=0.118, over 21493.00 frames. ], tot_loss[loss=0.2965, simple_loss=0.3597, pruned_loss=0.1167, over 4276521.04 frames. ], batch size: 131, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:53:28,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=446754.0, ans=22.5 2023-06-19 10:53:56,294 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.091e+02 3.617e+02 4.599e+02 7.916e+02, threshold=7.234e+02, percent-clipped=2.0 2023-06-19 10:54:31,181 INFO [train.py:996] (2/4) Epoch 3, batch 13500, loss[loss=0.3224, simple_loss=0.3819, pruned_loss=0.1314, over 21347.00 frames. ], tot_loss[loss=0.2843, simple_loss=0.3458, pruned_loss=0.1114, over 4256460.40 frames. 
], batch size: 549, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:55:08,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=447054.0, ans=0.0 2023-06-19 10:55:08,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=22.5 2023-06-19 10:55:14,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=447054.0, ans=0.125 2023-06-19 10:55:49,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=447114.0, ans=0.1 2023-06-19 10:56:14,937 INFO [train.py:996] (2/4) Epoch 3, batch 13550, loss[loss=0.3331, simple_loss=0.4171, pruned_loss=0.1246, over 21720.00 frames. ], tot_loss[loss=0.2875, simple_loss=0.3515, pruned_loss=0.1118, over 4260190.26 frames. ], batch size: 414, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 10:56:21,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=447234.0, ans=0.125 2023-06-19 10:56:26,892 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:57:36,378 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.954e+02 3.584e+02 4.667e+02 1.065e+03, threshold=7.167e+02, percent-clipped=1.0 2023-06-19 10:57:38,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=447414.0, ans=0.125 2023-06-19 10:57:57,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.24 vs. limit=15.0 2023-06-19 10:58:03,823 INFO [train.py:996] (2/4) Epoch 3, batch 13600, loss[loss=0.3002, simple_loss=0.3516, pruned_loss=0.1243, over 21594.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3521, pruned_loss=0.1126, over 4268383.05 frames. ], batch size: 548, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:58:05,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=447534.0, ans=0.1 2023-06-19 10:58:32,180 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 10:59:45,700 INFO [train.py:996] (2/4) Epoch 3, batch 13650, loss[loss=0.2603, simple_loss=0.3168, pruned_loss=0.1019, over 21646.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3469, pruned_loss=0.1084, over 4270793.88 frames. ], batch size: 332, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 10:59:55,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=447834.0, ans=0.125 2023-06-19 11:01:01,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 3.087e+02 4.228e+02 5.339e+02 1.090e+03, threshold=8.456e+02, percent-clipped=10.0 2023-06-19 11:01:23,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=448074.0, ans=0.125 2023-06-19 11:01:28,424 INFO [train.py:996] (2/4) Epoch 3, batch 13700, loss[loss=0.2761, simple_loss=0.3405, pruned_loss=0.1059, over 20735.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3413, pruned_loss=0.1079, over 4263991.19 frames. 
], batch size: 607, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:02:38,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=448314.0, ans=0.07 2023-06-19 11:03:11,627 INFO [train.py:996] (2/4) Epoch 3, batch 13750, loss[loss=0.2634, simple_loss=0.3358, pruned_loss=0.09547, over 21621.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3371, pruned_loss=0.1052, over 4260021.59 frames. ], batch size: 389, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:04:34,595 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.128e+02 3.276e+02 4.043e+02 5.146e+02 9.090e+02, threshold=8.085e+02, percent-clipped=5.0 2023-06-19 11:04:35,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=448614.0, ans=0.0 2023-06-19 11:04:56,207 INFO [train.py:996] (2/4) Epoch 3, batch 13800, loss[loss=0.3206, simple_loss=0.4093, pruned_loss=0.116, over 21846.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3422, pruned_loss=0.1039, over 4253242.37 frames. ], batch size: 371, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:04:57,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2023-06-19 11:05:44,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=448854.0, ans=0.1 2023-06-19 11:06:25,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=448974.0, ans=0.125 2023-06-19 11:06:38,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=448974.0, ans=0.0 2023-06-19 11:06:50,100 INFO [train.py:996] (2/4) Epoch 3, batch 13850, loss[loss=0.3585, simple_loss=0.4238, pruned_loss=0.1466, over 21726.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3492, pruned_loss=0.1054, over 4261483.27 frames. ], batch size: 441, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:06:57,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=449034.0, ans=0.125 2023-06-19 11:07:18,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=449094.0, ans=0.125 2023-06-19 11:07:47,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=449154.0, ans=0.0 2023-06-19 11:08:01,641 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 2.934e+02 3.600e+02 4.326e+02 8.652e+02, threshold=7.199e+02, percent-clipped=1.0 2023-06-19 11:08:32,524 INFO [train.py:996] (2/4) Epoch 3, batch 13900, loss[loss=0.3301, simple_loss=0.4093, pruned_loss=0.1255, over 20793.00 frames. ], tot_loss[loss=0.288, simple_loss=0.3547, pruned_loss=0.1107, over 4266261.76 frames. ], batch size: 608, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:08:48,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449334.0, ans=0.1 2023-06-19 11:08:55,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.83 vs. 
limit=10.0 2023-06-19 11:09:10,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=449394.0, ans=0.125 2023-06-19 11:09:22,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=449454.0, ans=0.0 2023-06-19 11:09:25,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=449454.0, ans=0.1 2023-06-19 11:10:12,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=449574.0, ans=0.125 2023-06-19 11:10:15,226 INFO [train.py:996] (2/4) Epoch 3, batch 13950, loss[loss=0.2997, simple_loss=0.3928, pruned_loss=0.1033, over 20818.00 frames. ], tot_loss[loss=0.2922, simple_loss=0.3564, pruned_loss=0.114, over 4277921.49 frames. ], batch size: 608, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:10:48,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=449694.0, ans=0.2 2023-06-19 11:11:30,134 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.793e+02 3.017e+02 3.528e+02 4.292e+02 7.550e+02, threshold=7.057e+02, percent-clipped=1.0 2023-06-19 11:11:32,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=449814.0, ans=0.125 2023-06-19 11:11:55,738 INFO [train.py:996] (2/4) Epoch 3, batch 14000, loss[loss=0.252, simple_loss=0.3223, pruned_loss=0.09082, over 21815.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.3545, pruned_loss=0.1114, over 4276753.21 frames. ], batch size: 351, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:12:35,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=449994.0, ans=0.0 2023-06-19 11:12:35,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=449994.0, ans=0.125 2023-06-19 11:13:36,949 INFO [train.py:996] (2/4) Epoch 3, batch 14050, loss[loss=0.2359, simple_loss=0.2945, pruned_loss=0.08864, over 21614.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.346, pruned_loss=0.1049, over 4267253.77 frames. ], batch size: 263, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:14:06,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=450294.0, ans=0.125 2023-06-19 11:14:54,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.755e+02 3.797e+02 5.310e+02 9.461e+02, threshold=7.595e+02, percent-clipped=8.0 2023-06-19 11:15:13,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.63 vs. limit=22.5 2023-06-19 11:15:24,586 INFO [train.py:996] (2/4) Epoch 3, batch 14100, loss[loss=0.2418, simple_loss=0.294, pruned_loss=0.09479, over 21664.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.341, pruned_loss=0.1048, over 4268105.49 frames. 
], batch size: 282, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:15:36,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=450534.0, ans=0.0 2023-06-19 11:15:48,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=450594.0, ans=0.2 2023-06-19 11:16:14,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=450654.0, ans=0.1 2023-06-19 11:16:21,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=450714.0, ans=0.125 2023-06-19 11:16:37,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-19 11:16:58,772 INFO [train.py:996] (2/4) Epoch 3, batch 14150, loss[loss=0.2506, simple_loss=0.3317, pruned_loss=0.08474, over 21908.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3436, pruned_loss=0.1062, over 4254124.04 frames. ], batch size: 107, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:17:31,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.08 vs. limit=6.0 2023-06-19 11:17:44,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=450954.0, ans=22.5 2023-06-19 11:18:14,084 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.701e+02 3.295e+02 4.437e+02 7.217e+02, threshold=6.589e+02, percent-clipped=0.0 2023-06-19 11:18:38,246 INFO [train.py:996] (2/4) Epoch 3, batch 14200, loss[loss=0.2762, simple_loss=0.3342, pruned_loss=0.1091, over 21462.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3402, pruned_loss=0.1038, over 4258342.36 frames. ], batch size: 131, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:18:39,302 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-19 11:18:59,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=451194.0, ans=0.0 2023-06-19 11:19:06,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=451194.0, ans=0.125 2023-06-19 11:19:23,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=451254.0, ans=0.125 2023-06-19 11:19:42,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=451314.0, ans=0.0 2023-06-19 11:19:59,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=451374.0, ans=0.0 2023-06-19 11:20:20,553 INFO [train.py:996] (2/4) Epoch 3, batch 14250, loss[loss=0.2711, simple_loss=0.3489, pruned_loss=0.09666, over 21496.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3364, pruned_loss=0.1047, over 4266678.92 frames. 
], batch size: 211, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:21:38,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=451614.0, ans=0.04949747468305833 2023-06-19 11:21:39,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.964e+02 2.808e+02 3.716e+02 4.608e+02 1.130e+03, threshold=7.432e+02, percent-clipped=7.0 2023-06-19 11:22:04,571 INFO [train.py:996] (2/4) Epoch 3, batch 14300, loss[loss=0.2667, simple_loss=0.3135, pruned_loss=0.1099, over 21549.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3373, pruned_loss=0.1039, over 4260183.92 frames. ], batch size: 247, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:22:26,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=451794.0, ans=0.125 2023-06-19 11:22:33,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=451794.0, ans=0.125 2023-06-19 11:23:27,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=451974.0, ans=0.0 2023-06-19 11:23:46,823 INFO [train.py:996] (2/4) Epoch 3, batch 14350, loss[loss=0.2605, simple_loss=0.3361, pruned_loss=0.09245, over 19976.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3417, pruned_loss=0.105, over 4240960.18 frames. ], batch size: 703, lr: 1.10e-02, grad_scale: 16.0 2023-06-19 11:23:52,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=452034.0, ans=0.125 2023-06-19 11:25:04,031 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 3.074e+02 3.738e+02 4.858e+02 1.364e+03, threshold=7.476e+02, percent-clipped=9.0 2023-06-19 11:25:28,329 INFO [train.py:996] (2/4) Epoch 3, batch 14400, loss[loss=0.2699, simple_loss=0.3215, pruned_loss=0.1092, over 21612.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3415, pruned_loss=0.1065, over 4252504.00 frames. ], batch size: 414, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:25:54,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=452394.0, ans=10.0 2023-06-19 11:26:29,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=452514.0, ans=0.125 2023-06-19 11:26:38,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=452514.0, ans=0.2 2023-06-19 11:27:06,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.68 vs. limit=15.0 2023-06-19 11:27:09,072 INFO [train.py:996] (2/4) Epoch 3, batch 14450, loss[loss=0.2703, simple_loss=0.3246, pruned_loss=0.108, over 21764.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.3354, pruned_loss=0.1064, over 4254638.02 frames. 
], batch size: 351, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:27:59,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=452754.0, ans=0.0 2023-06-19 11:28:26,402 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.178e+02 3.835e+02 4.854e+02 8.477e+02, threshold=7.671e+02, percent-clipped=4.0 2023-06-19 11:28:29,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=22.5 2023-06-19 11:28:36,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-19 11:28:44,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=6.0 2023-06-19 11:28:51,113 INFO [train.py:996] (2/4) Epoch 3, batch 14500, loss[loss=0.3249, simple_loss=0.4053, pruned_loss=0.1223, over 19802.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3342, pruned_loss=0.1063, over 4258610.72 frames. ], batch size: 702, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:28:51,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=452934.0, ans=0.1 2023-06-19 11:29:06,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=452994.0, ans=0.0 2023-06-19 11:29:20,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=452994.0, ans=0.0 2023-06-19 11:29:41,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=453054.0, ans=0.1 2023-06-19 11:29:59,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=453114.0, ans=0.125 2023-06-19 11:30:32,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=15.0 2023-06-19 11:30:34,342 INFO [train.py:996] (2/4) Epoch 3, batch 14550, loss[loss=0.3059, simple_loss=0.3686, pruned_loss=0.1216, over 21571.00 frames. ], tot_loss[loss=0.2793, simple_loss=0.3401, pruned_loss=0.1092, over 4267472.98 frames. ], batch size: 389, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:31:36,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. 
limit=15.0 2023-06-19 11:31:39,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=453414.0, ans=0.125 2023-06-19 11:31:52,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=453414.0, ans=0.0 2023-06-19 11:31:57,101 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.259e+02 4.086e+02 5.797e+02 9.548e+02, threshold=8.171e+02, percent-clipped=6.0 2023-06-19 11:31:57,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=453414.0, ans=0.0 2023-06-19 11:31:59,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=453474.0, ans=0.0 2023-06-19 11:32:11,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=453474.0, ans=0.0 2023-06-19 11:32:16,966 INFO [train.py:996] (2/4) Epoch 3, batch 14600, loss[loss=0.2839, simple_loss=0.3609, pruned_loss=0.1034, over 21409.00 frames. ], tot_loss[loss=0.2879, simple_loss=0.349, pruned_loss=0.1135, over 4268666.63 frames. ], batch size: 211, lr: 1.10e-02, grad_scale: 32.0 2023-06-19 11:33:05,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-19 11:33:45,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-19 11:33:58,776 INFO [train.py:996] (2/4) Epoch 3, batch 14650, loss[loss=0.1831, simple_loss=0.2644, pruned_loss=0.05091, over 21388.00 frames. ], tot_loss[loss=0.2857, simple_loss=0.3485, pruned_loss=0.1115, over 4272252.09 frames. ], batch size: 211, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:34:19,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=453834.0, ans=0.1 2023-06-19 11:34:49,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=453954.0, ans=0.125 2023-06-19 11:35:20,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=454014.0, ans=0.0 2023-06-19 11:35:24,899 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.424e+02 2.524e+02 3.223e+02 4.029e+02 6.900e+02, threshold=6.446e+02, percent-clipped=0.0 2023-06-19 11:35:50,410 INFO [train.py:996] (2/4) Epoch 3, batch 14700, loss[loss=0.2417, simple_loss=0.3234, pruned_loss=0.07999, over 21499.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3443, pruned_loss=0.1059, over 4264853.48 frames. ], batch size: 212, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:36:13,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-19 11:36:34,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=454254.0, ans=0.125 2023-06-19 11:37:38,442 INFO [train.py:996] (2/4) Epoch 3, batch 14750, loss[loss=0.2972, simple_loss=0.3589, pruned_loss=0.1178, over 20748.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.349, pruned_loss=0.1088, over 4266448.62 frames. 
], batch size: 608, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:38:24,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=454554.0, ans=0.125 2023-06-19 11:38:28,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-19 11:38:42,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=15.0 2023-06-19 11:38:51,301 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 3.078e+02 3.855e+02 4.833e+02 8.936e+02, threshold=7.710e+02, percent-clipped=7.0 2023-06-19 11:39:00,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=454674.0, ans=0.1 2023-06-19 11:39:20,991 INFO [train.py:996] (2/4) Epoch 3, batch 14800, loss[loss=0.3115, simple_loss=0.3707, pruned_loss=0.1261, over 20646.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.36, pruned_loss=0.1148, over 4265960.39 frames. ], batch size: 607, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:40:07,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=454854.0, ans=0.125 2023-06-19 11:40:09,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=454854.0, ans=0.125 2023-06-19 11:40:14,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=454854.0, ans=0.125 2023-06-19 11:40:16,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=454854.0, ans=0.2 2023-06-19 11:41:04,771 INFO [train.py:996] (2/4) Epoch 3, batch 14850, loss[loss=0.2812, simple_loss=0.3264, pruned_loss=0.118, over 21204.00 frames. ], tot_loss[loss=0.2919, simple_loss=0.3546, pruned_loss=0.1147, over 4264341.54 frames. ], batch size: 176, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:41:09,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=20.07 vs. limit=15.0 2023-06-19 11:41:18,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.41 vs. limit=22.5 2023-06-19 11:41:19,985 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 11:41:20,491 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=15.0 2023-06-19 11:41:47,657 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=12.0 2023-06-19 11:41:47,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. 
limit=6.0 2023-06-19 11:42:13,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=455214.0, ans=0.125 2023-06-19 11:42:17,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=455214.0, ans=0.125 2023-06-19 11:42:23,065 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.130e+02 3.900e+02 4.785e+02 9.691e+02, threshold=7.799e+02, percent-clipped=2.0 2023-06-19 11:42:53,934 INFO [train.py:996] (2/4) Epoch 3, batch 14900, loss[loss=0.3013, simple_loss=0.3556, pruned_loss=0.1235, over 21291.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.356, pruned_loss=0.1143, over 4260862.07 frames. ], batch size: 159, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:43:07,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=455334.0, ans=0.125 2023-06-19 11:43:13,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=455394.0, ans=0.2 2023-06-19 11:43:16,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=22.5 2023-06-19 11:43:41,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=455454.0, ans=0.2 2023-06-19 11:44:18,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=455574.0, ans=0.0 2023-06-19 11:44:36,800 INFO [train.py:996] (2/4) Epoch 3, batch 14950, loss[loss=0.2433, simple_loss=0.2941, pruned_loss=0.09623, over 21854.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3568, pruned_loss=0.1139, over 4270007.64 frames. ], batch size: 98, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:45:54,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=455814.0, ans=0.05 2023-06-19 11:45:55,113 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.136e+02 3.778e+02 4.659e+02 7.505e+02, threshold=7.556e+02, percent-clipped=0.0 2023-06-19 11:46:15,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=455874.0, ans=0.0 2023-06-19 11:46:20,065 INFO [train.py:996] (2/4) Epoch 3, batch 15000, loss[loss=0.2575, simple_loss=0.3253, pruned_loss=0.09482, over 21774.00 frames. ], tot_loss[loss=0.2951, simple_loss=0.359, pruned_loss=0.1156, over 4262926.80 frames. ], batch size: 102, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:46:20,065 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 11:46:36,895 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2722, simple_loss=0.3734, pruned_loss=0.08553, over 1796401.00 frames. 
2023-06-19 11:46:36,896 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 11:46:49,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=455934.0, ans=0.0 2023-06-19 11:47:07,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=455994.0, ans=0.125 2023-06-19 11:47:15,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=455994.0, ans=0.2 2023-06-19 11:48:00,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=456114.0, ans=0.0 2023-06-19 11:48:07,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=456174.0, ans=0.125 2023-06-19 11:48:26,274 INFO [train.py:996] (2/4) Epoch 3, batch 15050, loss[loss=0.2571, simple_loss=0.3053, pruned_loss=0.1045, over 21899.00 frames. ], tot_loss[loss=0.2934, simple_loss=0.3567, pruned_loss=0.115, over 4241140.79 frames. ], batch size: 107, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:48:26,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=456234.0, ans=0.125 2023-06-19 11:48:48,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=456294.0, ans=0.125 2023-06-19 11:49:43,747 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.179e+02 3.857e+02 4.850e+02 8.474e+02, threshold=7.714e+02, percent-clipped=3.0 2023-06-19 11:50:08,236 INFO [train.py:996] (2/4) Epoch 3, batch 15100, loss[loss=0.3388, simple_loss=0.4009, pruned_loss=0.1384, over 21340.00 frames. ], tot_loss[loss=0.295, simple_loss=0.3597, pruned_loss=0.1151, over 4254192.67 frames. ], batch size: 548, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:50:25,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=456534.0, ans=0.1 2023-06-19 11:50:28,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=456594.0, ans=0.125 2023-06-19 11:51:05,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=456654.0, ans=0.1 2023-06-19 11:51:05,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=456654.0, ans=0.125 2023-06-19 11:51:56,401 INFO [train.py:996] (2/4) Epoch 3, batch 15150, loss[loss=0.2685, simple_loss=0.3198, pruned_loss=0.1086, over 21770.00 frames. ], tot_loss[loss=0.2948, simple_loss=0.3576, pruned_loss=0.116, over 4253366.65 frames. 
], batch size: 317, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:52:31,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=456894.0, ans=0.0 2023-06-19 11:52:45,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=456954.0, ans=0.125 2023-06-19 11:52:48,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=456954.0, ans=0.125 2023-06-19 11:53:13,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 3.143e+02 3.638e+02 4.416e+02 6.792e+02, threshold=7.275e+02, percent-clipped=0.0 2023-06-19 11:53:38,825 INFO [train.py:996] (2/4) Epoch 3, batch 15200, loss[loss=0.2893, simple_loss=0.3633, pruned_loss=0.1077, over 20127.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3487, pruned_loss=0.1119, over 4255499.73 frames. ], batch size: 702, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:54:22,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457254.0, ans=0.1 2023-06-19 11:54:35,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=457254.0, ans=0.125 2023-06-19 11:54:35,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=457254.0, ans=0.5 2023-06-19 11:55:20,846 INFO [train.py:996] (2/4) Epoch 3, batch 15250, loss[loss=0.2347, simple_loss=0.2987, pruned_loss=0.08531, over 21646.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.342, pruned_loss=0.1094, over 4258002.00 frames. ], batch size: 282, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:55:53,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.29 vs. limit=15.0 2023-06-19 11:56:06,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=457554.0, ans=0.125 2023-06-19 11:56:06,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=457554.0, ans=0.125 2023-06-19 11:56:16,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=457554.0, ans=0.125 2023-06-19 11:56:35,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=457614.0, ans=0.1 2023-06-19 11:56:43,900 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.938e+02 3.532e+02 4.223e+02 6.837e+02, threshold=7.064e+02, percent-clipped=0.0 2023-06-19 11:57:08,079 INFO [train.py:996] (2/4) Epoch 3, batch 15300, loss[loss=0.2944, simple_loss=0.3579, pruned_loss=0.1155, over 21956.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3451, pruned_loss=0.1127, over 4269835.42 frames. ], batch size: 372, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:58:44,341 INFO [train.py:996] (2/4) Epoch 3, batch 15350, loss[loss=0.319, simple_loss=0.3879, pruned_loss=0.125, over 21821.00 frames. ], tot_loss[loss=0.2906, simple_loss=0.3494, pruned_loss=0.1159, over 4275745.60 frames. 
], batch size: 118, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 11:59:31,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=458154.0, ans=0.125 2023-06-19 11:59:37,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=458154.0, ans=0.025 2023-06-19 11:59:54,550 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 3.004e+02 3.619e+02 4.702e+02 1.047e+03, threshold=7.238e+02, percent-clipped=6.0 2023-06-19 12:00:24,064 INFO [train.py:996] (2/4) Epoch 3, batch 15400, loss[loss=0.2631, simple_loss=0.3327, pruned_loss=0.0967, over 21920.00 frames. ], tot_loss[loss=0.2886, simple_loss=0.35, pruned_loss=0.1136, over 4275897.37 frames. ], batch size: 107, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:00:47,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=458394.0, ans=0.125 2023-06-19 12:00:55,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.52 vs. limit=10.0 2023-06-19 12:01:15,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=458454.0, ans=0.05 2023-06-19 12:01:39,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=458574.0, ans=0.2 2023-06-19 12:01:50,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=458574.0, ans=0.0 2023-06-19 12:02:06,925 INFO [train.py:996] (2/4) Epoch 3, batch 15450, loss[loss=0.2687, simple_loss=0.3236, pruned_loss=0.1069, over 21625.00 frames. ], tot_loss[loss=0.2862, simple_loss=0.3471, pruned_loss=0.1126, over 4286104.35 frames. ], batch size: 548, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:02:08,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=458634.0, ans=0.125 2023-06-19 12:02:43,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=458694.0, ans=0.125 2023-06-19 12:03:23,953 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.117e+02 2.889e+02 3.450e+02 3.978e+02 6.262e+02, threshold=6.899e+02, percent-clipped=0.0 2023-06-19 12:03:31,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=458874.0, ans=0.125 2023-06-19 12:03:54,376 INFO [train.py:996] (2/4) Epoch 3, batch 15500, loss[loss=0.2946, simple_loss=0.3582, pruned_loss=0.1155, over 21671.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.3509, pruned_loss=0.1115, over 4278481.83 frames. ], batch size: 351, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:05:26,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-19 12:05:27,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=459174.0, ans=0.0 2023-06-19 12:05:37,228 INFO [train.py:996] (2/4) Epoch 3, batch 15550, loss[loss=0.2106, simple_loss=0.271, pruned_loss=0.07514, over 21887.00 frames. 
], tot_loss[loss=0.2836, simple_loss=0.3496, pruned_loss=0.1088, over 4275306.89 frames. ], batch size: 98, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:05:45,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=459234.0, ans=0.0 2023-06-19 12:05:59,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=459294.0, ans=0.125 2023-06-19 12:06:02,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=459294.0, ans=0.0 2023-06-19 12:06:15,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=459354.0, ans=0.2 2023-06-19 12:06:34,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=459354.0, ans=0.2 2023-06-19 12:06:40,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=15.0 2023-06-19 12:06:42,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=459414.0, ans=0.125 2023-06-19 12:06:54,193 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 2.997e+02 3.470e+02 4.241e+02 8.422e+02, threshold=6.941e+02, percent-clipped=1.0 2023-06-19 12:07:15,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=459474.0, ans=0.125 2023-06-19 12:07:17,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=459534.0, ans=0.125 2023-06-19 12:07:17,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=459534.0, ans=0.025 2023-06-19 12:07:18,553 INFO [train.py:996] (2/4) Epoch 3, batch 15600, loss[loss=0.2599, simple_loss=0.3096, pruned_loss=0.1051, over 21270.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.342, pruned_loss=0.1066, over 4279243.62 frames. ], batch size: 160, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:07:31,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=459534.0, ans=0.025 2023-06-19 12:07:50,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=459594.0, ans=0.0 2023-06-19 12:07:52,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.00 vs. limit=15.0 2023-06-19 12:09:03,346 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=1.867e-02 2023-06-19 12:09:05,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=459834.0, ans=0.125 2023-06-19 12:09:06,303 INFO [train.py:996] (2/4) Epoch 3, batch 15650, loss[loss=0.2575, simple_loss=0.3118, pruned_loss=0.1016, over 21608.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3409, pruned_loss=0.1061, over 4279646.07 frames. 
], batch size: 247, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:10:22,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.856e+02 3.445e+02 4.569e+02 7.529e+02, threshold=6.891e+02, percent-clipped=2.0 2023-06-19 12:10:47,552 INFO [train.py:996] (2/4) Epoch 3, batch 15700, loss[loss=0.3033, simple_loss=0.3501, pruned_loss=0.1282, over 21280.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3368, pruned_loss=0.1048, over 4272849.28 frames. ], batch size: 471, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:11:19,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=460254.0, ans=0.0 2023-06-19 12:11:47,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=460314.0, ans=0.125 2023-06-19 12:11:50,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=22.5 2023-06-19 12:12:15,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=460374.0, ans=0.0 2023-06-19 12:12:28,116 INFO [train.py:996] (2/4) Epoch 3, batch 15750, loss[loss=0.2791, simple_loss=0.3307, pruned_loss=0.1138, over 21565.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3326, pruned_loss=0.1051, over 4266929.02 frames. ], batch size: 414, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:12:43,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=460494.0, ans=0.0 2023-06-19 12:13:46,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.862e+02 3.455e+02 4.105e+02 6.683e+02, threshold=6.910e+02, percent-clipped=1.0 2023-06-19 12:14:09,466 INFO [train.py:996] (2/4) Epoch 3, batch 15800, loss[loss=0.2766, simple_loss=0.3167, pruned_loss=0.1183, over 21518.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3286, pruned_loss=0.1052, over 4271951.22 frames. ], batch size: 442, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:14:18,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=460734.0, ans=0.2 2023-06-19 12:14:25,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=460794.0, ans=0.125 2023-06-19 12:15:06,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=460854.0, ans=0.125 2023-06-19 12:15:25,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=460914.0, ans=0.0 2023-06-19 12:15:35,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=460974.0, ans=0.2 2023-06-19 12:15:52,364 INFO [train.py:996] (2/4) Epoch 3, batch 15850, loss[loss=0.2621, simple_loss=0.3344, pruned_loss=0.09489, over 21263.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3305, pruned_loss=0.1073, over 4275418.30 frames. 
], batch size: 159, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:16:01,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=461034.0, ans=0.05 2023-06-19 12:16:09,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=461094.0, ans=0.035 2023-06-19 12:16:09,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=461094.0, ans=0.125 2023-06-19 12:16:12,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=461094.0, ans=0.125 2023-06-19 12:16:15,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-06-19 12:17:10,945 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 2.908e+02 3.639e+02 4.203e+02 7.869e+02, threshold=7.277e+02, percent-clipped=1.0 2023-06-19 12:17:35,040 INFO [train.py:996] (2/4) Epoch 3, batch 15900, loss[loss=0.2631, simple_loss=0.3151, pruned_loss=0.1056, over 20168.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.328, pruned_loss=0.1079, over 4274800.01 frames. ], batch size: 707, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:17:37,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=461334.0, ans=0.125 2023-06-19 12:17:45,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=461334.0, ans=0.0 2023-06-19 12:18:03,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=461394.0, ans=0.125 2023-06-19 12:18:19,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=461454.0, ans=0.125 2023-06-19 12:18:36,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=461514.0, ans=0.125 2023-06-19 12:18:37,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.74 vs. limit=10.0 2023-06-19 12:18:50,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=461514.0, ans=0.05 2023-06-19 12:18:50,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=461514.0, ans=0.0 2023-06-19 12:19:11,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=461574.0, ans=0.1 2023-06-19 12:19:17,505 INFO [train.py:996] (2/4) Epoch 3, batch 15950, loss[loss=0.2034, simple_loss=0.2986, pruned_loss=0.0541, over 21772.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3305, pruned_loss=0.1053, over 4267935.80 frames. 
], batch size: 332, lr: 1.09e-02, grad_scale: 16.0 2023-06-19 12:19:22,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=461634.0, ans=0.125 2023-06-19 12:19:26,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=461634.0, ans=0.125 2023-06-19 12:19:28,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=461634.0, ans=0.2 2023-06-19 12:19:32,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=461694.0, ans=0.0 2023-06-19 12:20:36,058 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.763e+02 2.793e+02 3.209e+02 4.219e+02 1.070e+03, threshold=6.418e+02, percent-clipped=5.0 2023-06-19 12:20:59,873 INFO [train.py:996] (2/4) Epoch 3, batch 16000, loss[loss=0.2785, simple_loss=0.362, pruned_loss=0.09751, over 21676.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3314, pruned_loss=0.1036, over 4268641.85 frames. ], batch size: 389, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:21:14,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=461994.0, ans=0.125 2023-06-19 12:21:16,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=461994.0, ans=0.125 2023-06-19 12:21:19,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=461994.0, ans=0.0 2023-06-19 12:22:42,087 INFO [train.py:996] (2/4) Epoch 3, batch 16050, loss[loss=0.2261, simple_loss=0.2923, pruned_loss=0.07993, over 21410.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3372, pruned_loss=0.1018, over 4261762.06 frames. ], batch size: 176, lr: 1.09e-02, grad_scale: 32.0 2023-06-19 12:22:49,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=462234.0, ans=0.125 2023-06-19 12:23:23,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.61 vs. limit=15.0 2023-06-19 12:24:00,108 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 3.084e+02 3.532e+02 4.519e+02 7.240e+02, threshold=7.063e+02, percent-clipped=4.0 2023-06-19 12:24:23,194 INFO [train.py:996] (2/4) Epoch 3, batch 16100, loss[loss=0.2846, simple_loss=0.3424, pruned_loss=0.1134, over 21782.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3407, pruned_loss=0.1025, over 4266400.31 frames. ], batch size: 112, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:24:25,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462534.0, ans=0.1 2023-06-19 12:24:58,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=462654.0, ans=0.1 2023-06-19 12:25:57,549 INFO [train.py:996] (2/4) Epoch 3, batch 16150, loss[loss=0.2838, simple_loss=0.3506, pruned_loss=0.1085, over 21761.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3416, pruned_loss=0.1051, over 4280862.07 frames. 
], batch size: 389, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:26:03,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-19 12:26:15,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=462894.0, ans=0.125 2023-06-19 12:26:26,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=462894.0, ans=0.0 2023-06-19 12:26:33,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=462954.0, ans=0.125 2023-06-19 12:26:42,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-19 12:27:16,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 2.948e+02 3.427e+02 4.312e+02 9.423e+02, threshold=6.854e+02, percent-clipped=2.0 2023-06-19 12:27:20,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=463074.0, ans=0.0 2023-06-19 12:27:39,987 INFO [train.py:996] (2/4) Epoch 3, batch 16200, loss[loss=0.2803, simple_loss=0.3642, pruned_loss=0.09817, over 21682.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3442, pruned_loss=0.1064, over 4287208.18 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:27:40,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=463134.0, ans=0.125 2023-06-19 12:27:48,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=463134.0, ans=0.2 2023-06-19 12:27:49,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=463134.0, ans=0.125 2023-06-19 12:28:02,149 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:28:08,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=463194.0, ans=0.0 2023-06-19 12:29:06,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=463374.0, ans=0.125 2023-06-19 12:29:21,939 INFO [train.py:996] (2/4) Epoch 3, batch 16250, loss[loss=0.2745, simple_loss=0.3435, pruned_loss=0.1027, over 21761.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3442, pruned_loss=0.1069, over 4276209.13 frames. ], batch size: 371, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:29:37,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-19 12:29:39,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.27 vs. 
limit=15.0 2023-06-19 12:30:22,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=463614.0, ans=0.125 2023-06-19 12:30:46,013 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.838e+02 2.795e+02 3.241e+02 4.405e+02 7.562e+02, threshold=6.482e+02, percent-clipped=2.0 2023-06-19 12:30:48,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=463674.0, ans=0.0 2023-06-19 12:31:03,324 INFO [train.py:996] (2/4) Epoch 3, batch 16300, loss[loss=0.216, simple_loss=0.2874, pruned_loss=0.07232, over 21601.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3372, pruned_loss=0.1021, over 4271227.52 frames. ], batch size: 263, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:31:18,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=463794.0, ans=0.125 2023-06-19 12:31:26,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=463794.0, ans=0.0 2023-06-19 12:31:34,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=463854.0, ans=0.0 2023-06-19 12:32:37,024 INFO [train.py:996] (2/4) Epoch 3, batch 16350, loss[loss=0.402, simple_loss=0.435, pruned_loss=0.1845, over 21425.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3357, pruned_loss=0.1017, over 4269224.01 frames. ], batch size: 510, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:32:43,033 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.83 vs. limit=15.0 2023-06-19 12:33:22,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-19 12:33:53,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=464214.0, ans=0.1 2023-06-19 12:34:02,466 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.064e+02 3.648e+02 5.135e+02 1.076e+03, threshold=7.296e+02, percent-clipped=9.0 2023-06-19 12:34:11,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=464274.0, ans=0.0 2023-06-19 12:34:18,669 INFO [train.py:996] (2/4) Epoch 3, batch 16400, loss[loss=0.2776, simple_loss=0.3357, pruned_loss=0.1098, over 21862.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3405, pruned_loss=0.1044, over 4278043.01 frames. ], batch size: 371, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:35:07,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-19 12:35:38,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=464514.0, ans=0.125 2023-06-19 12:35:52,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=15.0 2023-06-19 12:36:00,618 INFO [train.py:996] (2/4) Epoch 3, batch 16450, loss[loss=0.2638, simple_loss=0.3287, pruned_loss=0.09945, over 21852.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3416, pruned_loss=0.1064, over 4279951.96 frames. 
], batch size: 107, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:36:20,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=464694.0, ans=0.05 2023-06-19 12:37:21,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 3.011e+02 3.468e+02 3.986e+02 7.351e+02, threshold=6.935e+02, percent-clipped=1.0 2023-06-19 12:37:33,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-19 12:37:38,567 INFO [train.py:996] (2/4) Epoch 3, batch 16500, loss[loss=0.3585, simple_loss=0.4141, pruned_loss=0.1515, over 21510.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3402, pruned_loss=0.1067, over 4280936.48 frames. ], batch size: 508, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:37:50,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=464934.0, ans=0.125 2023-06-19 12:38:46,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.35 vs. limit=15.0 2023-06-19 12:38:47,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-19 12:38:52,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=465114.0, ans=0.125 2023-06-19 12:39:15,971 INFO [train.py:996] (2/4) Epoch 3, batch 16550, loss[loss=0.2607, simple_loss=0.3266, pruned_loss=0.09734, over 21473.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3362, pruned_loss=0.103, over 4283603.94 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:40:42,243 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.124e+02 3.787e+02 4.304e+02 9.133e+02, threshold=7.574e+02, percent-clipped=3.0 2023-06-19 12:41:09,210 INFO [train.py:996] (2/4) Epoch 3, batch 16600, loss[loss=0.3138, simple_loss=0.4013, pruned_loss=0.1131, over 21839.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3456, pruned_loss=0.1066, over 4281979.53 frames. ], batch size: 371, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:42:27,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-19 12:42:42,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=465774.0, ans=0.125 2023-06-19 12:42:47,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=465774.0, ans=0.09899494936611666 2023-06-19 12:42:59,038 INFO [train.py:996] (2/4) Epoch 3, batch 16650, loss[loss=0.3154, simple_loss=0.4064, pruned_loss=0.1122, over 20758.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.3583, pruned_loss=0.1103, over 4277463.47 frames. ], batch size: 607, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:43:08,296 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:43:15,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.04 vs. 
limit=12.0 2023-06-19 12:43:46,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.86 vs. limit=15.0 2023-06-19 12:43:57,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=465954.0, ans=0.0 2023-06-19 12:44:10,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-19 12:44:11,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=466014.0, ans=0.0 2023-06-19 12:44:27,312 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.247e+02 3.781e+02 4.657e+02 6.369e+02, threshold=7.563e+02, percent-clipped=0.0 2023-06-19 12:44:43,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=466074.0, ans=0.125 2023-06-19 12:44:49,111 INFO [train.py:996] (2/4) Epoch 3, batch 16700, loss[loss=0.2363, simple_loss=0.2983, pruned_loss=0.08719, over 21724.00 frames. ], tot_loss[loss=0.2917, simple_loss=0.3593, pruned_loss=0.112, over 4280267.58 frames. ], batch size: 247, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:44:53,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=466134.0, ans=0.0 2023-06-19 12:45:28,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=466254.0, ans=0.04949747468305833 2023-06-19 12:45:53,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=466314.0, ans=0.0 2023-06-19 12:46:07,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.45 vs. limit=6.0 2023-06-19 12:46:08,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=466314.0, ans=0.125 2023-06-19 12:46:21,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-06-19 12:46:34,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=466434.0, ans=0.5 2023-06-19 12:46:35,206 INFO [train.py:996] (2/4) Epoch 3, batch 16750, loss[loss=0.2985, simple_loss=0.3674, pruned_loss=0.1148, over 20721.00 frames. ], tot_loss[loss=0.2946, simple_loss=0.361, pruned_loss=0.1141, over 4279291.65 frames. ], batch size: 607, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:46:37,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=466434.0, ans=0.125 2023-06-19 12:46:50,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-19 12:46:55,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.81 vs. 
limit=15.0 2023-06-19 12:46:59,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=466494.0, ans=8.0 2023-06-19 12:48:01,819 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.898e+02 3.370e+02 4.211e+02 9.702e+02, threshold=6.740e+02, percent-clipped=1.0 2023-06-19 12:48:03,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=466674.0, ans=0.125 2023-06-19 12:48:17,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-19 12:48:22,825 INFO [train.py:996] (2/4) Epoch 3, batch 16800, loss[loss=0.3251, simple_loss=0.3851, pruned_loss=0.1325, over 21782.00 frames. ], tot_loss[loss=0.2957, simple_loss=0.3653, pruned_loss=0.1131, over 4275133.10 frames. ], batch size: 441, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:49:38,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=466914.0, ans=0.125 2023-06-19 12:50:04,953 INFO [train.py:996] (2/4) Epoch 3, batch 16850, loss[loss=0.2573, simple_loss=0.3177, pruned_loss=0.09844, over 21242.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3617, pruned_loss=0.1135, over 4278582.30 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:50:10,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2023-06-19 12:50:34,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-19 12:51:06,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=467154.0, ans=0.0 2023-06-19 12:51:10,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-19 12:51:16,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=467214.0, ans=0.0 2023-06-19 12:51:25,342 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 2.898e+02 3.370e+02 4.330e+02 9.168e+02, threshold=6.739e+02, percent-clipped=5.0 2023-06-19 12:51:45,967 INFO [train.py:996] (2/4) Epoch 3, batch 16900, loss[loss=0.2664, simple_loss=0.3136, pruned_loss=0.1096, over 21191.00 frames. ], tot_loss[loss=0.2897, simple_loss=0.3555, pruned_loss=0.1119, over 4281776.43 frames. ], batch size: 159, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:52:39,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=467454.0, ans=0.125 2023-06-19 12:52:42,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=467454.0, ans=0.0 2023-06-19 12:52:45,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. 
limit=15.0 2023-06-19 12:53:25,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=467634.0, ans=0.1 2023-06-19 12:53:26,676 INFO [train.py:996] (2/4) Epoch 3, batch 16950, loss[loss=0.2911, simple_loss=0.3426, pruned_loss=0.1198, over 21943.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3488, pruned_loss=0.1108, over 4290585.59 frames. ], batch size: 316, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:54:20,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=467754.0, ans=0.2 2023-06-19 12:54:32,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=467814.0, ans=0.0 2023-06-19 12:54:46,694 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.133e+02 2.834e+02 3.306e+02 3.951e+02 5.809e+02, threshold=6.612e+02, percent-clipped=0.0 2023-06-19 12:55:07,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=467934.0, ans=0.0 2023-06-19 12:55:08,165 INFO [train.py:996] (2/4) Epoch 3, batch 17000, loss[loss=0.2601, simple_loss=0.2901, pruned_loss=0.115, over 20042.00 frames. ], tot_loss[loss=0.2827, simple_loss=0.3445, pruned_loss=0.1105, over 4292566.62 frames. ], batch size: 704, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:55:23,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-19 12:55:24,544 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 12:55:50,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-06-19 12:56:17,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.54 vs. limit=22.5 2023-06-19 12:56:25,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=468114.0, ans=0.2 2023-06-19 12:56:49,433 INFO [train.py:996] (2/4) Epoch 3, batch 17050, loss[loss=0.3132, simple_loss=0.3694, pruned_loss=0.1285, over 21423.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.3524, pruned_loss=0.1133, over 4294693.73 frames. ], batch size: 131, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:57:00,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=468234.0, ans=0.07 2023-06-19 12:57:43,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=468354.0, ans=0.125 2023-06-19 12:58:00,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=468414.0, ans=0.125 2023-06-19 12:58:14,530 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.418e+02 3.024e+02 3.438e+02 4.032e+02 7.555e+02, threshold=6.877e+02, percent-clipped=1.0 2023-06-19 12:58:30,297 INFO [train.py:996] (2/4) Epoch 3, batch 17100, loss[loss=0.2773, simple_loss=0.3369, pruned_loss=0.1089, over 21952.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3508, pruned_loss=0.1137, over 4297454.83 frames. 
], batch size: 333, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 12:58:40,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=468534.0, ans=0.04949747468305833 2023-06-19 12:59:34,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=468714.0, ans=0.125 2023-06-19 13:00:11,711 INFO [train.py:996] (2/4) Epoch 3, batch 17150, loss[loss=0.2445, simple_loss=0.3119, pruned_loss=0.08851, over 21161.00 frames. ], tot_loss[loss=0.2867, simple_loss=0.3465, pruned_loss=0.1134, over 4301616.63 frames. ], batch size: 608, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:00:59,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-19 13:01:03,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-19 13:01:32,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469014.0, ans=0.125 2023-06-19 13:01:38,244 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 2.874e+02 3.285e+02 3.849e+02 6.375e+02, threshold=6.570e+02, percent-clipped=0.0 2023-06-19 13:02:09,775 INFO [train.py:996] (2/4) Epoch 3, batch 17200, loss[loss=0.2833, simple_loss=0.3481, pruned_loss=0.1093, over 21734.00 frames. ], tot_loss[loss=0.2855, simple_loss=0.3454, pruned_loss=0.1129, over 4296386.34 frames. ], batch size: 332, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:02:23,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=469134.0, ans=0.125 2023-06-19 13:02:28,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=469194.0, ans=0.125 2023-06-19 13:02:30,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=469194.0, ans=0.125 2023-06-19 13:02:55,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=469254.0, ans=0.015 2023-06-19 13:03:22,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=469314.0, ans=0.0 2023-06-19 13:03:23,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=469314.0, ans=0.0 2023-06-19 13:03:47,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=469374.0, ans=0.125 2023-06-19 13:03:53,553 INFO [train.py:996] (2/4) Epoch 3, batch 17250, loss[loss=0.3314, simple_loss=0.3971, pruned_loss=0.1329, over 21843.00 frames. ], tot_loss[loss=0.2896, simple_loss=0.3496, pruned_loss=0.1148, over 4293521.84 frames. ], batch size: 124, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:04:04,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=469434.0, ans=0.125 2023-06-19 13:05:19,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.15 vs. 
limit=6.0 2023-06-19 13:05:20,323 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.292e+02 3.993e+02 5.117e+02 9.442e+02, threshold=7.987e+02, percent-clipped=7.0 2023-06-19 13:05:37,159 INFO [train.py:996] (2/4) Epoch 3, batch 17300, loss[loss=0.3283, simple_loss=0.3817, pruned_loss=0.1374, over 21387.00 frames. ], tot_loss[loss=0.2978, simple_loss=0.3588, pruned_loss=0.1184, over 4286658.66 frames. ], batch size: 549, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:05:41,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=469734.0, ans=0.125 2023-06-19 13:05:56,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.70 vs. limit=6.0 2023-06-19 13:06:21,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=469854.0, ans=0.125 2023-06-19 13:06:46,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=469914.0, ans=0.125 2023-06-19 13:06:56,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=469914.0, ans=0.125 2023-06-19 13:07:08,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=469974.0, ans=0.125 2023-06-19 13:07:15,553 INFO [train.py:996] (2/4) Epoch 3, batch 17350, loss[loss=0.3577, simple_loss=0.4099, pruned_loss=0.1527, over 21383.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3591, pruned_loss=0.1174, over 4285583.76 frames. ], batch size: 508, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:07:44,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=470094.0, ans=0.125 2023-06-19 13:07:44,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=470094.0, ans=0.0 2023-06-19 13:08:16,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5 2023-06-19 13:08:42,141 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.156e+02 2.889e+02 3.414e+02 4.320e+02 8.908e+02, threshold=6.829e+02, percent-clipped=3.0 2023-06-19 13:08:53,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=470274.0, ans=0.0 2023-06-19 13:08:58,928 INFO [train.py:996] (2/4) Epoch 3, batch 17400, loss[loss=0.2224, simple_loss=0.2697, pruned_loss=0.08755, over 21139.00 frames. ], tot_loss[loss=0.2898, simple_loss=0.3538, pruned_loss=0.1128, over 4269674.28 frames. ], batch size: 143, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:09:21,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=470394.0, ans=0.125 2023-06-19 13:10:20,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=470514.0, ans=0.125 2023-06-19 13:10:47,919 INFO [train.py:996] (2/4) Epoch 3, batch 17450, loss[loss=0.2398, simple_loss=0.3374, pruned_loss=0.07107, over 21154.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3492, pruned_loss=0.1093, over 4275299.42 frames. 
], batch size: 548, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:11:11,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=470634.0, ans=0.2 2023-06-19 13:11:35,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=470754.0, ans=0.0 2023-06-19 13:11:56,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-19 13:11:59,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=470814.0, ans=0.2 2023-06-19 13:12:06,949 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 2.874e+02 3.534e+02 4.725e+02 8.315e+02, threshold=7.067e+02, percent-clipped=5.0 2023-06-19 13:12:24,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.52 vs. limit=22.5 2023-06-19 13:12:27,752 INFO [train.py:996] (2/4) Epoch 3, batch 17500, loss[loss=0.2603, simple_loss=0.3191, pruned_loss=0.1008, over 21265.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3439, pruned_loss=0.1068, over 4278889.63 frames. ], batch size: 176, lr: 1.08e-02, grad_scale: 32.0 2023-06-19 13:13:13,827 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:13:26,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=471054.0, ans=0.125 2023-06-19 13:14:07,330 INFO [train.py:996] (2/4) Epoch 3, batch 17550, loss[loss=0.2557, simple_loss=0.3417, pruned_loss=0.08488, over 21367.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3451, pruned_loss=0.1055, over 4282635.91 frames. ], batch size: 131, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:14:13,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-19 13:14:16,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-19 13:15:26,296 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.757e+02 3.626e+02 4.370e+02 8.420e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-19 13:15:28,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=471474.0, ans=0.125 2023-06-19 13:15:48,156 INFO [train.py:996] (2/4) Epoch 3, batch 17600, loss[loss=0.3742, simple_loss=0.4699, pruned_loss=0.1393, over 19813.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3478, pruned_loss=0.1058, over 4275136.35 frames. 
], batch size: 703, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:16:06,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=471534.0, ans=0.1 2023-06-19 13:16:07,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=471534.0, ans=0.125 2023-06-19 13:16:38,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=471654.0, ans=0.0 2023-06-19 13:16:56,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=471714.0, ans=0.0 2023-06-19 13:17:35,478 INFO [train.py:996] (2/4) Epoch 3, batch 17650, loss[loss=0.2346, simple_loss=0.2949, pruned_loss=0.08715, over 21717.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3437, pruned_loss=0.1053, over 4269080.74 frames. ], batch size: 298, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:17:38,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=22.5 2023-06-19 13:18:23,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=471954.0, ans=15.0 2023-06-19 13:18:33,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=472014.0, ans=0.125 2023-06-19 13:18:33,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=472014.0, ans=0.1 2023-06-19 13:18:56,013 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.888e+02 3.326e+02 4.505e+02 7.697e+02, threshold=6.651e+02, percent-clipped=2.0 2023-06-19 13:18:56,973 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:19:17,605 INFO [train.py:996] (2/4) Epoch 3, batch 17700, loss[loss=0.2716, simple_loss=0.3352, pruned_loss=0.104, over 21374.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3353, pruned_loss=0.1011, over 4260568.09 frames. ], batch size: 143, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:19:32,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-06-19 13:19:44,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=472194.0, ans=0.0 2023-06-19 13:19:45,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.02 vs. 
limit=6.0 2023-06-19 13:20:05,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=472254.0, ans=0.95 2023-06-19 13:20:07,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=472254.0, ans=0.025 2023-06-19 13:20:08,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472254.0, ans=0.1 2023-06-19 13:20:15,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=472254.0, ans=0.2 2023-06-19 13:20:23,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=472314.0, ans=0.1 2023-06-19 13:20:36,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=472314.0, ans=0.125 2023-06-19 13:20:58,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=472374.0, ans=0.0 2023-06-19 13:21:10,367 INFO [train.py:996] (2/4) Epoch 3, batch 17750, loss[loss=0.2897, simple_loss=0.3597, pruned_loss=0.1098, over 21846.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.3457, pruned_loss=0.1065, over 4266537.06 frames. ], batch size: 282, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:21:14,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2023-06-19 13:21:19,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=472434.0, ans=0.2 2023-06-19 13:21:25,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=472494.0, ans=0.05 2023-06-19 13:21:29,245 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:22:32,936 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.975e+02 3.463e+02 4.383e+02 8.374e+02, threshold=6.927e+02, percent-clipped=5.0 2023-06-19 13:22:54,301 INFO [train.py:996] (2/4) Epoch 3, batch 17800, loss[loss=0.2382, simple_loss=0.2997, pruned_loss=0.08835, over 21304.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3457, pruned_loss=0.1055, over 4269032.67 frames. ], batch size: 159, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:24:37,172 INFO [train.py:996] (2/4) Epoch 3, batch 17850, loss[loss=0.3263, simple_loss=0.3873, pruned_loss=0.1327, over 21435.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3446, pruned_loss=0.1053, over 4270661.05 frames. 
], batch size: 131, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:25:16,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=473094.0, ans=0.125 2023-06-19 13:25:16,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=473094.0, ans=0.125 2023-06-19 13:25:17,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473094.0, ans=0.1 2023-06-19 13:25:24,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=473154.0, ans=0.2 2023-06-19 13:25:26,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-19 13:25:33,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=473154.0, ans=0.125 2023-06-19 13:25:54,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=473214.0, ans=0.125 2023-06-19 13:25:56,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473214.0, ans=0.1 2023-06-19 13:25:59,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=473214.0, ans=0.0 2023-06-19 13:26:02,590 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.042e+02 3.981e+02 5.013e+02 8.666e+02, threshold=7.962e+02, percent-clipped=5.0 2023-06-19 13:26:06,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=473274.0, ans=0.2 2023-06-19 13:26:18,599 INFO [train.py:996] (2/4) Epoch 3, batch 17900, loss[loss=0.2841, simple_loss=0.3505, pruned_loss=0.1088, over 21272.00 frames. ], tot_loss[loss=0.2821, simple_loss=0.3493, pruned_loss=0.1074, over 4273806.26 frames. ], batch size: 159, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:26:34,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=473334.0, ans=0.2 2023-06-19 13:26:34,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=473334.0, ans=0.04949747468305833 2023-06-19 13:27:13,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=473454.0, ans=0.125 2023-06-19 13:27:47,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473574.0, ans=0.1 2023-06-19 13:28:06,436 INFO [train.py:996] (2/4) Epoch 3, batch 17950, loss[loss=0.2108, simple_loss=0.2817, pruned_loss=0.06995, over 21812.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3487, pruned_loss=0.1035, over 4269979.56 frames. ], batch size: 118, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:28:58,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.22 vs. 
limit=15.0 2023-06-19 13:29:26,503 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.695e+02 3.486e+02 4.539e+02 1.017e+03, threshold=6.972e+02, percent-clipped=4.0 2023-06-19 13:29:28,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=473874.0, ans=0.125 2023-06-19 13:29:42,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=12.0 2023-06-19 13:29:47,390 INFO [train.py:996] (2/4) Epoch 3, batch 18000, loss[loss=0.2613, simple_loss=0.3082, pruned_loss=0.1072, over 21244.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.343, pruned_loss=0.1031, over 4275232.87 frames. ], batch size: 471, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:29:47,390 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 13:30:08,405 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2748, simple_loss=0.3795, pruned_loss=0.08502, over 1796401.00 frames. 2023-06-19 13:30:08,405 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 13:30:44,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=473994.0, ans=0.1 2023-06-19 13:30:44,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=473994.0, ans=0.1 2023-06-19 13:31:49,937 INFO [train.py:996] (2/4) Epoch 3, batch 18050, loss[loss=0.2773, simple_loss=0.3359, pruned_loss=0.1094, over 21639.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3383, pruned_loss=0.1025, over 4267605.97 frames. ], batch size: 332, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:31:52,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=474234.0, ans=0.125 2023-06-19 13:32:28,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.98 vs. limit=15.0 2023-06-19 13:33:10,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.274e+02 3.707e+02 4.391e+02 9.006e+02, threshold=7.414e+02, percent-clipped=2.0 2023-06-19 13:33:32,294 INFO [train.py:996] (2/4) Epoch 3, batch 18100, loss[loss=0.2869, simple_loss=0.3719, pruned_loss=0.1009, over 21691.00 frames. ], tot_loss[loss=0.2766, simple_loss=0.3426, pruned_loss=0.1053, over 4268443.72 frames. ], batch size: 351, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:33:40,540 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-19 13:33:55,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=474534.0, ans=0.0 2023-06-19 13:34:09,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=474594.0, ans=0.125 2023-06-19 13:35:18,758 INFO [train.py:996] (2/4) Epoch 3, batch 18150, loss[loss=0.251, simple_loss=0.3033, pruned_loss=0.09937, over 15638.00 frames. ], tot_loss[loss=0.2776, simple_loss=0.3441, pruned_loss=0.1055, over 4260257.63 frames. 
], batch size: 61, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:35:23,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=474834.0, ans=0.0 2023-06-19 13:35:41,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=474894.0, ans=0.125 2023-06-19 13:35:50,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=474894.0, ans=0.1 2023-06-19 13:35:51,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.41 vs. limit=10.0 2023-06-19 13:35:53,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=474954.0, ans=0.0 2023-06-19 13:36:31,890 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.411e+02 3.099e+02 3.760e+02 4.824e+02 9.400e+02, threshold=7.520e+02, percent-clipped=8.0 2023-06-19 13:36:52,795 INFO [train.py:996] (2/4) Epoch 3, batch 18200, loss[loss=0.2483, simple_loss=0.3124, pruned_loss=0.09209, over 21532.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3395, pruned_loss=0.106, over 4246585.54 frames. ], batch size: 195, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:36:56,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=475134.0, ans=0.95 2023-06-19 13:37:48,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=475314.0, ans=0.125 2023-06-19 13:37:58,594 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 13:38:06,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=475374.0, ans=0.0 2023-06-19 13:38:32,011 INFO [train.py:996] (2/4) Epoch 3, batch 18250, loss[loss=0.2082, simple_loss=0.2674, pruned_loss=0.07455, over 21319.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3299, pruned_loss=0.1017, over 4247883.70 frames. ], batch size: 144, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:38:32,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=475434.0, ans=0.125 2023-06-19 13:38:32,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=475434.0, ans=0.125 2023-06-19 13:38:42,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=475434.0, ans=0.0 2023-06-19 13:39:45,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.479e+02 3.020e+02 3.989e+02 8.042e+02, threshold=6.040e+02, percent-clipped=2.0 2023-06-19 13:40:06,834 INFO [train.py:996] (2/4) Epoch 3, batch 18300, loss[loss=0.2997, simple_loss=0.3735, pruned_loss=0.1129, over 21699.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3298, pruned_loss=0.102, over 4258425.58 frames. ], batch size: 389, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:40:22,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. 
limit=6.0 2023-06-19 13:40:28,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-06-19 13:40:51,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=475854.0, ans=0.1 2023-06-19 13:41:03,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=22.5 2023-06-19 13:41:22,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-19 13:41:42,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=475974.0, ans=0.05 2023-06-19 13:41:46,938 INFO [train.py:996] (2/4) Epoch 3, batch 18350, loss[loss=0.256, simple_loss=0.3147, pruned_loss=0.09863, over 21236.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3365, pruned_loss=0.1021, over 4251521.35 frames. ], batch size: 159, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:42:31,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.51 vs. limit=15.0 2023-06-19 13:42:46,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=476154.0, ans=0.2 2023-06-19 13:42:53,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476214.0, ans=0.1 2023-06-19 13:42:59,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=476214.0, ans=0.125 2023-06-19 13:43:08,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.933e+02 3.430e+02 4.228e+02 7.523e+02, threshold=6.860e+02, percent-clipped=6.0 2023-06-19 13:43:26,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.16 vs. limit=10.0 2023-06-19 13:43:28,070 INFO [train.py:996] (2/4) Epoch 3, batch 18400, loss[loss=0.2421, simple_loss=0.316, pruned_loss=0.08405, over 21620.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3325, pruned_loss=0.1008, over 4246207.92 frames. ], batch size: 391, lr: 1.07e-02, grad_scale: 32.0 2023-06-19 13:43:47,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=476394.0, ans=0.1 2023-06-19 13:43:49,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=476394.0, ans=0.0 2023-06-19 13:44:00,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=15.0 2023-06-19 13:45:08,802 INFO [train.py:996] (2/4) Epoch 3, batch 18450, loss[loss=0.1885, simple_loss=0.2575, pruned_loss=0.05975, over 21198.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3275, pruned_loss=0.0965, over 4232487.02 frames. 
], batch size: 143, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:45:42,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=476694.0, ans=0.1 2023-06-19 13:45:48,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=476694.0, ans=0.125 2023-06-19 13:46:31,979 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.812e+02 3.346e+02 4.382e+02 1.092e+03, threshold=6.692e+02, percent-clipped=3.0 2023-06-19 13:46:34,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=15.0 2023-06-19 13:46:49,843 INFO [train.py:996] (2/4) Epoch 3, batch 18500, loss[loss=0.2425, simple_loss=0.2963, pruned_loss=0.09434, over 21520.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3216, pruned_loss=0.09403, over 4231664.77 frames. ], batch size: 442, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:46:50,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff3.min_abs, batch_count=476934.0, ans=0.2 2023-06-19 13:46:56,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-06-19 13:47:51,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=477054.0, ans=0.0 2023-06-19 13:47:56,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-19 13:48:05,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=477114.0, ans=10.0 2023-06-19 13:48:23,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=477174.0, ans=0.015 2023-06-19 13:48:26,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=477174.0, ans=0.125 2023-06-19 13:48:30,918 INFO [train.py:996] (2/4) Epoch 3, batch 18550, loss[loss=0.2179, simple_loss=0.3099, pruned_loss=0.06297, over 20773.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.32, pruned_loss=0.09352, over 4234540.79 frames. ], batch size: 608, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:48:52,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=477294.0, ans=0.0 2023-06-19 13:48:56,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-19 13:49:59,815 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 3.099e+02 3.527e+02 4.215e+02 7.049e+02, threshold=7.053e+02, percent-clipped=1.0 2023-06-19 13:50:02,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=477474.0, ans=0.09899494936611666 2023-06-19 13:50:13,156 INFO [train.py:996] (2/4) Epoch 3, batch 18600, loss[loss=0.3675, simple_loss=0.4128, pruned_loss=0.1611, over 21434.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3184, pruned_loss=0.09494, over 4237876.03 frames. 
], batch size: 508, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:51:00,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=477654.0, ans=0.125 2023-06-19 13:51:59,742 INFO [train.py:996] (2/4) Epoch 3, batch 18650, loss[loss=0.2585, simple_loss=0.3169, pruned_loss=0.1, over 21720.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3198, pruned_loss=0.09586, over 4248282.75 frames. ], batch size: 333, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:52:50,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=477954.0, ans=0.125 2023-06-19 13:53:21,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.016e+02 3.610e+02 4.241e+02 7.263e+02, threshold=7.220e+02, percent-clipped=2.0 2023-06-19 13:53:33,752 INFO [train.py:996] (2/4) Epoch 3, batch 18700, loss[loss=0.2939, simple_loss=0.3502, pruned_loss=0.1188, over 21801.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3183, pruned_loss=0.09801, over 4263213.95 frames. ], batch size: 124, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:53:53,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=478194.0, ans=0.2 2023-06-19 13:55:06,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-19 13:55:15,235 INFO [train.py:996] (2/4) Epoch 3, batch 18750, loss[loss=0.2983, simple_loss=0.365, pruned_loss=0.1158, over 21596.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3195, pruned_loss=0.1001, over 4274428.17 frames. ], batch size: 230, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:56:07,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=478554.0, ans=0.125 2023-06-19 13:56:08,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=478554.0, ans=0.0 2023-06-19 13:56:43,454 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.016e+02 3.473e+02 4.351e+02 6.634e+02, threshold=6.946e+02, percent-clipped=0.0 2023-06-19 13:56:55,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-19 13:56:56,602 INFO [train.py:996] (2/4) Epoch 3, batch 18800, loss[loss=0.2099, simple_loss=0.2896, pruned_loss=0.06509, over 21289.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.328, pruned_loss=0.1023, over 4269745.82 frames. ], batch size: 176, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:57:33,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=478794.0, ans=0.125 2023-06-19 13:58:31,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0 2023-06-19 13:58:38,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=478974.0, ans=0.125 2023-06-19 13:58:44,205 INFO [train.py:996] (2/4) Epoch 3, batch 18850, loss[loss=0.2292, simple_loss=0.2954, pruned_loss=0.08155, over 21671.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3246, pruned_loss=0.09723, over 4268199.96 frames. 
], batch size: 298, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 13:58:57,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=479034.0, ans=0.1 2023-06-19 13:59:15,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=479094.0, ans=0.125 2023-06-19 13:59:31,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=479154.0, ans=0.0 2023-06-19 14:00:05,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.64 vs. limit=10.0 2023-06-19 14:00:08,736 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.815e+02 2.738e+02 3.222e+02 4.135e+02 8.390e+02, threshold=6.445e+02, percent-clipped=2.0 2023-06-19 14:00:24,842 INFO [train.py:996] (2/4) Epoch 3, batch 18900, loss[loss=0.2706, simple_loss=0.3255, pruned_loss=0.1079, over 22026.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3199, pruned_loss=0.0966, over 4270269.10 frames. ], batch size: 119, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:02:07,439 INFO [train.py:996] (2/4) Epoch 3, batch 18950, loss[loss=0.3023, simple_loss=0.3716, pruned_loss=0.1165, over 21304.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3235, pruned_loss=0.1007, over 4271291.66 frames. ], batch size: 176, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:02:22,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=479634.0, ans=0.125 2023-06-19 14:03:08,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.97 vs. limit=10.0 2023-06-19 14:03:22,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=479814.0, ans=0.0 2023-06-19 14:03:40,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.897e+02 3.488e+02 4.402e+02 6.601e+02, threshold=6.976e+02, percent-clipped=2.0 2023-06-19 14:03:52,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=479874.0, ans=0.05 2023-06-19 14:03:55,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=479934.0, ans=0.125 2023-06-19 14:03:56,939 INFO [train.py:996] (2/4) Epoch 3, batch 19000, loss[loss=0.3941, simple_loss=0.4293, pruned_loss=0.1794, over 21321.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3339, pruned_loss=0.1029, over 4273832.07 frames. 
], batch size: 507, lr: 1.07e-02, grad_scale: 16.0 2023-06-19 14:04:04,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=479934.0, ans=0.2 2023-06-19 14:04:45,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=480054.0, ans=0.125 2023-06-19 14:04:47,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=480054.0, ans=0.125 2023-06-19 14:05:05,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=480114.0, ans=0.2 2023-06-19 14:05:39,676 INFO [train.py:996] (2/4) Epoch 3, batch 19050, loss[loss=0.2726, simple_loss=0.3269, pruned_loss=0.1092, over 21777.00 frames. ], tot_loss[loss=0.2773, simple_loss=0.3395, pruned_loss=0.1075, over 4278749.68 frames. ], batch size: 247, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:06:05,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=480294.0, ans=0.0 2023-06-19 14:06:11,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=480294.0, ans=0.125 2023-06-19 14:06:12,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=480294.0, ans=0.0 2023-06-19 14:06:29,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=480354.0, ans=0.125 2023-06-19 14:06:46,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=480414.0, ans=0.0 2023-06-19 14:06:46,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=480414.0, ans=0.2 2023-06-19 14:07:04,466 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.299e+02 3.242e+02 3.668e+02 4.263e+02 6.635e+02, threshold=7.336e+02, percent-clipped=0.0 2023-06-19 14:07:08,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2023-06-19 14:07:21,801 INFO [train.py:996] (2/4) Epoch 3, batch 19100, loss[loss=0.2785, simple_loss=0.3208, pruned_loss=0.1181, over 21300.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3378, pruned_loss=0.1086, over 4284925.44 frames. ], batch size: 144, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:07:50,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=480594.0, ans=10.0 2023-06-19 14:08:20,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=480714.0, ans=0.125 2023-06-19 14:08:22,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=480714.0, ans=0.125 2023-06-19 14:08:36,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=480714.0, ans=0.1 2023-06-19 14:09:11,240 INFO [train.py:996] (2/4) Epoch 3, batch 19150, loss[loss=0.3843, simple_loss=0.4543, pruned_loss=0.1572, over 21496.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3406, pruned_loss=0.1096, over 4281968.61 frames. 
], batch size: 471, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:09:17,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=480834.0, ans=0.1 2023-06-19 14:09:44,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=480894.0, ans=0.125 2023-06-19 14:09:58,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=480954.0, ans=0.125 2023-06-19 14:10:23,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=481014.0, ans=0.125 2023-06-19 14:10:31,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=481014.0, ans=0.1 2023-06-19 14:10:43,717 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 3.009e+02 3.597e+02 4.510e+02 7.028e+02, threshold=7.194e+02, percent-clipped=0.0 2023-06-19 14:10:55,135 INFO [train.py:996] (2/4) Epoch 3, batch 19200, loss[loss=0.2952, simple_loss=0.3925, pruned_loss=0.09897, over 21780.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.3526, pruned_loss=0.1106, over 4284616.96 frames. ], batch size: 316, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:10:57,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=481134.0, ans=0.125 2023-06-19 14:10:59,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-19 14:11:20,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=481194.0, ans=0.125 2023-06-19 14:12:06,887 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2023-06-19 14:12:35,850 INFO [train.py:996] (2/4) Epoch 3, batch 19250, loss[loss=0.2092, simple_loss=0.2887, pruned_loss=0.06486, over 21346.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3511, pruned_loss=0.1045, over 4277484.71 frames. ], batch size: 194, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:12:55,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=481494.0, ans=0.0 2023-06-19 14:13:08,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=481494.0, ans=0.125 2023-06-19 14:13:22,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=481554.0, ans=0.0 2023-06-19 14:13:54,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.656e+02 2.436e+02 3.016e+02 3.592e+02 9.679e+02, threshold=6.032e+02, percent-clipped=2.0 2023-06-19 14:14:10,921 INFO [train.py:996] (2/4) Epoch 3, batch 19300, loss[loss=0.2363, simple_loss=0.2939, pruned_loss=0.08932, over 21291.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3471, pruned_loss=0.1033, over 4284363.54 frames. 
], batch size: 159, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:14:49,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=481854.0, ans=0.0 2023-06-19 14:14:55,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-19 14:15:37,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=481974.0, ans=0.1 2023-06-19 14:15:46,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=481974.0, ans=0.125 2023-06-19 14:15:54,301 INFO [train.py:996] (2/4) Epoch 3, batch 19350, loss[loss=0.1913, simple_loss=0.2663, pruned_loss=0.05817, over 21213.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3399, pruned_loss=0.0983, over 4284414.07 frames. ], batch size: 176, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:16:01,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=482034.0, ans=0.125 2023-06-19 14:16:02,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-19 14:16:29,178 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:16:33,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=482154.0, ans=0.125 2023-06-19 14:17:13,361 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.948e+02 2.752e+02 3.460e+02 4.444e+02 7.574e+02, threshold=6.920e+02, percent-clipped=6.0 2023-06-19 14:17:24,689 INFO [train.py:996] (2/4) Epoch 3, batch 19400, loss[loss=0.245, simple_loss=0.3069, pruned_loss=0.09156, over 21242.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3382, pruned_loss=0.09762, over 4284131.20 frames. ], batch size: 143, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:17:30,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=482334.0, ans=0.125 2023-06-19 14:17:37,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=482334.0, ans=0.0 2023-06-19 14:18:04,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482454.0, ans=0.1 2023-06-19 14:18:28,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=482514.0, ans=0.1 2023-06-19 14:19:05,706 INFO [train.py:996] (2/4) Epoch 3, batch 19450, loss[loss=0.2994, simple_loss=0.355, pruned_loss=0.1219, over 14797.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3346, pruned_loss=0.0988, over 4280542.12 frames. ], batch size: 60, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:19:40,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. 
limit=22.5 2023-06-19 14:20:12,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=482814.0, ans=0.125 2023-06-19 14:20:25,266 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:20:37,743 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.381e+02 3.021e+02 3.528e+02 4.324e+02 6.786e+02, threshold=7.055e+02, percent-clipped=0.0 2023-06-19 14:20:52,481 INFO [train.py:996] (2/4) Epoch 3, batch 19500, loss[loss=0.2909, simple_loss=0.3615, pruned_loss=0.1101, over 21177.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3319, pruned_loss=0.1013, over 4268110.85 frames. ], batch size: 548, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:22:34,942 INFO [train.py:996] (2/4) Epoch 3, batch 19550, loss[loss=0.271, simple_loss=0.3549, pruned_loss=0.09361, over 21505.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3265, pruned_loss=0.09907, over 4255324.24 frames. ], batch size: 471, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:22:35,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=483234.0, ans=0.1 2023-06-19 14:23:06,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=483354.0, ans=0.025 2023-06-19 14:24:06,645 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.887e+02 3.715e+02 4.750e+02 9.269e+02, threshold=7.430e+02, percent-clipped=4.0 2023-06-19 14:24:10,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=483474.0, ans=0.0 2023-06-19 14:24:16,387 INFO [train.py:996] (2/4) Epoch 3, batch 19600, loss[loss=0.2654, simple_loss=0.3186, pruned_loss=0.1061, over 21618.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3288, pruned_loss=0.1001, over 4269712.84 frames. ], batch size: 195, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:25:02,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=483654.0, ans=0.125 2023-06-19 14:25:25,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=483714.0, ans=0.0 2023-06-19 14:25:31,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=483714.0, ans=0.0 2023-06-19 14:25:58,822 INFO [train.py:996] (2/4) Epoch 3, batch 19650, loss[loss=0.3337, simple_loss=0.3792, pruned_loss=0.1441, over 21784.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3355, pruned_loss=0.1061, over 4276923.17 frames. ], batch size: 414, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:26:14,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.21 vs. limit=15.0 2023-06-19 14:26:46,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. 
limit=10.0 2023-06-19 14:27:19,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=484014.0, ans=0.125 2023-06-19 14:27:34,297 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 2.992e+02 3.430e+02 3.953e+02 7.302e+02, threshold=6.859e+02, percent-clipped=0.0 2023-06-19 14:27:41,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=484074.0, ans=0.125 2023-06-19 14:27:44,534 INFO [train.py:996] (2/4) Epoch 3, batch 19700, loss[loss=0.2655, simple_loss=0.3294, pruned_loss=0.1008, over 20184.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3388, pruned_loss=0.1065, over 4265871.07 frames. ], batch size: 707, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:28:27,730 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:28:27,866 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:29:06,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=484314.0, ans=0.2 2023-06-19 14:29:33,045 INFO [train.py:996] (2/4) Epoch 3, batch 19750, loss[loss=0.3008, simple_loss=0.3803, pruned_loss=0.1106, over 21847.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3493, pruned_loss=0.1084, over 4265768.71 frames. ], batch size: 351, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:31:05,870 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 3.206e+02 3.832e+02 4.660e+02 9.927e+02, threshold=7.664e+02, percent-clipped=2.0 2023-06-19 14:31:15,135 INFO [train.py:996] (2/4) Epoch 3, batch 19800, loss[loss=0.2385, simple_loss=0.3121, pruned_loss=0.0824, over 21808.00 frames. ], tot_loss[loss=0.284, simple_loss=0.3501, pruned_loss=0.109, over 4272885.20 frames. ], batch size: 316, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:31:48,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=484794.0, ans=0.09899494936611666 2023-06-19 14:32:25,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-19 14:32:30,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=484914.0, ans=0.125 2023-06-19 14:32:50,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=484974.0, ans=0.0 2023-06-19 14:33:03,137 INFO [train.py:996] (2/4) Epoch 3, batch 19850, loss[loss=0.2013, simple_loss=0.2764, pruned_loss=0.06309, over 21394.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3408, pruned_loss=0.1029, over 4281211.61 frames. ], batch size: 211, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:33:22,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=485034.0, ans=0.125 2023-06-19 14:33:45,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-19 14:34:12,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.58 vs. 
limit=22.5 2023-06-19 14:34:29,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.663e+02 3.192e+02 3.932e+02 5.931e+02, threshold=6.384e+02, percent-clipped=0.0 2023-06-19 14:34:30,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=485274.0, ans=0.125 2023-06-19 14:34:45,248 INFO [train.py:996] (2/4) Epoch 3, batch 19900, loss[loss=0.2425, simple_loss=0.3585, pruned_loss=0.06327, over 19605.00 frames. ], tot_loss[loss=0.2693, simple_loss=0.3401, pruned_loss=0.09921, over 4275293.52 frames. ], batch size: 702, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:34:45,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=485334.0, ans=0.2 2023-06-19 14:35:03,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=485334.0, ans=0.1 2023-06-19 14:36:01,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=485514.0, ans=0.0 2023-06-19 14:36:33,287 INFO [train.py:996] (2/4) Epoch 3, batch 19950, loss[loss=0.279, simple_loss=0.326, pruned_loss=0.1159, over 21855.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3334, pruned_loss=0.098, over 4270361.74 frames. ], batch size: 98, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:36:43,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=485634.0, ans=0.125 2023-06-19 14:37:25,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=485814.0, ans=0.0 2023-06-19 14:37:37,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=485814.0, ans=0.125 2023-06-19 14:37:48,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=485874.0, ans=0.125 2023-06-19 14:37:59,669 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.893e+02 3.575e+02 4.384e+02 6.859e+02, threshold=7.149e+02, percent-clipped=1.0 2023-06-19 14:38:02,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-19 14:38:13,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=485934.0, ans=0.125 2023-06-19 14:38:14,248 INFO [train.py:996] (2/4) Epoch 3, batch 20000, loss[loss=0.2894, simple_loss=0.3816, pruned_loss=0.0986, over 20756.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.334, pruned_loss=0.09861, over 4268646.85 frames. 
], batch size: 607, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:38:16,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=485934.0, ans=0.125 2023-06-19 14:38:27,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=485934.0, ans=0.125 2023-06-19 14:38:54,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=486054.0, ans=0.0 2023-06-19 14:39:21,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=486114.0, ans=0.0 2023-06-19 14:39:21,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=486114.0, ans=0.2 2023-06-19 14:39:35,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=486174.0, ans=0.125 2023-06-19 14:39:54,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=486234.0, ans=0.125 2023-06-19 14:39:54,965 INFO [train.py:996] (2/4) Epoch 3, batch 20050, loss[loss=0.2823, simple_loss=0.3348, pruned_loss=0.1149, over 21278.00 frames. ], tot_loss[loss=0.2707, simple_loss=0.3366, pruned_loss=0.1024, over 4276078.32 frames. ], batch size: 143, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:40:05,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-19 14:40:23,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=12.0 2023-06-19 14:40:34,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=486354.0, ans=0.125 2023-06-19 14:40:36,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.09 vs. limit=6.0 2023-06-19 14:41:28,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.834e+02 3.316e+02 3.890e+02 7.458e+02, threshold=6.631e+02, percent-clipped=1.0 2023-06-19 14:41:38,332 INFO [train.py:996] (2/4) Epoch 3, batch 20100, loss[loss=0.2994, simple_loss=0.3517, pruned_loss=0.1236, over 21333.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3395, pruned_loss=0.1057, over 4284542.73 frames. ], batch size: 176, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:41:40,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=486534.0, ans=0.125 2023-06-19 14:42:13,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=486594.0, ans=0.0 2023-06-19 14:42:35,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=486654.0, ans=0.125 2023-06-19 14:42:51,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486714.0, ans=0.1 2023-06-19 14:43:27,788 INFO [train.py:996] (2/4) Epoch 3, batch 20150, loss[loss=0.3254, simple_loss=0.3875, pruned_loss=0.1317, over 21952.00 frames. 
], tot_loss[loss=0.2861, simple_loss=0.3514, pruned_loss=0.1104, over 4287769.10 frames. ], batch size: 372, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:43:43,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=486894.0, ans=0.5 2023-06-19 14:43:55,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=486894.0, ans=0.125 2023-06-19 14:44:15,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=486954.0, ans=0.1 2023-06-19 14:44:16,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=486954.0, ans=0.125 2023-06-19 14:44:45,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=487014.0, ans=0.125 2023-06-19 14:44:45,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=487014.0, ans=0.0 2023-06-19 14:45:05,237 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.534e+02 3.281e+02 3.907e+02 5.074e+02 8.084e+02, threshold=7.814e+02, percent-clipped=7.0 2023-06-19 14:45:13,867 INFO [train.py:996] (2/4) Epoch 3, batch 20200, loss[loss=0.2874, simple_loss=0.3764, pruned_loss=0.09919, over 21716.00 frames. ], tot_loss[loss=0.2908, simple_loss=0.3554, pruned_loss=0.1131, over 4279565.83 frames. ], batch size: 298, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:46:30,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=487314.0, ans=0.125 2023-06-19 14:46:46,043 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-19 14:46:52,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=487374.0, ans=0.125 2023-06-19 14:47:01,501 INFO [train.py:996] (2/4) Epoch 3, batch 20250, loss[loss=0.2724, simple_loss=0.3354, pruned_loss=0.1046, over 21901.00 frames. ], tot_loss[loss=0.2873, simple_loss=0.3545, pruned_loss=0.11, over 4280682.83 frames. ], batch size: 124, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:48:22,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=487674.0, ans=0.0 2023-06-19 14:48:22,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=487674.0, ans=0.2 2023-06-19 14:48:29,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.865e+02 3.489e+02 4.461e+02 6.612e+02, threshold=6.978e+02, percent-clipped=0.0 2023-06-19 14:48:37,425 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:48:43,483 INFO [train.py:996] (2/4) Epoch 3, batch 20300, loss[loss=0.2292, simple_loss=0.3047, pruned_loss=0.07683, over 21470.00 frames. ], tot_loss[loss=0.2822, simple_loss=0.3515, pruned_loss=0.1065, over 4267283.12 frames. 
], batch size: 195, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:48:43,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=487734.0, ans=0.1 2023-06-19 14:49:06,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=487794.0, ans=0.125 2023-06-19 14:49:26,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=487854.0, ans=0.2 2023-06-19 14:49:30,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=487854.0, ans=0.125 2023-06-19 14:49:43,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-19 14:50:10,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=487974.0, ans=0.05 2023-06-19 14:50:24,314 INFO [train.py:996] (2/4) Epoch 3, batch 20350, loss[loss=0.2544, simple_loss=0.3063, pruned_loss=0.1013, over 19959.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3522, pruned_loss=0.1075, over 4253661.01 frames. ], batch size: 703, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:50:26,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.12 vs. limit=6.0 2023-06-19 14:50:42,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=488094.0, ans=0.2 2023-06-19 14:50:43,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=488094.0, ans=0.0 2023-06-19 14:50:47,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=488094.0, ans=0.0 2023-06-19 14:51:03,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=488154.0, ans=0.125 2023-06-19 14:51:25,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=488214.0, ans=0.2 2023-06-19 14:51:51,835 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.942e+02 3.646e+02 4.954e+02 9.108e+02, threshold=7.293e+02, percent-clipped=8.0 2023-06-19 14:52:05,485 INFO [train.py:996] (2/4) Epoch 3, batch 20400, loss[loss=0.4093, simple_loss=0.4386, pruned_loss=0.19, over 21439.00 frames. ], tot_loss[loss=0.2894, simple_loss=0.356, pruned_loss=0.1114, over 4257052.57 frames. ], batch size: 508, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:52:52,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.39 vs. limit=15.0 2023-06-19 14:53:38,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=488574.0, ans=0.125 2023-06-19 14:53:41,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=488634.0, ans=0.2 2023-06-19 14:53:42,590 INFO [train.py:996] (2/4) Epoch 3, batch 20450, loss[loss=0.3555, simple_loss=0.3908, pruned_loss=0.1601, over 21538.00 frames. ], tot_loss[loss=0.2927, simple_loss=0.3573, pruned_loss=0.1141, over 4242192.19 frames. 
], batch size: 471, lr: 1.06e-02, grad_scale: 32.0 2023-06-19 14:53:46,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=488634.0, ans=15.0 2023-06-19 14:54:33,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-19 14:55:15,704 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.996e+02 3.435e+02 4.162e+02 7.102e+02, threshold=6.869e+02, percent-clipped=0.0 2023-06-19 14:55:22,282 INFO [train.py:996] (2/4) Epoch 3, batch 20500, loss[loss=0.3082, simple_loss=0.3675, pruned_loss=0.1245, over 21455.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3522, pruned_loss=0.1141, over 4253494.46 frames. ], batch size: 131, lr: 1.06e-02, grad_scale: 16.0 2023-06-19 14:55:39,808 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 14:55:40,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=488994.0, ans=15.0 2023-06-19 14:57:02,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=489174.0, ans=0.05 2023-06-19 14:57:05,167 INFO [train.py:996] (2/4) Epoch 3, batch 20550, loss[loss=0.3014, simple_loss=0.3273, pruned_loss=0.1378, over 21406.00 frames. ], tot_loss[loss=0.286, simple_loss=0.3464, pruned_loss=0.1128, over 4252549.22 frames. ], batch size: 508, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 14:57:22,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=489294.0, ans=0.0 2023-06-19 14:57:56,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=489354.0, ans=0.125 2023-06-19 14:58:19,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=489414.0, ans=0.125 2023-06-19 14:58:39,778 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.659e+02 2.727e+02 3.175e+02 3.881e+02 7.747e+02, threshold=6.350e+02, percent-clipped=1.0 2023-06-19 14:58:45,984 INFO [train.py:996] (2/4) Epoch 3, batch 20600, loss[loss=0.2871, simple_loss=0.346, pruned_loss=0.114, over 21844.00 frames. ], tot_loss[loss=0.2834, simple_loss=0.3473, pruned_loss=0.1097, over 4242488.96 frames. ], batch size: 332, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 14:59:07,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=489594.0, ans=0.5 2023-06-19 15:00:05,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=489714.0, ans=0.0 2023-06-19 15:00:15,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=489774.0, ans=0.125 2023-06-19 15:00:26,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=489834.0, ans=0.1 2023-06-19 15:00:27,256 INFO [train.py:996] (2/4) Epoch 3, batch 20650, loss[loss=0.2377, simple_loss=0.3046, pruned_loss=0.08542, over 21683.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3425, pruned_loss=0.1102, over 4234254.26 frames. 
], batch size: 282, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:00:32,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5 2023-06-19 15:00:36,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=489834.0, ans=0.125 2023-06-19 15:01:12,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=489954.0, ans=0.0 2023-06-19 15:01:15,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=489954.0, ans=0.0 2023-06-19 15:01:53,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=490074.0, ans=0.2 2023-06-19 15:02:03,116 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.806e+02 3.239e+02 3.703e+02 6.671e+02, threshold=6.478e+02, percent-clipped=1.0 2023-06-19 15:02:05,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=490074.0, ans=0.2 2023-06-19 15:02:08,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=490074.0, ans=6.0 2023-06-19 15:02:10,367 INFO [train.py:996] (2/4) Epoch 3, batch 20700, loss[loss=0.269, simple_loss=0.3362, pruned_loss=0.1009, over 21763.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3344, pruned_loss=0.106, over 4248673.60 frames. ], batch size: 282, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:02:15,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=490134.0, ans=0.125 2023-06-19 15:02:23,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=490134.0, ans=0.2 2023-06-19 15:02:26,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=490194.0, ans=0.1 2023-06-19 15:02:30,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=490194.0, ans=0.1 2023-06-19 15:03:43,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=490374.0, ans=0.07 2023-06-19 15:03:45,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-19 15:03:50,842 INFO [train.py:996] (2/4) Epoch 3, batch 20750, loss[loss=0.2737, simple_loss=0.3668, pruned_loss=0.09035, over 21222.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3386, pruned_loss=0.1051, over 4251050.26 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:03:58,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=490434.0, ans=15.0 2023-06-19 15:04:21,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=490494.0, ans=0.0 2023-06-19 15:04:32,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.06 vs. 
limit=22.5 2023-06-19 15:04:50,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=490554.0, ans=0.125 2023-06-19 15:05:27,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.202e+02 3.808e+02 5.093e+02 1.097e+03, threshold=7.616e+02, percent-clipped=4.0 2023-06-19 15:05:33,792 INFO [train.py:996] (2/4) Epoch 3, batch 20800, loss[loss=0.2593, simple_loss=0.3088, pruned_loss=0.1049, over 21546.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3422, pruned_loss=0.1067, over 4242968.30 frames. ], batch size: 132, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:05:34,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=490734.0, ans=0.1 2023-06-19 15:05:56,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=490794.0, ans=0.0 2023-06-19 15:06:46,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=490914.0, ans=0.125 2023-06-19 15:07:05,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=490974.0, ans=0.0 2023-06-19 15:07:10,058 INFO [train.py:996] (2/4) Epoch 3, batch 20850, loss[loss=0.2484, simple_loss=0.3032, pruned_loss=0.09679, over 21639.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3343, pruned_loss=0.104, over 4241433.18 frames. ], batch size: 230, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:07:35,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=491094.0, ans=0.1 2023-06-19 15:07:41,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.36 vs. limit=6.0 2023-06-19 15:07:45,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=491094.0, ans=0.125 2023-06-19 15:08:09,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=491154.0, ans=0.2 2023-06-19 15:08:22,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=491214.0, ans=0.04949747468305833 2023-06-19 15:08:45,563 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 3.010e+02 3.845e+02 4.738e+02 1.149e+03, threshold=7.690e+02, percent-clipped=6.0 2023-06-19 15:08:56,653 INFO [train.py:996] (2/4) Epoch 3, batch 20900, loss[loss=0.2987, simple_loss=0.3537, pruned_loss=0.1218, over 21889.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3341, pruned_loss=0.105, over 4253389.31 frames. 
], batch size: 118, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:10:16,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=491574.0, ans=0.0 2023-06-19 15:10:25,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=491574.0, ans=0.0 2023-06-19 15:10:28,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=491574.0, ans=0.0 2023-06-19 15:10:31,182 INFO [train.py:996] (2/4) Epoch 3, batch 20950, loss[loss=0.2044, simple_loss=0.2809, pruned_loss=0.0639, over 21766.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3298, pruned_loss=0.1006, over 4254935.33 frames. ], batch size: 316, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:11:03,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=491694.0, ans=0.125 2023-06-19 15:11:44,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=491814.0, ans=0.125 2023-06-19 15:12:04,998 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.767e+02 2.584e+02 3.033e+02 4.070e+02 6.900e+02, threshold=6.066e+02, percent-clipped=0.0 2023-06-19 15:12:11,020 INFO [train.py:996] (2/4) Epoch 3, batch 21000, loss[loss=0.2466, simple_loss=0.3491, pruned_loss=0.07207, over 19799.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3284, pruned_loss=0.1002, over 4254544.58 frames. ], batch size: 703, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:12:11,021 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 15:12:29,328 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2787, simple_loss=0.3805, pruned_loss=0.08847, over 1796401.00 frames. 2023-06-19 15:12:29,329 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 15:12:34,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.94 vs. limit=12.0 2023-06-19 15:13:05,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=491994.0, ans=0.125 2023-06-19 15:13:23,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=492054.0, ans=0.2 2023-06-19 15:14:01,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.35 vs. limit=6.0 2023-06-19 15:14:05,461 INFO [train.py:996] (2/4) Epoch 3, batch 21050, loss[loss=0.298, simple_loss=0.3344, pruned_loss=0.1308, over 21297.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3271, pruned_loss=0.1012, over 4262922.98 frames. ], batch size: 471, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:14:07,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.96 vs. 
limit=15.0 2023-06-19 15:14:29,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=492234.0, ans=0.0 2023-06-19 15:14:41,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=492294.0, ans=0.2 2023-06-19 15:14:50,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=492294.0, ans=0.1 2023-06-19 15:15:17,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=492414.0, ans=0.0 2023-06-19 15:15:22,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=492414.0, ans=0.0 2023-06-19 15:15:22,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=492414.0, ans=0.125 2023-06-19 15:15:39,136 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.890e+02 2.805e+02 3.251e+02 3.947e+02 6.448e+02, threshold=6.502e+02, percent-clipped=2.0 2023-06-19 15:15:45,691 INFO [train.py:996] (2/4) Epoch 3, batch 21100, loss[loss=0.2682, simple_loss=0.3135, pruned_loss=0.1114, over 21627.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3229, pruned_loss=0.1004, over 4258670.10 frames. ], batch size: 416, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:16:22,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492594.0, ans=0.1 2023-06-19 15:16:29,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=492594.0, ans=0.0 2023-06-19 15:16:31,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-19 15:16:34,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=492654.0, ans=0.125 2023-06-19 15:17:06,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=492714.0, ans=0.125 2023-06-19 15:17:27,206 INFO [train.py:996] (2/4) Epoch 3, batch 21150, loss[loss=0.2442, simple_loss=0.2912, pruned_loss=0.0986, over 21444.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.319, pruned_loss=0.1006, over 4260639.26 frames. ], batch size: 212, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:17:43,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=492834.0, ans=0.1 2023-06-19 15:17:47,971 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:18:24,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=492954.0, ans=0.125 2023-06-19 15:18:43,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=22.5 2023-06-19 15:18:55,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. 
limit=15.0 2023-06-19 15:19:03,368 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 2.674e+02 3.023e+02 3.632e+02 5.729e+02, threshold=6.045e+02, percent-clipped=0.0 2023-06-19 15:19:13,132 INFO [train.py:996] (2/4) Epoch 3, batch 21200, loss[loss=0.2181, simple_loss=0.2879, pruned_loss=0.07418, over 21698.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3146, pruned_loss=0.09909, over 4262155.88 frames. ], batch size: 298, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:20:49,054 INFO [train.py:996] (2/4) Epoch 3, batch 21250, loss[loss=0.3024, simple_loss=0.3671, pruned_loss=0.1188, over 21668.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3138, pruned_loss=0.09988, over 4261838.19 frames. ], batch size: 247, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:21:13,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=493434.0, ans=0.1 2023-06-19 15:21:25,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=493494.0, ans=0.0 2023-06-19 15:21:34,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=493494.0, ans=0.125 2023-06-19 15:21:44,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.30 vs. limit=10.0 2023-06-19 15:22:03,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=493614.0, ans=0.125 2023-06-19 15:22:24,399 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 3.377e+02 3.991e+02 5.475e+02 9.358e+02, threshold=7.981e+02, percent-clipped=20.0 2023-06-19 15:22:26,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=493674.0, ans=0.125 2023-06-19 15:22:28,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=493734.0, ans=0.0 2023-06-19 15:22:29,378 INFO [train.py:996] (2/4) Epoch 3, batch 21300, loss[loss=0.2746, simple_loss=0.329, pruned_loss=0.1101, over 21323.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3216, pruned_loss=0.1025, over 4256437.24 frames. ], batch size: 159, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:22:57,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493794.0, ans=0.1 2023-06-19 15:23:01,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=22.5 2023-06-19 15:23:10,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=15.0 2023-06-19 15:23:18,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493854.0, ans=0.1 2023-06-19 15:23:19,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=493854.0, ans=0.1 2023-06-19 15:24:17,171 INFO [train.py:996] (2/4) Epoch 3, batch 21350, loss[loss=0.253, simple_loss=0.3363, pruned_loss=0.08488, over 21352.00 frames. 
], tot_loss[loss=0.2676, simple_loss=0.3279, pruned_loss=0.1037, over 4258138.12 frames. ], batch size: 548, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:25:20,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=494214.0, ans=0.0 2023-06-19 15:25:24,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=494214.0, ans=0.125 2023-06-19 15:25:30,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=494214.0, ans=0.1 2023-06-19 15:25:45,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=494274.0, ans=0.125 2023-06-19 15:25:54,672 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.815e+02 3.178e+02 3.883e+02 6.278e+02, threshold=6.357e+02, percent-clipped=0.0 2023-06-19 15:26:09,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=494334.0, ans=0.05 2023-06-19 15:26:10,359 INFO [train.py:996] (2/4) Epoch 3, batch 21400, loss[loss=0.2301, simple_loss=0.3086, pruned_loss=0.07582, over 21370.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3308, pruned_loss=0.1025, over 4264364.03 frames. ], batch size: 211, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:27:12,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=494514.0, ans=0.125 2023-06-19 15:27:15,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=494514.0, ans=0.125 2023-06-19 15:27:26,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=494574.0, ans=0.0 2023-06-19 15:27:36,082 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:27:37,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=494574.0, ans=0.125 2023-06-19 15:27:45,394 INFO [train.py:996] (2/4) Epoch 3, batch 21450, loss[loss=0.2429, simple_loss=0.3129, pruned_loss=0.08644, over 21265.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3342, pruned_loss=0.1046, over 4272935.25 frames. ], batch size: 143, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:28:31,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=494754.0, ans=0.1 2023-06-19 15:28:32,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=494754.0, ans=0.0 2023-06-19 15:28:59,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-19 15:29:21,109 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.895e+02 3.297e+02 3.892e+02 6.030e+02, threshold=6.593e+02, percent-clipped=0.0 2023-06-19 15:29:31,466 INFO [train.py:996] (2/4) Epoch 3, batch 21500, loss[loss=0.2766, simple_loss=0.3195, pruned_loss=0.1169, over 21453.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3327, pruned_loss=0.1061, over 4270743.64 frames. 
], batch size: 194, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:29:44,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=494934.0, ans=0.07 2023-06-19 15:29:57,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-19 15:30:07,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495054.0, ans=0.1 2023-06-19 15:30:11,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495054.0, ans=0.1 2023-06-19 15:30:18,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=495054.0, ans=0.0 2023-06-19 15:31:06,799 INFO [train.py:996] (2/4) Epoch 3, batch 21550, loss[loss=0.2144, simple_loss=0.272, pruned_loss=0.07847, over 21330.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3239, pruned_loss=0.1026, over 4263751.69 frames. ], batch size: 160, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:31:51,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=495354.0, ans=0.0 2023-06-19 15:32:12,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=495414.0, ans=0.1 2023-06-19 15:32:22,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=495414.0, ans=0.0 2023-06-19 15:32:48,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.022e+02 3.623e+02 4.651e+02 8.178e+02, threshold=7.247e+02, percent-clipped=4.0 2023-06-19 15:32:57,594 INFO [train.py:996] (2/4) Epoch 3, batch 21600, loss[loss=0.2583, simple_loss=0.3452, pruned_loss=0.08576, over 20768.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3184, pruned_loss=0.1006, over 4254019.03 frames. ], batch size: 607, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:33:03,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=495534.0, ans=0.0 2023-06-19 15:33:29,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.46 vs. limit=10.0 2023-06-19 15:33:35,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=495654.0, ans=0.125 2023-06-19 15:34:22,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-19 15:34:39,701 INFO [train.py:996] (2/4) Epoch 3, batch 21650, loss[loss=0.247, simple_loss=0.3387, pruned_loss=0.07767, over 21625.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3242, pruned_loss=0.09926, over 4252704.32 frames. ], batch size: 230, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:35:18,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.74 vs. 
limit=15.0 2023-06-19 15:35:36,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=496014.0, ans=0.0 2023-06-19 15:36:04,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=496074.0, ans=0.95 2023-06-19 15:36:18,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.153e+02 4.282e+02 5.472e+02 1.270e+03, threshold=8.565e+02, percent-clipped=5.0 2023-06-19 15:36:20,165 INFO [train.py:996] (2/4) Epoch 3, batch 21700, loss[loss=0.2562, simple_loss=0.3147, pruned_loss=0.09882, over 21779.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3257, pruned_loss=0.09759, over 4254331.57 frames. ], batch size: 351, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:36:42,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=496194.0, ans=0.025 2023-06-19 15:37:00,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=496254.0, ans=0.0 2023-06-19 15:37:17,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=496314.0, ans=0.1 2023-06-19 15:37:51,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=496374.0, ans=0.125 2023-06-19 15:37:54,018 INFO [train.py:996] (2/4) Epoch 3, batch 21750, loss[loss=0.2093, simple_loss=0.269, pruned_loss=0.07481, over 21265.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3197, pruned_loss=0.09642, over 4246823.35 frames. ], batch size: 144, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:38:26,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=496494.0, ans=0.04949747468305833 2023-06-19 15:39:34,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.772e+02 3.220e+02 4.013e+02 6.187e+02, threshold=6.439e+02, percent-clipped=0.0 2023-06-19 15:39:41,176 INFO [train.py:996] (2/4) Epoch 3, batch 21800, loss[loss=0.2741, simple_loss=0.3108, pruned_loss=0.1187, over 21328.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3184, pruned_loss=0.09797, over 4256873.59 frames. ], batch size: 473, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:39:56,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=496734.0, ans=0.5 2023-06-19 15:40:29,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=496854.0, ans=6.0 2023-06-19 15:40:29,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-19 15:40:32,358 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 15:40:47,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=496914.0, ans=0.125 2023-06-19 15:40:52,330 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. 
limit=15.0 2023-06-19 15:41:23,696 INFO [train.py:996] (2/4) Epoch 3, batch 21850, loss[loss=0.2621, simple_loss=0.3468, pruned_loss=0.08872, over 19835.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3243, pruned_loss=0.09872, over 4263917.48 frames. ], batch size: 702, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:41:32,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5 2023-06-19 15:43:07,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.253e+02 3.918e+02 5.054e+02 8.247e+02, threshold=7.836e+02, percent-clipped=6.0 2023-06-19 15:43:08,872 INFO [train.py:996] (2/4) Epoch 3, batch 21900, loss[loss=0.2496, simple_loss=0.3042, pruned_loss=0.09752, over 21707.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3257, pruned_loss=0.09981, over 4270407.12 frames. ], batch size: 112, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:43:22,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.51 vs. limit=10.0 2023-06-19 15:43:28,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=497394.0, ans=0.09899494936611666 2023-06-19 15:44:03,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=497514.0, ans=0.1 2023-06-19 15:44:49,961 INFO [train.py:996] (2/4) Epoch 3, batch 21950, loss[loss=0.1764, simple_loss=0.2677, pruned_loss=0.04259, over 21719.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3204, pruned_loss=0.09826, over 4267090.48 frames. ], batch size: 333, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:44:57,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=497634.0, ans=0.2 2023-06-19 15:45:04,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=497694.0, ans=0.1 2023-06-19 15:45:16,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=497694.0, ans=0.2 2023-06-19 15:46:31,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.843e+02 2.810e+02 3.300e+02 3.802e+02 6.470e+02, threshold=6.601e+02, percent-clipped=0.0 2023-06-19 15:46:32,709 INFO [train.py:996] (2/4) Epoch 3, batch 22000, loss[loss=0.255, simple_loss=0.3091, pruned_loss=0.1005, over 21725.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3147, pruned_loss=0.09592, over 4276412.46 frames. ], batch size: 112, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:46:38,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=497934.0, ans=0.125 2023-06-19 15:47:32,245 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=15.0 2023-06-19 15:47:33,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.85 vs. 
limit=6.0 2023-06-19 15:47:36,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=498114.0, ans=0.0 2023-06-19 15:48:03,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=498174.0, ans=0.2 2023-06-19 15:48:16,232 INFO [train.py:996] (2/4) Epoch 3, batch 22050, loss[loss=0.2757, simple_loss=0.3415, pruned_loss=0.1049, over 21166.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3208, pruned_loss=0.09806, over 4263332.70 frames. ], batch size: 143, lr: 1.05e-02, grad_scale: 32.0 2023-06-19 15:48:33,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=498294.0, ans=0.1 2023-06-19 15:48:41,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=498294.0, ans=0.125 2023-06-19 15:48:42,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.90 vs. limit=5.0 2023-06-19 15:49:30,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=498414.0, ans=0.0 2023-06-19 15:49:58,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 3.520e+02 4.276e+02 5.874e+02 8.679e+02, threshold=8.552e+02, percent-clipped=13.0 2023-06-19 15:49:58,810 INFO [train.py:996] (2/4) Epoch 3, batch 22100, loss[loss=0.2877, simple_loss=0.3358, pruned_loss=0.1198, over 21787.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3302, pruned_loss=0.1025, over 4265317.99 frames. ], batch size: 247, lr: 1.05e-02, grad_scale: 16.0 2023-06-19 15:50:07,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=498534.0, ans=0.125 2023-06-19 15:50:10,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=498534.0, ans=0.125 2023-06-19 15:51:38,279 INFO [train.py:996] (2/4) Epoch 3, batch 22150, loss[loss=0.2888, simple_loss=0.3556, pruned_loss=0.111, over 21313.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3324, pruned_loss=0.104, over 4272794.74 frames. ], batch size: 159, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:51:56,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=498894.0, ans=0.125 2023-06-19 15:52:37,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=499014.0, ans=0.0 2023-06-19 15:53:18,950 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.814e+02 3.451e+02 4.432e+02 8.221e+02, threshold=6.902e+02, percent-clipped=0.0 2023-06-19 15:53:18,986 INFO [train.py:996] (2/4) Epoch 3, batch 22200, loss[loss=0.2772, simple_loss=0.355, pruned_loss=0.09971, over 21322.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3356, pruned_loss=0.1061, over 4277584.87 frames. 
], batch size: 176, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:54:20,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=499314.0, ans=0.1 2023-06-19 15:54:53,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=499374.0, ans=0.0 2023-06-19 15:55:01,255 INFO [train.py:996] (2/4) Epoch 3, batch 22250, loss[loss=0.3058, simple_loss=0.3711, pruned_loss=0.1203, over 21591.00 frames. ], tot_loss[loss=0.2799, simple_loss=0.3433, pruned_loss=0.1082, over 4272882.50 frames. ], batch size: 414, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:55:15,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=499434.0, ans=0.0 2023-06-19 15:56:15,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=499614.0, ans=0.0 2023-06-19 15:56:26,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=499674.0, ans=0.025 2023-06-19 15:56:41,197 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.175e+02 3.826e+02 4.848e+02 6.426e+02, threshold=7.653e+02, percent-clipped=0.0 2023-06-19 15:56:41,227 INFO [train.py:996] (2/4) Epoch 3, batch 22300, loss[loss=0.2601, simple_loss=0.3514, pruned_loss=0.08443, over 19945.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3459, pruned_loss=0.1112, over 4267722.61 frames. ], batch size: 702, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:56:52,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=499734.0, ans=0.1 2023-06-19 15:56:58,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=499794.0, ans=0.0 2023-06-19 15:58:21,548 INFO [train.py:996] (2/4) Epoch 3, batch 22350, loss[loss=0.3172, simple_loss=0.3645, pruned_loss=0.1349, over 21692.00 frames. ], tot_loss[loss=0.2833, simple_loss=0.3438, pruned_loss=0.1114, over 4280626.93 frames. ], batch size: 473, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 15:58:30,923 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-19 15:58:35,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-19 15:58:57,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=500154.0, ans=0.125 2023-06-19 15:59:22,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. 
limit=15.0 2023-06-19 15:59:29,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=500214.0, ans=0.125 2023-06-19 15:59:50,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=500274.0, ans=0.125 2023-06-19 16:00:02,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.767e+02 3.274e+02 4.023e+02 7.731e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-19 16:00:02,230 INFO [train.py:996] (2/4) Epoch 3, batch 22400, loss[loss=0.2729, simple_loss=0.3491, pruned_loss=0.09839, over 21632.00 frames. ], tot_loss[loss=0.2768, simple_loss=0.3394, pruned_loss=0.1071, over 4282696.01 frames. ], batch size: 414, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:00:02,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=500334.0, ans=0.125 2023-06-19 16:01:02,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=500454.0, ans=0.125 2023-06-19 16:01:21,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-19 16:01:42,909 INFO [train.py:996] (2/4) Epoch 3, batch 22450, loss[loss=0.2771, simple_loss=0.3132, pruned_loss=0.1205, over 21241.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3337, pruned_loss=0.1064, over 4282899.53 frames. ], batch size: 144, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:01:48,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=500634.0, ans=0.1 2023-06-19 16:01:50,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=500634.0, ans=0.125 2023-06-19 16:01:55,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500634.0, ans=0.1 2023-06-19 16:02:34,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=500754.0, ans=0.025 2023-06-19 16:02:49,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=500754.0, ans=0.125 2023-06-19 16:03:11,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.05 vs. limit=22.5 2023-06-19 16:03:27,122 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 3.119e+02 3.860e+02 5.027e+02 1.347e+03, threshold=7.719e+02, percent-clipped=7.0 2023-06-19 16:03:27,148 INFO [train.py:996] (2/4) Epoch 3, batch 22500, loss[loss=0.2247, simple_loss=0.3003, pruned_loss=0.07453, over 21711.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3299, pruned_loss=0.1062, over 4275268.21 frames. 
], batch size: 124, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:03:27,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=500934.0, ans=0.2 2023-06-19 16:04:11,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=500994.0, ans=0.1 2023-06-19 16:04:17,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=501054.0, ans=0.2 2023-06-19 16:05:10,038 INFO [train.py:996] (2/4) Epoch 3, batch 22550, loss[loss=0.259, simple_loss=0.3169, pruned_loss=0.1006, over 21519.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3331, pruned_loss=0.106, over 4279811.79 frames. ], batch size: 194, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:05:39,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=501294.0, ans=0.125 2023-06-19 16:06:38,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=501474.0, ans=0.125 2023-06-19 16:07:06,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 3.038e+02 3.713e+02 4.829e+02 9.473e+02, threshold=7.425e+02, percent-clipped=2.0 2023-06-19 16:07:06,082 INFO [train.py:996] (2/4) Epoch 3, batch 22600, loss[loss=0.3565, simple_loss=0.422, pruned_loss=0.1455, over 21547.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3362, pruned_loss=0.1063, over 4276961.79 frames. ], batch size: 471, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:07:14,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=501534.0, ans=0.125 2023-06-19 16:07:39,536 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. limit=10.0 2023-06-19 16:07:59,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=501654.0, ans=0.04949747468305833 2023-06-19 16:08:40,627 INFO [train.py:996] (2/4) Epoch 3, batch 22650, loss[loss=0.2519, simple_loss=0.3046, pruned_loss=0.09964, over 21842.00 frames. ], tot_loss[loss=0.272, simple_loss=0.333, pruned_loss=0.1055, over 4270435.81 frames. ], batch size: 98, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:09:29,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=501954.0, ans=0.125 2023-06-19 16:10:01,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-19 16:10:23,374 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.918e+02 3.422e+02 4.341e+02 8.662e+02, threshold=6.843e+02, percent-clipped=1.0 2023-06-19 16:10:23,403 INFO [train.py:996] (2/4) Epoch 3, batch 22700, loss[loss=0.2317, simple_loss=0.289, pruned_loss=0.08725, over 21807.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3263, pruned_loss=0.1044, over 4267324.75 frames. 
], batch size: 112, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:10:32,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=502134.0, ans=0.035 2023-06-19 16:11:59,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=502374.0, ans=0.2 2023-06-19 16:12:10,406 INFO [train.py:996] (2/4) Epoch 3, batch 22750, loss[loss=0.2997, simple_loss=0.3591, pruned_loss=0.1202, over 21751.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3288, pruned_loss=0.1073, over 4268816.69 frames. ], batch size: 351, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:12:17,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=502434.0, ans=0.07 2023-06-19 16:13:16,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=502614.0, ans=0.125 2023-06-19 16:13:18,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=502614.0, ans=0.125 2023-06-19 16:13:51,344 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.464e+02 3.384e+02 3.991e+02 5.040e+02 7.219e+02, threshold=7.983e+02, percent-clipped=3.0 2023-06-19 16:13:51,380 INFO [train.py:996] (2/4) Epoch 3, batch 22800, loss[loss=0.2695, simple_loss=0.3295, pruned_loss=0.1048, over 21997.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.333, pruned_loss=0.11, over 4276067.85 frames. ], batch size: 103, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:14:40,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=502854.0, ans=0.125 2023-06-19 16:14:53,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=502914.0, ans=0.0 2023-06-19 16:15:00,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-19 16:15:11,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=502974.0, ans=0.035 2023-06-19 16:15:32,983 INFO [train.py:996] (2/4) Epoch 3, batch 22850, loss[loss=0.2572, simple_loss=0.3034, pruned_loss=0.1056, over 21777.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3284, pruned_loss=0.1087, over 4266325.15 frames. ], batch size: 124, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:15:41,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=503034.0, ans=0.2 2023-06-19 16:16:46,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=503214.0, ans=0.0 2023-06-19 16:17:16,753 INFO [train.py:996] (2/4) Epoch 3, batch 22900, loss[loss=0.3059, simple_loss=0.3932, pruned_loss=0.1093, over 20790.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3314, pruned_loss=0.1079, over 4266603.73 frames. 
], batch size: 608, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:17:18,570 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 3.187e+02 3.862e+02 4.458e+02 8.142e+02, threshold=7.724e+02, percent-clipped=1.0 2023-06-19 16:17:32,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=503334.0, ans=0.1 2023-06-19 16:18:06,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=503454.0, ans=0.0 2023-06-19 16:19:03,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=503634.0, ans=0.125 2023-06-19 16:19:03,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=503634.0, ans=0.0 2023-06-19 16:19:04,718 INFO [train.py:996] (2/4) Epoch 3, batch 22950, loss[loss=0.2808, simple_loss=0.3966, pruned_loss=0.08253, over 21315.00 frames. ], tot_loss[loss=0.279, simple_loss=0.3465, pruned_loss=0.1057, over 4273256.08 frames. ], batch size: 548, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:19:06,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=503634.0, ans=0.0 2023-06-19 16:19:15,042 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-19 16:19:42,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=503754.0, ans=0.125 2023-06-19 16:19:49,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=503754.0, ans=0.125 2023-06-19 16:19:55,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=22.5 2023-06-19 16:19:56,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=503814.0, ans=0.125 2023-06-19 16:20:45,496 INFO [train.py:996] (2/4) Epoch 3, batch 23000, loss[loss=0.2654, simple_loss=0.3206, pruned_loss=0.1051, over 21250.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3429, pruned_loss=0.1023, over 4273284.53 frames. ], batch size: 159, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:20:51,993 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.906e+02 3.294e+02 4.043e+02 6.729e+02, threshold=6.588e+02, percent-clipped=0.0 2023-06-19 16:21:25,285 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 16:22:33,421 INFO [train.py:996] (2/4) Epoch 3, batch 23050, loss[loss=0.3382, simple_loss=0.3904, pruned_loss=0.143, over 21416.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3452, pruned_loss=0.1055, over 4275752.43 frames. ], batch size: 471, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:22:58,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=504294.0, ans=0.125 2023-06-19 16:24:14,592 INFO [train.py:996] (2/4) Epoch 3, batch 23100, loss[loss=0.2336, simple_loss=0.2891, pruned_loss=0.0891, over 21539.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3407, pruned_loss=0.1059, over 4271920.81 frames. 
], batch size: 391, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:24:16,321 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.949e+02 3.465e+02 4.322e+02 6.088e+02, threshold=6.930e+02, percent-clipped=0.0 2023-06-19 16:25:15,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=504714.0, ans=0.125 2023-06-19 16:25:33,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=504774.0, ans=0.1 2023-06-19 16:25:39,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=504774.0, ans=0.2 2023-06-19 16:25:49,789 INFO [train.py:996] (2/4) Epoch 3, batch 23150, loss[loss=0.2834, simple_loss=0.3282, pruned_loss=0.1193, over 21303.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3344, pruned_loss=0.1049, over 4269922.92 frames. ], batch size: 176, lr: 1.04e-02, grad_scale: 16.0 2023-06-19 16:25:50,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=504834.0, ans=0.0 2023-06-19 16:26:20,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=504954.0, ans=0.125 2023-06-19 16:26:51,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-19 16:27:04,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=505074.0, ans=0.0 2023-06-19 16:27:24,207 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.83 vs. limit=10.0 2023-06-19 16:27:29,918 INFO [train.py:996] (2/4) Epoch 3, batch 23200, loss[loss=0.2595, simple_loss=0.3194, pruned_loss=0.09979, over 21726.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.333, pruned_loss=0.1056, over 4278810.38 frames. ], batch size: 230, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:27:31,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.188e+02 3.765e+02 4.583e+02 7.279e+02, threshold=7.530e+02, percent-clipped=1.0 2023-06-19 16:28:00,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=505254.0, ans=0.125 2023-06-19 16:28:00,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505254.0, ans=0.1 2023-06-19 16:28:12,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.64 vs. limit=15.0 2023-06-19 16:28:41,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=505314.0, ans=0.2 2023-06-19 16:28:42,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-19 16:28:57,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=505374.0, ans=0.1 2023-06-19 16:29:11,846 INFO [train.py:996] (2/4) Epoch 3, batch 23250, loss[loss=0.2748, simple_loss=0.3271, pruned_loss=0.1113, over 21876.00 frames. 
], tot_loss[loss=0.2745, simple_loss=0.3338, pruned_loss=0.1076, over 4288997.93 frames. ], batch size: 298, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:29:31,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=505494.0, ans=0.125 2023-06-19 16:29:37,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=505494.0, ans=0.1 2023-06-19 16:29:49,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=505554.0, ans=0.0 2023-06-19 16:30:51,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-19 16:30:55,050 INFO [train.py:996] (2/4) Epoch 3, batch 23300, loss[loss=0.468, simple_loss=0.5206, pruned_loss=0.2077, over 21457.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3423, pruned_loss=0.1101, over 4289741.34 frames. ], batch size: 507, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:30:56,672 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.268e+02 3.151e+02 3.585e+02 4.227e+02 7.319e+02, threshold=7.169e+02, percent-clipped=0.0 2023-06-19 16:31:25,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=505794.0, ans=0.2 2023-06-19 16:32:19,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=505914.0, ans=0.0 2023-06-19 16:32:24,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=505974.0, ans=0.025 2023-06-19 16:32:38,664 INFO [train.py:996] (2/4) Epoch 3, batch 23350, loss[loss=0.3085, simple_loss=0.3868, pruned_loss=0.1151, over 21624.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3456, pruned_loss=0.1084, over 4278933.04 frames. ], batch size: 414, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:32:54,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=506034.0, ans=0.95 2023-06-19 16:32:57,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=15.0 2023-06-19 16:33:26,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.13 vs. limit=6.0 2023-06-19 16:33:30,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=506154.0, ans=0.125 2023-06-19 16:33:38,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=506154.0, ans=0.95 2023-06-19 16:33:48,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=506214.0, ans=0.2 2023-06-19 16:34:21,322 INFO [train.py:996] (2/4) Epoch 3, batch 23400, loss[loss=0.2438, simple_loss=0.3043, pruned_loss=0.0916, over 21148.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3371, pruned_loss=0.1031, over 4274356.28 frames. 
], batch size: 607, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:34:22,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.677e+02 2.587e+02 3.042e+02 3.768e+02 6.854e+02, threshold=6.085e+02, percent-clipped=0.0 2023-06-19 16:35:32,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=506514.0, ans=0.0 2023-06-19 16:35:43,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=506514.0, ans=0.0 2023-06-19 16:36:07,916 INFO [train.py:996] (2/4) Epoch 3, batch 23450, loss[loss=0.2974, simple_loss=0.358, pruned_loss=0.1184, over 21897.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3387, pruned_loss=0.1061, over 4279664.37 frames. ], batch size: 334, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:36:15,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=506634.0, ans=0.125 2023-06-19 16:36:28,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=506694.0, ans=0.1 2023-06-19 16:37:02,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=506754.0, ans=0.2 2023-06-19 16:37:49,344 INFO [train.py:996] (2/4) Epoch 3, batch 23500, loss[loss=0.2609, simple_loss=0.3207, pruned_loss=0.1005, over 21566.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3393, pruned_loss=0.1084, over 4276757.36 frames. ], batch size: 212, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:37:50,958 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.271e+02 4.126e+02 5.318e+02 8.868e+02, threshold=8.252e+02, percent-clipped=14.0 2023-06-19 16:38:09,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=506994.0, ans=0.1 2023-06-19 16:38:17,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=506994.0, ans=0.125 2023-06-19 16:38:48,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=507054.0, ans=0.0 2023-06-19 16:38:49,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.67 vs. limit=10.0 2023-06-19 16:39:31,005 INFO [train.py:996] (2/4) Epoch 3, batch 23550, loss[loss=0.2485, simple_loss=0.2995, pruned_loss=0.09875, over 21625.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.335, pruned_loss=0.107, over 4264288.01 frames. ], batch size: 247, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:41:17,637 INFO [train.py:996] (2/4) Epoch 3, batch 23600, loss[loss=0.295, simple_loss=0.3642, pruned_loss=0.1129, over 21784.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3368, pruned_loss=0.1078, over 4263577.86 frames. ], batch size: 118, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:41:18,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=10.46 vs. 
limit=15.0 2023-06-19 16:41:19,225 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 3.135e+02 3.693e+02 4.651e+02 9.053e+02, threshold=7.385e+02, percent-clipped=1.0 2023-06-19 16:41:28,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=507534.0, ans=0.0 2023-06-19 16:41:47,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=507594.0, ans=0.125 2023-06-19 16:42:06,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=507654.0, ans=0.2 2023-06-19 16:42:06,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.10 vs. limit=12.0 2023-06-19 16:42:09,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=507654.0, ans=0.125 2023-06-19 16:42:59,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=507834.0, ans=0.125 2023-06-19 16:43:00,388 INFO [train.py:996] (2/4) Epoch 3, batch 23650, loss[loss=0.368, simple_loss=0.416, pruned_loss=0.1601, over 21351.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3353, pruned_loss=0.1053, over 4267957.80 frames. ], batch size: 507, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:43:01,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=507834.0, ans=15.0 2023-06-19 16:43:12,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=507834.0, ans=0.05 2023-06-19 16:43:26,528 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.01 vs. limit=8.0 2023-06-19 16:43:43,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-19 16:43:52,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=22.5 2023-06-19 16:43:53,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=507954.0, ans=10.0 2023-06-19 16:44:14,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=508014.0, ans=0.125 2023-06-19 16:44:48,418 INFO [train.py:996] (2/4) Epoch 3, batch 23700, loss[loss=0.2402, simple_loss=0.3067, pruned_loss=0.08687, over 21404.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.338, pruned_loss=0.1042, over 4265909.64 frames. 
], batch size: 211, lr: 1.04e-02, grad_scale: 32.0 2023-06-19 16:44:49,949 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.106e+02 2.801e+02 3.226e+02 4.051e+02 6.982e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-19 16:45:31,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=508254.0, ans=0.0 2023-06-19 16:46:12,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=508374.0, ans=0.0 2023-06-19 16:46:36,727 INFO [train.py:996] (2/4) Epoch 3, batch 23750, loss[loss=0.2641, simple_loss=0.3533, pruned_loss=0.08745, over 21701.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3407, pruned_loss=0.1054, over 4274522.56 frames. ], batch size: 351, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:47:05,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=508494.0, ans=0.1 2023-06-19 16:47:26,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=508554.0, ans=0.0 2023-06-19 16:47:43,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=508614.0, ans=0.0 2023-06-19 16:48:20,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.40 vs. limit=10.0 2023-06-19 16:48:21,125 INFO [train.py:996] (2/4) Epoch 3, batch 23800, loss[loss=0.2648, simple_loss=0.317, pruned_loss=0.1063, over 21746.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3369, pruned_loss=0.102, over 4265811.01 frames. ], batch size: 112, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:48:21,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=508734.0, ans=0.0 2023-06-19 16:48:21,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=508734.0, ans=0.2 2023-06-19 16:48:22,761 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.690e+02 3.256e+02 4.075e+02 6.648e+02, threshold=6.511e+02, percent-clipped=1.0 2023-06-19 16:48:28,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=508734.0, ans=0.1 2023-06-19 16:48:33,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=508734.0, ans=0.125 2023-06-19 16:48:34,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-19 16:48:58,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=508854.0, ans=0.2 2023-06-19 16:49:14,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=508854.0, ans=0.125 2023-06-19 16:49:58,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=508974.0, ans=0.125 2023-06-19 16:50:05,862 INFO [train.py:996] (2/4) Epoch 3, batch 23850, loss[loss=0.3202, simple_loss=0.3798, pruned_loss=0.1304, over 21860.00 frames. 
], tot_loss[loss=0.2821, simple_loss=0.3505, pruned_loss=0.1069, over 4265736.11 frames. ], batch size: 371, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:50:25,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=509034.0, ans=0.125 2023-06-19 16:50:26,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=509094.0, ans=0.125 2023-06-19 16:50:33,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=509094.0, ans=0.125 2023-06-19 16:50:40,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=509094.0, ans=15.0 2023-06-19 16:51:48,169 INFO [train.py:996] (2/4) Epoch 3, batch 23900, loss[loss=0.2907, simple_loss=0.3547, pruned_loss=0.1133, over 21858.00 frames. ], tot_loss[loss=0.2878, simple_loss=0.3577, pruned_loss=0.1089, over 4271075.67 frames. ], batch size: 98, lr: 1.03e-02, grad_scale: 16.0 2023-06-19 16:51:51,150 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 3.185e+02 4.046e+02 5.288e+02 1.128e+03, threshold=8.092e+02, percent-clipped=13.0 2023-06-19 16:52:16,796 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-19 16:52:32,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=509454.0, ans=0.0 2023-06-19 16:52:47,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=509454.0, ans=0.0 2023-06-19 16:52:51,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=509454.0, ans=0.1 2023-06-19 16:53:00,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=509514.0, ans=0.0 2023-06-19 16:53:05,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=509514.0, ans=0.125 2023-06-19 16:53:28,842 INFO [train.py:996] (2/4) Epoch 3, batch 23950, loss[loss=0.2662, simple_loss=0.3264, pruned_loss=0.103, over 21246.00 frames. ], tot_loss[loss=0.2831, simple_loss=0.3496, pruned_loss=0.1083, over 4268998.86 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 16.0 2023-06-19 16:54:29,255 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-19 16:54:43,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=509814.0, ans=0.0 2023-06-19 16:54:48,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=509814.0, ans=0.04949747468305833 2023-06-19 16:55:15,657 INFO [train.py:996] (2/4) Epoch 3, batch 24000, loss[loss=0.318, simple_loss=0.374, pruned_loss=0.1309, over 21286.00 frames. ], tot_loss[loss=0.2885, simple_loss=0.3522, pruned_loss=0.1124, over 4269634.71 frames. 
], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:55:15,658 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 16:55:31,893 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2855, simple_loss=0.3833, pruned_loss=0.09389, over 1796401.00 frames. 2023-06-19 16:55:31,894 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 16:55:35,244 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.049e+02 3.553e+02 4.728e+02 8.625e+02, threshold=7.107e+02, percent-clipped=2.0 2023-06-19 16:56:29,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=510054.0, ans=0.0 2023-06-19 16:57:10,308 INFO [train.py:996] (2/4) Epoch 3, batch 24050, loss[loss=0.2862, simple_loss=0.3577, pruned_loss=0.1073, over 21638.00 frames. ], tot_loss[loss=0.2895, simple_loss=0.353, pruned_loss=0.113, over 4272239.72 frames. ], batch size: 414, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:57:34,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=510234.0, ans=0.04949747468305833 2023-06-19 16:57:38,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=15.0 2023-06-19 16:57:40,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=510294.0, ans=0.0 2023-06-19 16:58:21,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=510414.0, ans=0.0 2023-06-19 16:58:39,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.50 vs. limit=22.5 2023-06-19 16:58:53,209 INFO [train.py:996] (2/4) Epoch 3, batch 24100, loss[loss=0.2909, simple_loss=0.3342, pruned_loss=0.1238, over 20111.00 frames. ], tot_loss[loss=0.2863, simple_loss=0.3519, pruned_loss=0.1104, over 4269352.14 frames. ], batch size: 702, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 16:58:56,273 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.988e+02 3.709e+02 5.089e+02 1.009e+03, threshold=7.417e+02, percent-clipped=9.0 2023-06-19 16:59:43,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=22.5 2023-06-19 17:00:28,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=510774.0, ans=0.0 2023-06-19 17:00:30,542 INFO [train.py:996] (2/4) Epoch 3, batch 24150, loss[loss=0.313, simple_loss=0.3654, pruned_loss=0.1303, over 21696.00 frames. ], tot_loss[loss=0.2891, simple_loss=0.3524, pruned_loss=0.1129, over 4275553.36 frames. ], batch size: 389, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:01:34,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=511014.0, ans=0.0 2023-06-19 17:01:38,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.89 vs. 
limit=8.0 2023-06-19 17:02:11,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=511074.0, ans=0.0 2023-06-19 17:02:13,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.98 vs. limit=6.0 2023-06-19 17:02:14,047 INFO [train.py:996] (2/4) Epoch 3, batch 24200, loss[loss=0.4015, simple_loss=0.4529, pruned_loss=0.175, over 21502.00 frames. ], tot_loss[loss=0.2904, simple_loss=0.3532, pruned_loss=0.1138, over 4271509.44 frames. ], batch size: 508, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:02:15,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-19 17:02:17,193 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.540e+02 3.206e+02 3.739e+02 4.662e+02 8.285e+02, threshold=7.479e+02, percent-clipped=1.0 2023-06-19 17:02:45,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=511194.0, ans=0.125 2023-06-19 17:02:54,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=511254.0, ans=0.0 2023-06-19 17:03:46,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=511374.0, ans=0.125 2023-06-19 17:03:52,825 INFO [train.py:996] (2/4) Epoch 3, batch 24250, loss[loss=0.2215, simple_loss=0.3178, pruned_loss=0.06259, over 21671.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3478, pruned_loss=0.1049, over 4275267.44 frames. ], batch size: 263, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:05:21,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=511674.0, ans=0.125 2023-06-19 17:05:33,912 INFO [train.py:996] (2/4) Epoch 3, batch 24300, loss[loss=0.1696, simple_loss=0.261, pruned_loss=0.03914, over 21739.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3375, pruned_loss=0.09743, over 4274904.30 frames. ], batch size: 316, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:05:37,138 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.288e+02 2.786e+02 3.535e+02 7.213e+02, threshold=5.572e+02, percent-clipped=0.0 2023-06-19 17:06:07,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-19 17:06:16,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=511854.0, ans=0.04949747468305833 2023-06-19 17:06:19,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-19 17:06:55,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=511914.0, ans=0.1 2023-06-19 17:07:16,707 INFO [train.py:996] (2/4) Epoch 3, batch 24350, loss[loss=0.3152, simple_loss=0.3716, pruned_loss=0.1294, over 21844.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3343, pruned_loss=0.09814, over 4274771.52 frames. 
], batch size: 332, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:08:35,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=512214.0, ans=0.2 2023-06-19 17:09:06,347 INFO [train.py:996] (2/4) Epoch 3, batch 24400, loss[loss=0.2847, simple_loss=0.3546, pruned_loss=0.1074, over 21752.00 frames. ], tot_loss[loss=0.2742, simple_loss=0.3416, pruned_loss=0.1034, over 4274942.50 frames. ], batch size: 247, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:09:09,704 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.816e+02 3.270e+02 4.131e+02 5.260e+02 7.879e+02, threshold=8.262e+02, percent-clipped=18.0 2023-06-19 17:10:03,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=512454.0, ans=0.05 2023-06-19 17:10:32,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-19 17:10:38,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.64 vs. limit=6.0 2023-06-19 17:10:48,984 INFO [train.py:996] (2/4) Epoch 3, batch 24450, loss[loss=0.2641, simple_loss=0.3203, pruned_loss=0.1039, over 21155.00 frames. ], tot_loss[loss=0.2792, simple_loss=0.347, pruned_loss=0.1057, over 4268594.43 frames. ], batch size: 143, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:11:01,040 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:11:42,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512754.0, ans=0.1 2023-06-19 17:12:21,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.19 vs. limit=12.0 2023-06-19 17:12:30,307 INFO [train.py:996] (2/4) Epoch 3, batch 24500, loss[loss=0.2784, simple_loss=0.338, pruned_loss=0.1094, over 21896.00 frames. ], tot_loss[loss=0.2782, simple_loss=0.346, pruned_loss=0.1052, over 4273362.10 frames. ], batch size: 107, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:12:33,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.872e+02 3.383e+02 4.151e+02 6.413e+02, threshold=6.766e+02, percent-clipped=0.0 2023-06-19 17:12:34,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=512934.0, ans=0.1 2023-06-19 17:13:20,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=513054.0, ans=0.0 2023-06-19 17:13:25,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=513054.0, ans=0.0 2023-06-19 17:13:34,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=513114.0, ans=0.125 2023-06-19 17:13:36,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.99 vs. limit=15.0 2023-06-19 17:14:12,267 INFO [train.py:996] (2/4) Epoch 3, batch 24550, loss[loss=0.3195, simple_loss=0.3807, pruned_loss=0.1291, over 21855.00 frames. 
], tot_loss[loss=0.2835, simple_loss=0.3493, pruned_loss=0.1089, over 4271436.24 frames. ], batch size: 124, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:14:57,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=513294.0, ans=0.1 2023-06-19 17:15:10,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=513354.0, ans=0.1 2023-06-19 17:15:20,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=513414.0, ans=0.0 2023-06-19 17:15:54,318 INFO [train.py:996] (2/4) Epoch 3, batch 24600, loss[loss=0.2616, simple_loss=0.3181, pruned_loss=0.1026, over 21743.00 frames. ], tot_loss[loss=0.2811, simple_loss=0.3439, pruned_loss=0.1091, over 4272694.90 frames. ], batch size: 282, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:15:57,346 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.888e+02 3.572e+02 4.375e+02 7.058e+02, threshold=7.144e+02, percent-clipped=1.0 2023-06-19 17:15:59,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=513534.0, ans=0.125 2023-06-19 17:17:03,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=513714.0, ans=0.125 2023-06-19 17:17:21,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=513774.0, ans=0.125 2023-06-19 17:17:35,826 INFO [train.py:996] (2/4) Epoch 3, batch 24650, loss[loss=0.2652, simple_loss=0.3114, pruned_loss=0.1095, over 21560.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3359, pruned_loss=0.1075, over 4275804.50 frames. ], batch size: 442, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:17:38,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=513834.0, ans=0.2 2023-06-19 17:18:11,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=513894.0, ans=0.125 2023-06-19 17:18:21,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=513954.0, ans=0.1 2023-06-19 17:18:37,330 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-06-19 17:19:02,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=514074.0, ans=0.0 2023-06-19 17:19:13,027 INFO [train.py:996] (2/4) Epoch 3, batch 24700, loss[loss=0.262, simple_loss=0.323, pruned_loss=0.1005, over 21485.00 frames. ], tot_loss[loss=0.273, simple_loss=0.3353, pruned_loss=0.1053, over 4268066.03 frames. ], batch size: 389, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:19:15,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.44 vs. 
limit=22.5 2023-06-19 17:19:16,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.157e+02 3.618e+02 4.336e+02 6.867e+02, threshold=7.236e+02, percent-clipped=0.0 2023-06-19 17:19:34,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=514134.0, ans=0.125 2023-06-19 17:19:52,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514194.0, ans=0.1 2023-06-19 17:20:38,311 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-19 17:20:44,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=514374.0, ans=0.125 2023-06-19 17:20:46,158 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:20:50,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514374.0, ans=0.1 2023-06-19 17:20:55,379 INFO [train.py:996] (2/4) Epoch 3, batch 24750, loss[loss=0.2624, simple_loss=0.3132, pruned_loss=0.1058, over 21510.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.327, pruned_loss=0.1016, over 4261747.25 frames. ], batch size: 441, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:22:00,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=514614.0, ans=0.0 2023-06-19 17:22:36,164 INFO [train.py:996] (2/4) Epoch 3, batch 24800, loss[loss=0.2837, simple_loss=0.3449, pruned_loss=0.1113, over 21863.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3214, pruned_loss=0.1013, over 4262329.66 frames. ], batch size: 118, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:22:39,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.753e+02 3.144e+02 3.669e+02 5.851e+02, threshold=6.289e+02, percent-clipped=0.0 2023-06-19 17:22:43,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=514734.0, ans=0.125 2023-06-19 17:23:53,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=514914.0, ans=0.1 2023-06-19 17:24:08,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=514974.0, ans=0.0 2023-06-19 17:24:19,436 INFO [train.py:996] (2/4) Epoch 3, batch 24850, loss[loss=0.2333, simple_loss=0.2825, pruned_loss=0.09201, over 21321.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3226, pruned_loss=0.103, over 4267502.78 frames. ], batch size: 176, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:24:24,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.86 vs. limit=6.0 2023-06-19 17:24:27,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-19 17:24:48,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.83 vs. 
limit=15.0 2023-06-19 17:26:02,449 INFO [train.py:996] (2/4) Epoch 3, batch 24900, loss[loss=0.3055, simple_loss=0.3579, pruned_loss=0.1265, over 21937.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3262, pruned_loss=0.1038, over 4272755.19 frames. ], batch size: 316, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:26:11,126 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.873e+02 3.627e+02 4.468e+02 7.935e+02, threshold=7.253e+02, percent-clipped=5.0 2023-06-19 17:26:12,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-19 17:26:23,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=515334.0, ans=0.125 2023-06-19 17:26:24,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-19 17:26:51,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=515454.0, ans=0.0 2023-06-19 17:27:09,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=12.0 2023-06-19 17:27:20,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=515514.0, ans=0.125 2023-06-19 17:27:31,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-19 17:27:57,502 INFO [train.py:996] (2/4) Epoch 3, batch 24950, loss[loss=0.2955, simple_loss=0.3638, pruned_loss=0.1136, over 21224.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3343, pruned_loss=0.1083, over 4269086.93 frames. ], batch size: 143, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:28:20,062 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-19 17:28:23,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.77 vs. limit=10.0 2023-06-19 17:28:27,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-19 17:28:57,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=515814.0, ans=0.2 2023-06-19 17:28:59,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=515814.0, ans=0.2 2023-06-19 17:29:27,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=515874.0, ans=0.125 2023-06-19 17:29:46,116 INFO [train.py:996] (2/4) Epoch 3, batch 25000, loss[loss=0.2678, simple_loss=0.3285, pruned_loss=0.1036, over 21863.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3412, pruned_loss=0.1103, over 4271228.54 frames. 
], batch size: 118, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:29:49,481 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.951e+02 3.694e+02 4.326e+02 9.045e+02, threshold=7.388e+02, percent-clipped=1.0 2023-06-19 17:29:51,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=515934.0, ans=0.125 2023-06-19 17:29:59,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=515934.0, ans=0.125 2023-06-19 17:31:28,967 INFO [train.py:996] (2/4) Epoch 3, batch 25050, loss[loss=0.2582, simple_loss=0.3055, pruned_loss=0.1054, over 21317.00 frames. ], tot_loss[loss=0.2753, simple_loss=0.3337, pruned_loss=0.1084, over 4271411.23 frames. ], batch size: 160, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:32:20,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=516414.0, ans=0.125 2023-06-19 17:32:54,184 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:33:11,005 INFO [train.py:996] (2/4) Epoch 3, batch 25100, loss[loss=0.2281, simple_loss=0.2869, pruned_loss=0.08464, over 21655.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3264, pruned_loss=0.1062, over 4270046.82 frames. ], batch size: 282, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:33:13,889 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.044e+02 3.071e+02 3.524e+02 4.196e+02 8.233e+02, threshold=7.049e+02, percent-clipped=3.0 2023-06-19 17:34:47,473 INFO [train.py:996] (2/4) Epoch 3, batch 25150, loss[loss=0.2504, simple_loss=0.3316, pruned_loss=0.08462, over 21449.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3313, pruned_loss=0.1039, over 4274767.31 frames. ], batch size: 211, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:34:54,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=516834.0, ans=0.125 2023-06-19 17:36:29,180 INFO [train.py:996] (2/4) Epoch 3, batch 25200, loss[loss=0.2297, simple_loss=0.3125, pruned_loss=0.07343, over 21562.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3317, pruned_loss=0.1018, over 4267580.79 frames. ], batch size: 230, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:36:32,446 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.799e+02 2.662e+02 3.153e+02 4.538e+02 8.599e+02, threshold=6.306e+02, percent-clipped=6.0 2023-06-19 17:36:58,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.47 vs. limit=15.0 2023-06-19 17:37:05,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=517254.0, ans=0.125 2023-06-19 17:37:59,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=517374.0, ans=0.0 2023-06-19 17:38:10,298 INFO [train.py:996] (2/4) Epoch 3, batch 25250, loss[loss=0.2229, simple_loss=0.2836, pruned_loss=0.08111, over 21655.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3296, pruned_loss=0.09955, over 4273339.20 frames. 
], batch size: 282, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:39:04,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-06-19 17:39:40,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=517674.0, ans=0.0 2023-06-19 17:39:56,965 INFO [train.py:996] (2/4) Epoch 3, batch 25300, loss[loss=0.2438, simple_loss=0.3256, pruned_loss=0.08097, over 21608.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3281, pruned_loss=0.09973, over 4275255.48 frames. ], batch size: 414, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:39:57,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=517734.0, ans=0.1 2023-06-19 17:40:00,377 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.138e+02 3.706e+02 4.437e+02 8.805e+02, threshold=7.413e+02, percent-clipped=6.0 2023-06-19 17:40:20,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=517794.0, ans=0.125 2023-06-19 17:41:40,350 INFO [train.py:996] (2/4) Epoch 3, batch 25350, loss[loss=0.2315, simple_loss=0.3009, pruned_loss=0.08105, over 21382.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3311, pruned_loss=0.1, over 4265352.38 frames. ], batch size: 131, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:42:02,581 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=15.0 2023-06-19 17:42:33,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=22.5 2023-06-19 17:42:44,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=518214.0, ans=0.125 2023-06-19 17:43:21,817 INFO [train.py:996] (2/4) Epoch 3, batch 25400, loss[loss=0.2868, simple_loss=0.3341, pruned_loss=0.1197, over 21226.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3253, pruned_loss=0.0987, over 4271212.38 frames. ], batch size: 159, lr: 1.03e-02, grad_scale: 32.0 2023-06-19 17:43:24,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.818e+02 3.409e+02 4.580e+02 8.063e+02, threshold=6.817e+02, percent-clipped=2.0 2023-06-19 17:43:38,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2023-06-19 17:44:30,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-19 17:44:58,238 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 17:45:02,528 INFO [train.py:996] (2/4) Epoch 3, batch 25450, loss[loss=0.2863, simple_loss=0.3715, pruned_loss=0.1005, over 21492.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3267, pruned_loss=0.1019, over 4259531.90 frames. 
], batch size: 471, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:45:58,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=518814.0, ans=0.0 2023-06-19 17:46:38,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=518874.0, ans=0.1 2023-06-19 17:46:46,177 INFO [train.py:996] (2/4) Epoch 3, batch 25500, loss[loss=0.2879, simple_loss=0.354, pruned_loss=0.1108, over 16559.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3268, pruned_loss=0.09804, over 4241412.16 frames. ], batch size: 60, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:46:47,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.48 vs. limit=15.0 2023-06-19 17:46:49,361 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.719e+02 2.642e+02 3.063e+02 3.580e+02 7.751e+02, threshold=6.127e+02, percent-clipped=1.0 2023-06-19 17:47:02,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=518994.0, ans=0.125 2023-06-19 17:48:30,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=519234.0, ans=0.05 2023-06-19 17:48:31,152 INFO [train.py:996] (2/4) Epoch 3, batch 25550, loss[loss=0.2711, simple_loss=0.3574, pruned_loss=0.09233, over 21719.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3342, pruned_loss=0.09834, over 4251979.58 frames. ], batch size: 332, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:48:53,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=519294.0, ans=0.125 2023-06-19 17:48:54,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=519294.0, ans=0.09899494936611666 2023-06-19 17:49:59,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=519474.0, ans=0.125 2023-06-19 17:50:01,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=519474.0, ans=0.0 2023-06-19 17:50:06,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=519474.0, ans=0.125 2023-06-19 17:50:20,664 INFO [train.py:996] (2/4) Epoch 3, batch 25600, loss[loss=0.3106, simple_loss=0.3592, pruned_loss=0.131, over 20133.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3378, pruned_loss=0.09913, over 4256078.09 frames. 
], batch size: 707, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:50:23,741 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.720e+02 3.211e+02 3.853e+02 6.629e+02, threshold=6.421e+02, percent-clipped=1.0 2023-06-19 17:50:54,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=519654.0, ans=0.1 2023-06-19 17:51:26,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=519714.0, ans=0.0 2023-06-19 17:51:50,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=519774.0, ans=0.125 2023-06-19 17:51:57,688 INFO [train.py:996] (2/4) Epoch 3, batch 25650, loss[loss=0.2527, simple_loss=0.3071, pruned_loss=0.09913, over 21345.00 frames. ], tot_loss[loss=0.2736, simple_loss=0.3394, pruned_loss=0.1039, over 4257454.19 frames. ], batch size: 211, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:52:31,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=519894.0, ans=0.0 2023-06-19 17:52:32,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=519894.0, ans=0.125 2023-06-19 17:53:20,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-19 17:53:34,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-06-19 17:53:35,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=520074.0, ans=0.2 2023-06-19 17:53:39,100 INFO [train.py:996] (2/4) Epoch 3, batch 25700, loss[loss=0.248, simple_loss=0.3283, pruned_loss=0.08384, over 21885.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3372, pruned_loss=0.1052, over 4253600.43 frames. ], batch size: 316, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:53:46,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.093e+02 3.779e+02 4.609e+02 9.934e+02, threshold=7.559e+02, percent-clipped=6.0 2023-06-19 17:54:07,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=520194.0, ans=0.125 2023-06-19 17:54:17,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=520254.0, ans=0.1 2023-06-19 17:55:02,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=520314.0, ans=0.1 2023-06-19 17:55:07,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=520374.0, ans=0.125 2023-06-19 17:55:28,424 INFO [train.py:996] (2/4) Epoch 3, batch 25750, loss[loss=0.2451, simple_loss=0.3167, pruned_loss=0.08677, over 20735.00 frames. ], tot_loss[loss=0.2788, simple_loss=0.3422, pruned_loss=0.1077, over 4259489.45 frames. 
], batch size: 607, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:55:43,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=520434.0, ans=0.1 2023-06-19 17:56:19,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=520554.0, ans=0.0 2023-06-19 17:56:27,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=520554.0, ans=0.125 2023-06-19 17:56:50,098 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-19 17:57:07,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=520674.0, ans=0.125 2023-06-19 17:57:15,538 INFO [train.py:996] (2/4) Epoch 3, batch 25800, loss[loss=0.3712, simple_loss=0.4218, pruned_loss=0.1603, over 21407.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3548, pruned_loss=0.1127, over 4261327.11 frames. ], batch size: 471, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 17:57:25,029 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.639e+02 4.483e+02 6.036e+02 1.254e+03, threshold=8.967e+02, percent-clipped=11.0 2023-06-19 17:57:42,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=520794.0, ans=0.125 2023-06-19 17:58:05,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=520854.0, ans=0.0 2023-06-19 17:58:15,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.27 vs. limit=10.0 2023-06-19 17:58:27,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.97 vs. limit=15.0 2023-06-19 17:58:53,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=520974.0, ans=0.0 2023-06-19 17:59:06,178 INFO [train.py:996] (2/4) Epoch 3, batch 25850, loss[loss=0.2722, simple_loss=0.3416, pruned_loss=0.1014, over 21866.00 frames. ], tot_loss[loss=0.2901, simple_loss=0.3558, pruned_loss=0.1122, over 4265237.50 frames. ], batch size: 414, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:00:11,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=521214.0, ans=0.125 2023-06-19 18:00:56,498 INFO [train.py:996] (2/4) Epoch 3, batch 25900, loss[loss=0.2773, simple_loss=0.35, pruned_loss=0.1023, over 21194.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3566, pruned_loss=0.1124, over 4269365.37 frames. 
], batch size: 143, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:01:01,410 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.484e+02 3.101e+02 3.467e+02 4.368e+02 8.294e+02, threshold=6.933e+02, percent-clipped=0.0 2023-06-19 18:01:29,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=521394.0, ans=0.2 2023-06-19 18:01:34,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=521454.0, ans=0.125 2023-06-19 18:01:47,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=521454.0, ans=0.0 2023-06-19 18:01:49,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=15.0 2023-06-19 18:02:29,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-19 18:02:39,550 INFO [train.py:996] (2/4) Epoch 3, batch 25950, loss[loss=0.3303, simple_loss=0.3952, pruned_loss=0.1327, over 21516.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3626, pruned_loss=0.1157, over 4265686.75 frames. ], batch size: 131, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:03:17,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=12.0 2023-06-19 18:03:20,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=521754.0, ans=0.125 2023-06-19 18:03:21,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=521754.0, ans=0.0 2023-06-19 18:03:59,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=521814.0, ans=0.0 2023-06-19 18:04:02,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-19 18:04:07,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=521874.0, ans=0.2 2023-06-19 18:04:24,062 INFO [train.py:996] (2/4) Epoch 3, batch 26000, loss[loss=0.3564, simple_loss=0.4115, pruned_loss=0.1506, over 21750.00 frames. ], tot_loss[loss=0.297, simple_loss=0.3648, pruned_loss=0.1146, over 4258942.53 frames. ], batch size: 441, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:04:35,627 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.072e+02 3.699e+02 4.692e+02 7.013e+02, threshold=7.398e+02, percent-clipped=1.0 2023-06-19 18:04:42,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.63 vs. limit=6.0 2023-06-19 18:04:43,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=521994.0, ans=0.125 2023-06-19 18:05:52,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=522174.0, ans=0.0 2023-06-19 18:06:06,965 INFO [train.py:996] (2/4) Epoch 3, batch 26050, loss[loss=0.3044, simple_loss=0.3494, pruned_loss=0.1298, over 21366.00 frames. 
], tot_loss[loss=0.2971, simple_loss=0.364, pruned_loss=0.1151, over 4260699.81 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:06:40,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=522294.0, ans=0.0 2023-06-19 18:07:32,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=522474.0, ans=0.2 2023-06-19 18:07:34,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=15.0 2023-06-19 18:07:49,576 INFO [train.py:996] (2/4) Epoch 3, batch 26100, loss[loss=0.2595, simple_loss=0.3107, pruned_loss=0.1042, over 21591.00 frames. ], tot_loss[loss=0.2928, simple_loss=0.3571, pruned_loss=0.1142, over 4272359.85 frames. ], batch size: 548, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:08:01,017 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.956e+02 3.379e+02 4.537e+02 7.018e+02, threshold=6.758e+02, percent-clipped=0.0 2023-06-19 18:08:50,859 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.33 vs. limit=6.0 2023-06-19 18:08:56,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=522654.0, ans=0.125 2023-06-19 18:09:20,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=522774.0, ans=0.0 2023-06-19 18:09:23,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-19 18:09:39,650 INFO [train.py:996] (2/4) Epoch 3, batch 26150, loss[loss=0.3039, simple_loss=0.3592, pruned_loss=0.1243, over 21666.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3534, pruned_loss=0.1144, over 4278440.84 frames. ], batch size: 230, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:11:11,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=523074.0, ans=0.2 2023-06-19 18:11:18,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=523074.0, ans=0.125 2023-06-19 18:11:23,940 INFO [train.py:996] (2/4) Epoch 3, batch 26200, loss[loss=0.2512, simple_loss=0.338, pruned_loss=0.08219, over 21286.00 frames. ], tot_loss[loss=0.2903, simple_loss=0.3552, pruned_loss=0.1127, over 4279226.66 frames. 
], batch size: 159, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:11:28,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=523134.0, ans=0.2 2023-06-19 18:11:30,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 3.095e+02 3.569e+02 4.232e+02 6.752e+02, threshold=7.138e+02, percent-clipped=0.0 2023-06-19 18:12:07,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=523254.0, ans=0.125 2023-06-19 18:12:16,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=523254.0, ans=0.125 2023-06-19 18:12:25,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=523254.0, ans=0.025 2023-06-19 18:12:38,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=523314.0, ans=0.125 2023-06-19 18:13:06,889 INFO [train.py:996] (2/4) Epoch 3, batch 26250, loss[loss=0.2887, simple_loss=0.35, pruned_loss=0.1137, over 21431.00 frames. ], tot_loss[loss=0.2911, simple_loss=0.3594, pruned_loss=0.1115, over 4282178.07 frames. ], batch size: 211, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:13:22,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=523494.0, ans=0.125 2023-06-19 18:13:22,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=523494.0, ans=0.2 2023-06-19 18:13:58,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=523554.0, ans=0.05 2023-06-19 18:14:44,574 INFO [train.py:996] (2/4) Epoch 3, batch 26300, loss[loss=0.238, simple_loss=0.3023, pruned_loss=0.08687, over 20170.00 frames. ], tot_loss[loss=0.2884, simple_loss=0.3547, pruned_loss=0.1111, over 4289530.48 frames. ], batch size: 703, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:14:48,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=523734.0, ans=0.2 2023-06-19 18:14:51,302 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 3.134e+02 3.781e+02 4.659e+02 7.680e+02, threshold=7.563e+02, percent-clipped=3.0 2023-06-19 18:14:51,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=523734.0, ans=0.125 2023-06-19 18:15:06,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=523734.0, ans=0.125 2023-06-19 18:15:15,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=523794.0, ans=0.125 2023-06-19 18:16:21,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-19 18:16:39,682 INFO [train.py:996] (2/4) Epoch 3, batch 26350, loss[loss=0.288, simple_loss=0.3486, pruned_loss=0.1137, over 21788.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3514, pruned_loss=0.1108, over 4287741.31 frames. 
], batch size: 332, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:18:13,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=524274.0, ans=0.125 2023-06-19 18:18:16,305 INFO [train.py:996] (2/4) Epoch 3, batch 26400, loss[loss=0.2511, simple_loss=0.2942, pruned_loss=0.104, over 21159.00 frames. ], tot_loss[loss=0.2839, simple_loss=0.3456, pruned_loss=0.1111, over 4283356.37 frames. ], batch size: 143, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:18:28,550 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.857e+02 3.384e+02 4.347e+02 8.285e+02, threshold=6.769e+02, percent-clipped=0.0 2023-06-19 18:19:21,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=524514.0, ans=0.125 2023-06-19 18:19:42,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-19 18:20:12,407 INFO [train.py:996] (2/4) Epoch 3, batch 26450, loss[loss=0.2598, simple_loss=0.3624, pruned_loss=0.07855, over 19776.00 frames. ], tot_loss[loss=0.2846, simple_loss=0.346, pruned_loss=0.1116, over 4273577.72 frames. ], batch size: 702, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:20:38,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-19 18:21:11,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=524814.0, ans=0.0 2023-06-19 18:21:42,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=524874.0, ans=0.0 2023-06-19 18:21:56,807 INFO [train.py:996] (2/4) Epoch 3, batch 26500, loss[loss=0.1896, simple_loss=0.2431, pruned_loss=0.06808, over 21816.00 frames. ], tot_loss[loss=0.2818, simple_loss=0.3455, pruned_loss=0.109, over 4273559.47 frames. ], batch size: 107, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:22:04,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.354e+02 4.139e+02 5.566e+02 7.518e+02, threshold=8.277e+02, percent-clipped=7.0 2023-06-19 18:23:04,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525114.0, ans=0.1 2023-06-19 18:23:26,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=525174.0, ans=0.125 2023-06-19 18:23:42,742 INFO [train.py:996] (2/4) Epoch 3, batch 26550, loss[loss=0.2358, simple_loss=0.3155, pruned_loss=0.07806, over 21667.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.342, pruned_loss=0.1049, over 4273182.00 frames. ], batch size: 298, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:24:24,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=525294.0, ans=0.125 2023-06-19 18:25:30,103 INFO [train.py:996] (2/4) Epoch 3, batch 26600, loss[loss=0.2672, simple_loss=0.3211, pruned_loss=0.1066, over 21736.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3406, pruned_loss=0.1024, over 4265396.90 frames. 
], batch size: 112, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:25:38,690 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 3.044e+02 3.646e+02 4.264e+02 8.431e+02, threshold=7.292e+02, percent-clipped=1.0 2023-06-19 18:26:17,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=525654.0, ans=0.125 2023-06-19 18:26:22,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=525654.0, ans=0.1 2023-06-19 18:26:37,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=525714.0, ans=0.125 2023-06-19 18:27:13,255 INFO [train.py:996] (2/4) Epoch 3, batch 26650, loss[loss=0.2078, simple_loss=0.2932, pruned_loss=0.06121, over 21621.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.335, pruned_loss=0.1015, over 4263575.18 frames. ], batch size: 391, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:27:21,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=525834.0, ans=0.0 2023-06-19 18:28:16,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=526014.0, ans=0.07 2023-06-19 18:28:53,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=526134.0, ans=0.0 2023-06-19 18:28:55,368 INFO [train.py:996] (2/4) Epoch 3, batch 26700, loss[loss=0.2488, simple_loss=0.3161, pruned_loss=0.09078, over 21428.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3281, pruned_loss=0.09778, over 4267083.41 frames. ], batch size: 131, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:29:03,453 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.469e+02 2.681e+02 3.249e+02 4.280e+02 9.861e+02, threshold=6.499e+02, percent-clipped=1.0 2023-06-19 18:29:29,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=526194.0, ans=0.2 2023-06-19 18:29:31,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=526194.0, ans=0.125 2023-06-19 18:29:48,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=526254.0, ans=0.125 2023-06-19 18:29:51,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=526254.0, ans=0.125 2023-06-19 18:29:52,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=526254.0, ans=0.0 2023-06-19 18:30:07,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=526314.0, ans=0.125 2023-06-19 18:30:08,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-19 18:30:38,058 INFO [train.py:996] (2/4) Epoch 3, batch 26750, loss[loss=0.2306, simple_loss=0.3085, pruned_loss=0.07638, over 21465.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3272, pruned_loss=0.09558, over 4278647.78 frames. 
], batch size: 194, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:31:19,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=526494.0, ans=0.125 2023-06-19 18:31:29,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=526554.0, ans=0.2 2023-06-19 18:32:09,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=526674.0, ans=0.0 2023-06-19 18:32:35,852 INFO [train.py:996] (2/4) Epoch 3, batch 26800, loss[loss=0.4095, simple_loss=0.4354, pruned_loss=0.1918, over 21327.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3369, pruned_loss=0.1017, over 4275522.55 frames. ], batch size: 507, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:32:49,015 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 3.066e+02 3.643e+02 4.361e+02 8.068e+02, threshold=7.286e+02, percent-clipped=5.0 2023-06-19 18:33:11,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=526794.0, ans=0.125 2023-06-19 18:33:34,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=526914.0, ans=0.0 2023-06-19 18:33:49,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=526914.0, ans=0.125 2023-06-19 18:34:23,792 INFO [train.py:996] (2/4) Epoch 3, batch 26850, loss[loss=0.2409, simple_loss=0.2999, pruned_loss=0.0909, over 21718.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3378, pruned_loss=0.1044, over 4279782.01 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:34:26,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=527034.0, ans=15.0 2023-06-19 18:34:32,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=527034.0, ans=0.0 2023-06-19 18:34:32,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=527034.0, ans=0.2 2023-06-19 18:34:33,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=527034.0, ans=0.125 2023-06-19 18:34:35,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=527034.0, ans=0.125 2023-06-19 18:35:12,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.44 vs. limit=15.0 2023-06-19 18:35:26,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=527214.0, ans=0.125 2023-06-19 18:35:58,872 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.32 vs. limit=15.0 2023-06-19 18:36:05,754 INFO [train.py:996] (2/4) Epoch 3, batch 26900, loss[loss=0.2626, simple_loss=0.3114, pruned_loss=0.1069, over 21713.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3304, pruned_loss=0.1046, over 4267677.49 frames. 
], batch size: 124, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:36:14,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.946e+02 3.321e+02 4.106e+02 6.345e+02, threshold=6.642e+02, percent-clipped=0.0 2023-06-19 18:36:31,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=527394.0, ans=0.125 2023-06-19 18:37:16,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=527514.0, ans=0.125 2023-06-19 18:37:35,799 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-19 18:37:38,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=527574.0, ans=0.0 2023-06-19 18:37:49,035 INFO [train.py:996] (2/4) Epoch 3, batch 26950, loss[loss=0.3032, simple_loss=0.3827, pruned_loss=0.1118, over 21773.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3309, pruned_loss=0.1048, over 4272324.96 frames. ], batch size: 351, lr: 1.02e-02, grad_scale: 32.0 2023-06-19 18:37:49,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=527634.0, ans=0.0 2023-06-19 18:37:56,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-19 18:39:00,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=527814.0, ans=0.2 2023-06-19 18:39:32,581 INFO [train.py:996] (2/4) Epoch 3, batch 27000, loss[loss=0.2242, simple_loss=0.307, pruned_loss=0.07074, over 21779.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3298, pruned_loss=0.1017, over 4260851.33 frames. ], batch size: 282, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:39:32,582 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 18:39:49,118 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.2602, simple_loss=0.3579, pruned_loss=0.0813, over 1796401.00 frames. 2023-06-19 18:39:49,119 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 18:39:53,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=527934.0, ans=0.1 2023-06-19 18:39:59,041 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.939e+02 3.560e+02 4.603e+02 8.017e+02, threshold=7.120e+02, percent-clipped=5.0 2023-06-19 18:40:07,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=527934.0, ans=0.0 2023-06-19 18:40:28,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. 
limit=6.0 2023-06-19 18:40:34,370 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 18:40:57,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=528114.0, ans=0.125 2023-06-19 18:41:21,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=528174.0, ans=0.0 2023-06-19 18:41:33,649 INFO [train.py:996] (2/4) Epoch 3, batch 27050, loss[loss=0.2592, simple_loss=0.3261, pruned_loss=0.09612, over 21336.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3318, pruned_loss=0.09788, over 4272102.78 frames. ], batch size: 159, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:41:57,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=15.0 2023-06-19 18:42:12,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=528354.0, ans=0.125 2023-06-19 18:42:16,164 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-19 18:43:16,440 INFO [train.py:996] (2/4) Epoch 3, batch 27100, loss[loss=0.3402, simple_loss=0.4361, pruned_loss=0.1222, over 19817.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3354, pruned_loss=0.1008, over 4283252.22 frames. ], batch size: 702, lr: 1.02e-02, grad_scale: 16.0 2023-06-19 18:43:31,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.773e+02 3.198e+02 4.013e+02 8.418e+02, threshold=6.395e+02, percent-clipped=2.0 2023-06-19 18:43:35,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=528534.0, ans=0.125 2023-06-19 18:44:34,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=528714.0, ans=0.0 2023-06-19 18:44:59,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-19 18:45:00,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.22 vs. limit=15.0 2023-06-19 18:45:00,951 INFO [train.py:996] (2/4) Epoch 3, batch 27150, loss[loss=0.359, simple_loss=0.4313, pruned_loss=0.1434, over 21694.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3476, pruned_loss=0.1047, over 4283939.15 frames. ], batch size: 441, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:45:43,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=528954.0, ans=0.09899494936611666 2023-06-19 18:45:55,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.38 vs. 
limit=15.0 2023-06-19 18:46:29,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=529074.0, ans=0.0 2023-06-19 18:46:43,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=529074.0, ans=0.125 2023-06-19 18:46:48,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=529134.0, ans=0.0 2023-06-19 18:46:49,421 INFO [train.py:996] (2/4) Epoch 3, batch 27200, loss[loss=0.285, simple_loss=0.352, pruned_loss=0.109, over 21742.00 frames. ], tot_loss[loss=0.2856, simple_loss=0.356, pruned_loss=0.1076, over 4277704.49 frames. ], batch size: 247, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:46:59,300 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.363e+02 3.936e+02 4.684e+02 8.685e+02, threshold=7.872e+02, percent-clipped=10.0 2023-06-19 18:47:21,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=529194.0, ans=0.1 2023-06-19 18:47:24,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=22.5 2023-06-19 18:47:40,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=529254.0, ans=0.0 2023-06-19 18:47:50,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=529254.0, ans=0.2 2023-06-19 18:48:15,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=529374.0, ans=0.2 2023-06-19 18:48:33,238 INFO [train.py:996] (2/4) Epoch 3, batch 27250, loss[loss=0.3048, simple_loss=0.3673, pruned_loss=0.1211, over 21566.00 frames. ], tot_loss[loss=0.2923, simple_loss=0.3592, pruned_loss=0.1127, over 4275823.36 frames. ], batch size: 389, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:49:28,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=529554.0, ans=0.05 2023-06-19 18:49:32,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=529554.0, ans=0.0 2023-06-19 18:50:11,414 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-19 18:50:28,583 INFO [train.py:996] (2/4) Epoch 3, batch 27300, loss[loss=0.3325, simple_loss=0.3996, pruned_loss=0.1327, over 21740.00 frames. ], tot_loss[loss=0.2943, simple_loss=0.3608, pruned_loss=0.1139, over 4276717.44 frames. 
], batch size: 441, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:50:32,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=529734.0, ans=0.07 2023-06-19 18:50:33,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=529734.0, ans=0.125 2023-06-19 18:50:43,360 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.140e+02 3.530e+02 4.339e+02 7.752e+02, threshold=7.060e+02, percent-clipped=0.0 2023-06-19 18:50:57,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=529794.0, ans=0.2 2023-06-19 18:51:54,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=529974.0, ans=0.125 2023-06-19 18:52:03,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.15 vs. limit=6.0 2023-06-19 18:52:17,062 INFO [train.py:996] (2/4) Epoch 3, batch 27350, loss[loss=0.2633, simple_loss=0.3378, pruned_loss=0.09444, over 21796.00 frames. ], tot_loss[loss=0.2963, simple_loss=0.3627, pruned_loss=0.115, over 4279399.60 frames. ], batch size: 298, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 18:53:41,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=530274.0, ans=0.125 2023-06-19 18:53:58,608 INFO [train.py:996] (2/4) Epoch 3, batch 27400, loss[loss=0.273, simple_loss=0.3294, pruned_loss=0.1083, over 21186.00 frames. ], tot_loss[loss=0.2921, simple_loss=0.3571, pruned_loss=0.1135, over 4284864.79 frames. ], batch size: 608, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:54:09,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.048e+02 3.444e+02 4.008e+02 7.916e+02, threshold=6.888e+02, percent-clipped=1.0 2023-06-19 18:54:21,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=530394.0, ans=0.125 2023-06-19 18:54:21,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=530394.0, ans=0.125 2023-06-19 18:54:46,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=530454.0, ans=0.1 2023-06-19 18:54:47,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=530454.0, ans=0.2 2023-06-19 18:55:04,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=530514.0, ans=0.0 2023-06-19 18:55:09,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=530514.0, ans=0.1 2023-06-19 18:55:39,672 INFO [train.py:996] (2/4) Epoch 3, batch 27450, loss[loss=0.2841, simple_loss=0.3567, pruned_loss=0.1058, over 21670.00 frames. ], tot_loss[loss=0.2864, simple_loss=0.3507, pruned_loss=0.1111, over 4283200.63 frames. 
], batch size: 247, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:56:11,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=530754.0, ans=0.1 2023-06-19 18:56:26,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-19 18:56:27,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=530754.0, ans=0.0 2023-06-19 18:56:42,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=530814.0, ans=0.2 2023-06-19 18:57:21,098 INFO [train.py:996] (2/4) Epoch 3, batch 27500, loss[loss=0.2657, simple_loss=0.3289, pruned_loss=0.1012, over 21886.00 frames. ], tot_loss[loss=0.2881, simple_loss=0.3511, pruned_loss=0.1126, over 4283614.62 frames. ], batch size: 351, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:57:24,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=530934.0, ans=0.125 2023-06-19 18:57:24,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=530934.0, ans=0.05 2023-06-19 18:57:32,519 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 3.096e+02 3.719e+02 4.715e+02 7.955e+02, threshold=7.439e+02, percent-clipped=2.0 2023-06-19 18:57:32,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=530934.0, ans=0.2 2023-06-19 18:57:47,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=530994.0, ans=0.04949747468305833 2023-06-19 18:58:01,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-19 18:58:02,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-19 18:58:51,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=531174.0, ans=0.125 2023-06-19 18:58:54,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.05 vs. limit=15.0 2023-06-19 18:59:01,872 INFO [train.py:996] (2/4) Epoch 3, batch 27550, loss[loss=0.2491, simple_loss=0.3056, pruned_loss=0.0963, over 21659.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.3449, pruned_loss=0.108, over 4293866.86 frames. ], batch size: 298, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 18:59:09,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=531234.0, ans=0.125 2023-06-19 18:59:51,138 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:00:09,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. 
limit=15.0 2023-06-19 19:00:26,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=531474.0, ans=0.125 2023-06-19 19:00:42,116 INFO [train.py:996] (2/4) Epoch 3, batch 27600, loss[loss=0.2447, simple_loss=0.3054, pruned_loss=0.09194, over 22005.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3386, pruned_loss=0.1067, over 4289151.66 frames. ], batch size: 103, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:00:53,351 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.652e+02 3.386e+02 4.273e+02 7.001e+02, threshold=6.773e+02, percent-clipped=0.0 2023-06-19 19:00:55,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=531534.0, ans=0.2 2023-06-19 19:01:55,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-19 19:02:23,393 INFO [train.py:996] (2/4) Epoch 3, batch 27650, loss[loss=0.2618, simple_loss=0.3171, pruned_loss=0.1032, over 21605.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3332, pruned_loss=0.1058, over 4277979.37 frames. ], batch size: 389, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:02:25,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=531834.0, ans=0.0 2023-06-19 19:02:28,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=531834.0, ans=0.125 2023-06-19 19:02:37,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.20 vs. limit=15.0 2023-06-19 19:02:41,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=531894.0, ans=0.2 2023-06-19 19:02:45,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=531894.0, ans=0.125 2023-06-19 19:02:47,065 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-19 19:03:11,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=531954.0, ans=0.125 2023-06-19 19:03:27,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=532014.0, ans=0.1 2023-06-19 19:03:33,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=532014.0, ans=0.2 2023-06-19 19:03:55,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=532074.0, ans=0.1 2023-06-19 19:04:05,908 INFO [train.py:996] (2/4) Epoch 3, batch 27700, loss[loss=0.215, simple_loss=0.2922, pruned_loss=0.06895, over 21440.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3331, pruned_loss=0.1035, over 4277432.28 frames. 
], batch size: 194, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:04:13,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=532134.0, ans=10.0 2023-06-19 19:04:17,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 2.929e+02 3.310e+02 4.348e+02 7.080e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-19 19:04:31,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=12.0 2023-06-19 19:05:08,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=532314.0, ans=0.125 2023-06-19 19:05:08,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=532314.0, ans=0.125 2023-06-19 19:05:47,587 INFO [train.py:996] (2/4) Epoch 3, batch 27750, loss[loss=0.2646, simple_loss=0.3567, pruned_loss=0.08626, over 20809.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3362, pruned_loss=0.1035, over 4272065.37 frames. ], batch size: 608, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:05:52,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-19 19:06:06,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-19 19:06:20,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=12.0 2023-06-19 19:06:36,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=532554.0, ans=0.0 2023-06-19 19:06:42,033 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=15.0 2023-06-19 19:07:25,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-19 19:07:29,263 INFO [train.py:996] (2/4) Epoch 3, batch 27800, loss[loss=0.2605, simple_loss=0.3132, pruned_loss=0.1039, over 21465.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3351, pruned_loss=0.1043, over 4278522.40 frames. 
], batch size: 159, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:07:40,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.741e+02 3.231e+02 4.040e+02 7.271e+02, threshold=6.461e+02, percent-clipped=1.0 2023-06-19 19:07:40,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=532734.0, ans=0.1 2023-06-19 19:07:51,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=532794.0, ans=0.125 2023-06-19 19:07:51,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=532794.0, ans=0.2 2023-06-19 19:08:06,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=532854.0, ans=0.125 2023-06-19 19:08:12,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=532854.0, ans=0.0 2023-06-19 19:08:27,254 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:08:50,207 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.78 vs. limit=15.0 2023-06-19 19:09:12,031 INFO [train.py:996] (2/4) Epoch 3, batch 27850, loss[loss=0.2646, simple_loss=0.3219, pruned_loss=0.1036, over 21965.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.334, pruned_loss=0.1055, over 4290354.10 frames. ], batch size: 333, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:10:42,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=533274.0, ans=0.125 2023-06-19 19:10:48,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=533274.0, ans=0.125 2023-06-19 19:10:58,478 INFO [train.py:996] (2/4) Epoch 3, batch 27900, loss[loss=0.2956, simple_loss=0.385, pruned_loss=0.1031, over 21806.00 frames. ], tot_loss[loss=0.278, simple_loss=0.3438, pruned_loss=0.1061, over 4288878.80 frames. ], batch size: 316, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:10:59,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=533334.0, ans=0.125 2023-06-19 19:11:16,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.014e+02 3.653e+02 4.966e+02 8.433e+02, threshold=7.306e+02, percent-clipped=7.0 2023-06-19 19:11:24,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=533394.0, ans=0.125 2023-06-19 19:11:38,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=533394.0, ans=0.0 2023-06-19 19:11:55,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=533454.0, ans=0.035 2023-06-19 19:12:47,271 INFO [train.py:996] (2/4) Epoch 3, batch 27950, loss[loss=0.3253, simple_loss=0.3942, pruned_loss=0.1282, over 21488.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3418, pruned_loss=0.101, over 4289371.16 frames. 
], batch size: 471, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:13:42,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0 2023-06-19 19:13:50,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-19 19:13:55,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=533814.0, ans=10.0 2023-06-19 19:14:01,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=533814.0, ans=0.2 2023-06-19 19:14:06,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=533814.0, ans=0.125 2023-06-19 19:14:30,711 INFO [train.py:996] (2/4) Epoch 3, batch 28000, loss[loss=0.2947, simple_loss=0.346, pruned_loss=0.1217, over 21882.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3388, pruned_loss=0.09868, over 4286811.86 frames. ], batch size: 107, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:14:38,846 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=15.0 2023-06-19 19:14:48,833 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.017e+02 2.713e+02 3.305e+02 4.071e+02 8.310e+02, threshold=6.609e+02, percent-clipped=2.0 2023-06-19 19:15:35,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=534114.0, ans=0.2 2023-06-19 19:16:08,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=534174.0, ans=0.0 2023-06-19 19:16:20,846 INFO [train.py:996] (2/4) Epoch 3, batch 28050, loss[loss=0.2925, simple_loss=0.3355, pruned_loss=0.1247, over 21181.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3365, pruned_loss=0.1003, over 4286305.88 frames. ], batch size: 607, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:16:34,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=534234.0, ans=0.125 2023-06-19 19:17:03,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=534354.0, ans=0.2 2023-06-19 19:17:16,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=534354.0, ans=0.1 2023-06-19 19:17:42,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=534474.0, ans=0.0 2023-06-19 19:18:03,144 INFO [train.py:996] (2/4) Epoch 3, batch 28100, loss[loss=0.2468, simple_loss=0.2932, pruned_loss=0.1002, over 21424.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3337, pruned_loss=0.1001, over 4284466.75 frames. ], batch size: 212, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:18:21,039 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.997e+02 3.623e+02 4.325e+02 7.130e+02, threshold=7.246e+02, percent-clipped=1.0 2023-06-19 19:18:44,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. 
limit=15.0 2023-06-19 19:19:18,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-06-19 19:19:44,130 INFO [train.py:996] (2/4) Epoch 3, batch 28150, loss[loss=0.2511, simple_loss=0.3062, pruned_loss=0.09797, over 21674.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3277, pruned_loss=0.1009, over 4281599.04 frames. ], batch size: 333, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:20:27,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=534954.0, ans=0.1 2023-06-19 19:20:45,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535014.0, ans=0.1 2023-06-19 19:21:27,393 INFO [train.py:996] (2/4) Epoch 3, batch 28200, loss[loss=0.2945, simple_loss=0.3541, pruned_loss=0.1175, over 21573.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3267, pruned_loss=0.1026, over 4273342.36 frames. ], batch size: 389, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:21:29,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=535134.0, ans=0.5 2023-06-19 19:21:50,058 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.171e+02 3.933e+02 4.825e+02 1.002e+03, threshold=7.866e+02, percent-clipped=3.0 2023-06-19 19:22:01,818 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:22:43,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=535314.0, ans=0.125 2023-06-19 19:23:19,297 INFO [train.py:996] (2/4) Epoch 3, batch 28250, loss[loss=0.2455, simple_loss=0.3001, pruned_loss=0.09547, over 21392.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3302, pruned_loss=0.1059, over 4278491.03 frames. ], batch size: 211, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:23:33,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=535434.0, ans=0.125 2023-06-19 19:23:39,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=535494.0, ans=0.0 2023-06-19 19:24:31,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=535614.0, ans=0.125 2023-06-19 19:24:38,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=535674.0, ans=0.125 2023-06-19 19:25:00,436 INFO [train.py:996] (2/4) Epoch 3, batch 28300, loss[loss=0.2101, simple_loss=0.2872, pruned_loss=0.0665, over 21387.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3277, pruned_loss=0.1031, over 4276423.59 frames. 
], batch size: 211, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:25:01,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=535734.0, ans=0.0 2023-06-19 19:25:13,879 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 2.857e+02 3.337e+02 4.140e+02 8.167e+02, threshold=6.674e+02, percent-clipped=3.0 2023-06-19 19:25:14,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=535734.0, ans=0.09899494936611666 2023-06-19 19:25:41,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=535854.0, ans=0.1 2023-06-19 19:25:42,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=535854.0, ans=0.0 2023-06-19 19:25:42,573 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:26:08,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=535914.0, ans=0.125 2023-06-19 19:26:39,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=535974.0, ans=0.125 2023-06-19 19:26:43,648 INFO [train.py:996] (2/4) Epoch 3, batch 28350, loss[loss=0.2615, simple_loss=0.3183, pruned_loss=0.1024, over 21304.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3238, pruned_loss=0.09619, over 4267145.93 frames. ], batch size: 471, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:27:00,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=536094.0, ans=0.1 2023-06-19 19:27:23,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=536154.0, ans=15.0 2023-06-19 19:27:42,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=22.5 2023-06-19 19:27:47,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=536214.0, ans=0.125 2023-06-19 19:28:25,750 INFO [train.py:996] (2/4) Epoch 3, batch 28400, loss[loss=0.2846, simple_loss=0.3356, pruned_loss=0.1168, over 21458.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3212, pruned_loss=0.09591, over 4258312.20 frames. ], batch size: 389, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:28:38,769 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. 
limit=15.0 2023-06-19 19:28:44,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.835e+02 2.716e+02 3.452e+02 4.220e+02 6.740e+02, threshold=6.905e+02, percent-clipped=2.0 2023-06-19 19:28:44,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=536334.0, ans=0.2 2023-06-19 19:29:24,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=536514.0, ans=0.0 2023-06-19 19:29:26,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=536514.0, ans=0.125 2023-06-19 19:30:03,716 INFO [train.py:996] (2/4) Epoch 3, batch 28450, loss[loss=0.2836, simple_loss=0.3343, pruned_loss=0.1164, over 21612.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3279, pruned_loss=0.1019, over 4265036.88 frames. ], batch size: 263, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:30:09,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=536634.0, ans=0.125 2023-06-19 19:30:52,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=536754.0, ans=0.0 2023-06-19 19:30:55,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.30 vs. limit=15.0 2023-06-19 19:31:42,104 INFO [train.py:996] (2/4) Epoch 3, batch 28500, loss[loss=0.2888, simple_loss=0.3534, pruned_loss=0.1121, over 21893.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3312, pruned_loss=0.1045, over 4274185.86 frames. ], batch size: 371, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:32:01,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.952e+02 3.630e+02 4.610e+02 9.107e+02, threshold=7.260e+02, percent-clipped=2.0 2023-06-19 19:33:06,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=537114.0, ans=0.125 2023-06-19 19:33:08,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-19 19:33:30,123 INFO [train.py:996] (2/4) Epoch 3, batch 28550, loss[loss=0.2623, simple_loss=0.3285, pruned_loss=0.09807, over 19954.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3392, pruned_loss=0.1073, over 4275469.43 frames. ], batch size: 702, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:33:51,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=537294.0, ans=0.0 2023-06-19 19:34:21,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=537354.0, ans=0.125 2023-06-19 19:35:14,043 INFO [train.py:996] (2/4) Epoch 3, batch 28600, loss[loss=0.2863, simple_loss=0.3499, pruned_loss=0.1113, over 21614.00 frames. ], tot_loss[loss=0.2852, simple_loss=0.3481, pruned_loss=0.1111, over 4281556.06 frames. 
], batch size: 230, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:35:21,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=537534.0, ans=0.0 2023-06-19 19:35:38,659 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.290e+02 3.072e+02 3.686e+02 4.724e+02 8.342e+02, threshold=7.372e+02, percent-clipped=3.0 2023-06-19 19:36:04,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-19 19:36:21,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=537714.0, ans=0.0 2023-06-19 19:36:27,833 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:36:38,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=537774.0, ans=15.0 2023-06-19 19:36:56,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=537774.0, ans=0.1 2023-06-19 19:36:59,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=537834.0, ans=0.125 2023-06-19 19:37:00,667 INFO [train.py:996] (2/4) Epoch 3, batch 28650, loss[loss=0.2763, simple_loss=0.3212, pruned_loss=0.1157, over 21654.00 frames. ], tot_loss[loss=0.282, simple_loss=0.343, pruned_loss=0.1105, over 4280728.61 frames. ], batch size: 333, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:37:04,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-19 19:37:07,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=537834.0, ans=0.035 2023-06-19 19:37:41,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=537954.0, ans=0.125 2023-06-19 19:38:02,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=538014.0, ans=0.0 2023-06-19 19:38:02,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=538014.0, ans=0.0 2023-06-19 19:38:12,218 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:38:21,872 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:38:39,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=538074.0, ans=0.125 2023-06-19 19:38:41,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.55 vs. limit=10.0 2023-06-19 19:38:42,161 INFO [train.py:996] (2/4) Epoch 3, batch 28700, loss[loss=0.2868, simple_loss=0.3433, pruned_loss=0.1151, over 21495.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3431, pruned_loss=0.1119, over 4273253.40 frames. 
], batch size: 194, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:38:42,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=538134.0, ans=0.125 2023-06-19 19:38:49,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-19 19:38:57,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=538134.0, ans=0.1 2023-06-19 19:39:01,909 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.971e+02 3.318e+02 4.254e+02 6.959e+02, threshold=6.637e+02, percent-clipped=0.0 2023-06-19 19:39:14,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=538194.0, ans=0.125 2023-06-19 19:39:24,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=538254.0, ans=15.0 2023-06-19 19:39:30,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=538254.0, ans=0.0 2023-06-19 19:39:34,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-19 19:39:37,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=538254.0, ans=0.0 2023-06-19 19:39:38,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=538254.0, ans=0.0 2023-06-19 19:39:58,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=538314.0, ans=0.0 2023-06-19 19:40:18,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=538374.0, ans=0.02 2023-06-19 19:40:24,362 INFO [train.py:996] (2/4) Epoch 3, batch 28750, loss[loss=0.2593, simple_loss=0.323, pruned_loss=0.09782, over 21915.00 frames. ], tot_loss[loss=0.2836, simple_loss=0.3433, pruned_loss=0.112, over 4271863.24 frames. ], batch size: 333, lr: 1.01e-02, grad_scale: 16.0 2023-06-19 19:40:31,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. 
limit=12.0 2023-06-19 19:40:38,187 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:40:41,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=538434.0, ans=0.125 2023-06-19 19:40:42,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=538434.0, ans=0.0 2023-06-19 19:40:55,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=538494.0, ans=10.0 2023-06-19 19:41:26,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=538554.0, ans=0.0 2023-06-19 19:41:28,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=538614.0, ans=0.0 2023-06-19 19:41:47,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=538614.0, ans=0.0 2023-06-19 19:41:48,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=538614.0, ans=0.125 2023-06-19 19:41:51,742 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:42:12,016 INFO [train.py:996] (2/4) Epoch 3, batch 28800, loss[loss=0.3432, simple_loss=0.3976, pruned_loss=0.1444, over 21763.00 frames. ], tot_loss[loss=0.2854, simple_loss=0.3463, pruned_loss=0.1123, over 4276795.33 frames. ], batch size: 124, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:42:27,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=538734.0, ans=0.125 2023-06-19 19:42:31,961 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.000e+02 3.700e+02 5.247e+02 1.056e+03, threshold=7.400e+02, percent-clipped=15.0 2023-06-19 19:42:47,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=538794.0, ans=0.125 2023-06-19 19:43:29,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=538914.0, ans=0.125 2023-06-19 19:43:30,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=538914.0, ans=0.125 2023-06-19 19:43:59,901 INFO [train.py:996] (2/4) Epoch 3, batch 28850, loss[loss=0.3134, simple_loss=0.351, pruned_loss=0.1379, over 21738.00 frames. ], tot_loss[loss=0.2887, simple_loss=0.3487, pruned_loss=0.1143, over 4283681.33 frames. ], batch size: 508, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:44:17,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=539094.0, ans=0.0 2023-06-19 19:44:41,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.12 vs. limit=10.0 2023-06-19 19:45:04,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=539214.0, ans=0.125 2023-06-19 19:45:43,220 INFO [train.py:996] (2/4) Epoch 3, batch 28900, loss[loss=0.2993, simple_loss=0.3665, pruned_loss=0.116, over 21868.00 frames. 
], tot_loss[loss=0.2904, simple_loss=0.351, pruned_loss=0.1149, over 4283831.41 frames. ], batch size: 371, lr: 1.01e-02, grad_scale: 32.0 2023-06-19 19:45:57,286 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 19:45:58,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 3.259e+02 3.859e+02 4.928e+02 8.850e+02, threshold=7.718e+02, percent-clipped=4.0 2023-06-19 19:46:11,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=539394.0, ans=0.125 2023-06-19 19:46:30,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=539454.0, ans=0.0 2023-06-19 19:46:46,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.39 vs. limit=10.0 2023-06-19 19:47:15,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=539574.0, ans=0.0 2023-06-19 19:47:26,875 INFO [train.py:996] (2/4) Epoch 3, batch 28950, loss[loss=0.2931, simple_loss=0.3872, pruned_loss=0.09953, over 21673.00 frames. ], tot_loss[loss=0.289, simple_loss=0.3502, pruned_loss=0.1139, over 4277260.80 frames. ], batch size: 414, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:47:45,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=539634.0, ans=0.0 2023-06-19 19:47:45,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=539634.0, ans=0.2 2023-06-19 19:48:24,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=539754.0, ans=0.125 2023-06-19 19:48:41,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=539814.0, ans=0.125 2023-06-19 19:48:46,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=539814.0, ans=0.0 2023-06-19 19:49:10,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-19 19:49:12,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=539934.0, ans=0.0 2023-06-19 19:49:13,106 INFO [train.py:996] (2/4) Epoch 3, batch 29000, loss[loss=0.2687, simple_loss=0.3312, pruned_loss=0.1031, over 21474.00 frames. ], tot_loss[loss=0.2876, simple_loss=0.351, pruned_loss=0.112, over 4271320.56 frames. ], batch size: 211, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:49:13,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=539934.0, ans=0.125 2023-06-19 19:49:15,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.42 vs. 
limit=15.0 2023-06-19 19:49:27,892 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 2.902e+02 3.366e+02 4.190e+02 7.172e+02, threshold=6.731e+02, percent-clipped=0.0 2023-06-19 19:49:59,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=540054.0, ans=0.0 2023-06-19 19:50:14,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=540054.0, ans=0.0 2023-06-19 19:50:55,883 INFO [train.py:996] (2/4) Epoch 3, batch 29050, loss[loss=0.2755, simple_loss=0.3277, pruned_loss=0.1116, over 21334.00 frames. ], tot_loss[loss=0.2869, simple_loss=0.3506, pruned_loss=0.1116, over 4268434.12 frames. ], batch size: 159, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:52:36,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=540534.0, ans=0.2 2023-06-19 19:52:36,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=540534.0, ans=0.0 2023-06-19 19:52:38,151 INFO [train.py:996] (2/4) Epoch 3, batch 29100, loss[loss=0.2661, simple_loss=0.3162, pruned_loss=0.108, over 21835.00 frames. ], tot_loss[loss=0.2813, simple_loss=0.3426, pruned_loss=0.11, over 4273623.20 frames. ], batch size: 372, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:52:57,601 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.115e+02 2.942e+02 3.636e+02 4.444e+02 9.761e+02, threshold=7.273e+02, percent-clipped=4.0 2023-06-19 19:53:03,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.23 vs. limit=12.0 2023-06-19 19:53:23,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=15.0 2023-06-19 19:54:18,768 INFO [train.py:996] (2/4) Epoch 3, batch 29150, loss[loss=0.2539, simple_loss=0.3114, pruned_loss=0.09821, over 21966.00 frames. ], tot_loss[loss=0.2801, simple_loss=0.3423, pruned_loss=0.1089, over 4273356.24 frames. ], batch size: 103, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:54:31,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=540834.0, ans=0.95 2023-06-19 19:54:34,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=540834.0, ans=0.125 2023-06-19 19:55:58,507 INFO [train.py:996] (2/4) Epoch 3, batch 29200, loss[loss=0.2613, simple_loss=0.3291, pruned_loss=0.09672, over 21731.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3373, pruned_loss=0.1075, over 4267646.91 frames. ], batch size: 333, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 19:56:18,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.074e+02 3.815e+02 4.848e+02 9.248e+02, threshold=7.630e+02, percent-clipped=3.0 2023-06-19 19:56:48,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.35 vs. limit=15.0 2023-06-19 19:57:36,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=541374.0, ans=0.125 2023-06-19 19:57:41,206 INFO [train.py:996] (2/4) Epoch 3, batch 29250, loss[loss=0.2275, simple_loss=0.3026, pruned_loss=0.07627, over 21466.00 frames. 
], tot_loss[loss=0.2703, simple_loss=0.333, pruned_loss=0.1038, over 4255559.88 frames. ], batch size: 212, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 19:57:58,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.25 vs. limit=22.5 2023-06-19 19:59:28,960 INFO [train.py:996] (2/4) Epoch 3, batch 29300, loss[loss=0.2216, simple_loss=0.2907, pruned_loss=0.07627, over 21787.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3326, pruned_loss=0.1021, over 4257087.38 frames. ], batch size: 118, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 19:59:49,545 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.065e+02 3.693e+02 4.587e+02 7.138e+02, threshold=7.387e+02, percent-clipped=0.0 2023-06-19 20:01:11,346 INFO [train.py:996] (2/4) Epoch 3, batch 29350, loss[loss=0.2416, simple_loss=0.3154, pruned_loss=0.08389, over 21230.00 frames. ], tot_loss[loss=0.267, simple_loss=0.3303, pruned_loss=0.1019, over 4259959.06 frames. ], batch size: 176, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:01:13,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=542034.0, ans=0.0 2023-06-19 20:01:32,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=542094.0, ans=0.125 2023-06-19 20:01:32,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=542094.0, ans=0.2 2023-06-19 20:01:55,335 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:02:59,346 INFO [train.py:996] (2/4) Epoch 3, batch 29400, loss[loss=0.3151, simple_loss=0.3749, pruned_loss=0.1276, over 21484.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.329, pruned_loss=0.09947, over 4259070.97 frames. ], batch size: 509, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:03:21,663 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.918e+02 3.507e+02 4.489e+02 7.938e+02, threshold=7.015e+02, percent-clipped=2.0 2023-06-19 20:03:27,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=542394.0, ans=0.1 2023-06-19 20:03:29,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-19 20:03:30,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=542394.0, ans=0.09899494936611666 2023-06-19 20:04:24,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=542574.0, ans=0.125 2023-06-19 20:04:37,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=542574.0, ans=0.125 2023-06-19 20:04:43,537 INFO [train.py:996] (2/4) Epoch 3, batch 29450, loss[loss=0.2817, simple_loss=0.3439, pruned_loss=0.1098, over 21608.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3298, pruned_loss=0.0992, over 4258128.80 frames. 
], batch size: 263, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:04:44,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=542634.0, ans=0.125 2023-06-19 20:05:05,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=542694.0, ans=0.0 2023-06-19 20:05:26,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=542754.0, ans=0.125 2023-06-19 20:06:06,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=542874.0, ans=0.125 2023-06-19 20:06:29,873 INFO [train.py:996] (2/4) Epoch 3, batch 29500, loss[loss=0.3099, simple_loss=0.3591, pruned_loss=0.1303, over 21362.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3338, pruned_loss=0.1034, over 4263470.71 frames. ], batch size: 143, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:06:45,770 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.087e+02 3.959e+02 5.251e+02 8.059e+02, threshold=7.918e+02, percent-clipped=6.0 2023-06-19 20:06:51,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.29 vs. limit=15.0 2023-06-19 20:07:41,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=543114.0, ans=0.125 2023-06-19 20:08:10,049 INFO [train.py:996] (2/4) Epoch 3, batch 29550, loss[loss=0.249, simple_loss=0.3107, pruned_loss=0.09366, over 21676.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3342, pruned_loss=0.1052, over 4271078.17 frames. ], batch size: 263, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:08:30,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=543294.0, ans=0.125 2023-06-19 20:09:26,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=543474.0, ans=0.035 2023-06-19 20:09:54,121 INFO [train.py:996] (2/4) Epoch 3, batch 29600, loss[loss=0.2819, simple_loss=0.3578, pruned_loss=0.103, over 21423.00 frames. ], tot_loss[loss=0.2795, simple_loss=0.3418, pruned_loss=0.1086, over 4281760.81 frames. ], batch size: 211, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:10:15,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.027e+02 3.599e+02 4.338e+02 7.072e+02, threshold=7.197e+02, percent-clipped=0.0 2023-06-19 20:10:33,176 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:11:36,798 INFO [train.py:996] (2/4) Epoch 3, batch 29650, loss[loss=0.2825, simple_loss=0.3598, pruned_loss=0.1026, over 20041.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3356, pruned_loss=0.1032, over 4279074.17 frames. ], batch size: 702, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:11:37,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=543834.0, ans=0.2 2023-06-19 20:12:34,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=543954.0, ans=0.125 2023-06-19 20:13:20,498 INFO [train.py:996] (2/4) Epoch 3, batch 29700, loss[loss=0.2761, simple_loss=0.3375, pruned_loss=0.1074, over 21776.00 frames. 
], tot_loss[loss=0.2729, simple_loss=0.3372, pruned_loss=0.1042, over 4284797.33 frames. ], batch size: 112, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:13:41,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.649e+02 2.987e+02 3.970e+02 7.304e+02, threshold=5.973e+02, percent-clipped=1.0 2023-06-19 20:13:56,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=544254.0, ans=0.125 2023-06-19 20:14:02,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-19 20:14:56,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=544374.0, ans=0.125 2023-06-19 20:15:01,855 INFO [train.py:996] (2/4) Epoch 3, batch 29750, loss[loss=0.2564, simple_loss=0.3396, pruned_loss=0.08653, over 21319.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3407, pruned_loss=0.1035, over 4288544.60 frames. ], batch size: 176, lr: 1.00e-02, grad_scale: 32.0 2023-06-19 20:16:27,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=544674.0, ans=0.0 2023-06-19 20:16:31,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=544674.0, ans=0.07 2023-06-19 20:16:47,596 INFO [train.py:996] (2/4) Epoch 3, batch 29800, loss[loss=0.2417, simple_loss=0.3069, pruned_loss=0.08829, over 21682.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3447, pruned_loss=0.1054, over 4292361.92 frames. ], batch size: 263, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:16:58,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.59 vs. limit=15.0 2023-06-19 20:17:04,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.53 vs. limit=15.0 2023-06-19 20:17:05,134 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.342e+02 4.045e+02 4.978e+02 1.039e+03, threshold=8.090e+02, percent-clipped=10.0 2023-06-19 20:17:09,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=22.5 2023-06-19 20:17:31,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=544854.0, ans=0.0 2023-06-19 20:18:22,309 INFO [train.py:996] (2/4) Epoch 3, batch 29850, loss[loss=0.2248, simple_loss=0.3041, pruned_loss=0.07269, over 21675.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3418, pruned_loss=0.1029, over 4287604.23 frames. ], batch size: 247, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:18:24,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-19 20:18:45,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.05 vs. 
limit=22.5 2023-06-19 20:18:55,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=545094.0, ans=0.125 2023-06-19 20:20:08,638 INFO [train.py:996] (2/4) Epoch 3, batch 29900, loss[loss=0.2608, simple_loss=0.3229, pruned_loss=0.09938, over 21361.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3402, pruned_loss=0.1048, over 4294090.55 frames. ], batch size: 131, lr: 1.00e-02, grad_scale: 16.0 2023-06-19 20:20:26,518 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.121e+02 2.681e+02 3.110e+02 3.688e+02 5.256e+02, threshold=6.220e+02, percent-clipped=0.0 2023-06-19 20:20:41,437 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.03 vs. limit=15.0 2023-06-19 20:20:51,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=545454.0, ans=0.2 2023-06-19 20:21:46,646 INFO [train.py:996] (2/4) Epoch 3, batch 29950, loss[loss=0.277, simple_loss=0.3284, pruned_loss=0.1128, over 20949.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3436, pruned_loss=0.1085, over 4287595.90 frames. ], batch size: 607, lr: 9.99e-03, grad_scale: 16.0 2023-06-19 20:22:04,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=545694.0, ans=0.125 2023-06-19 20:22:39,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=545754.0, ans=0.0 2023-06-19 20:22:59,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=22.5 2023-06-19 20:23:11,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=545874.0, ans=0.125 2023-06-19 20:23:17,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.05 vs. limit=15.0 2023-06-19 20:23:29,429 INFO [train.py:996] (2/4) Epoch 3, batch 30000, loss[loss=0.2326, simple_loss=0.3189, pruned_loss=0.07311, over 21660.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3453, pruned_loss=0.1078, over 4285615.30 frames. ], batch size: 263, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:23:29,430 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 20:23:45,902 INFO [train.py:1028] (2/4) Epoch 3, validation: loss=0.254, simple_loss=0.3581, pruned_loss=0.075, over 1796401.00 frames. 2023-06-19 20:23:45,903 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 20:24:15,689 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.901e+02 3.447e+02 4.272e+02 9.118e+02, threshold=6.893e+02, percent-clipped=6.0 2023-06-19 20:24:33,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=545994.0, ans=0.125 2023-06-19 20:25:02,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.74 vs. 
limit=6.0 2023-06-19 20:25:04,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=546114.0, ans=0.2 2023-06-19 20:25:11,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=546114.0, ans=0.125 2023-06-19 20:25:14,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=22.5 2023-06-19 20:25:30,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=546174.0, ans=0.125 2023-06-19 20:25:43,274 INFO [train.py:996] (2/4) Epoch 3, batch 30050, loss[loss=0.2533, simple_loss=0.3118, pruned_loss=0.09741, over 21067.00 frames. ], tot_loss[loss=0.2784, simple_loss=0.3484, pruned_loss=0.1042, over 4284352.63 frames. ], batch size: 143, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:26:00,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-19 20:26:04,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=546294.0, ans=0.0 2023-06-19 20:26:18,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.97 vs. limit=22.5 2023-06-19 20:26:37,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=546354.0, ans=0.125 2023-06-19 20:26:50,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=546414.0, ans=0.125 2023-06-19 20:26:51,656 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:27:06,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=546474.0, ans=0.0 2023-06-19 20:27:23,655 INFO [train.py:996] (2/4) Epoch 3, batch 30100, loss[loss=0.2383, simple_loss=0.2866, pruned_loss=0.09503, over 21181.00 frames. ], tot_loss[loss=0.2777, simple_loss=0.3473, pruned_loss=0.1041, over 4286496.96 frames. ], batch size: 549, lr: 9.99e-03, grad_scale: 32.0 2023-06-19 20:27:46,643 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.970e+02 3.475e+02 4.229e+02 7.609e+02, threshold=6.950e+02, percent-clipped=3.0 2023-06-19 20:27:59,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=546594.0, ans=0.05 2023-06-19 20:28:25,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=546714.0, ans=0.125 2023-06-19 20:29:10,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=546834.0, ans=0.0 2023-06-19 20:29:11,599 INFO [train.py:996] (2/4) Epoch 3, batch 30150, loss[loss=0.2872, simple_loss=0.3473, pruned_loss=0.1136, over 21973.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3449, pruned_loss=0.1073, over 4287186.79 frames. 
], batch size: 317, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:29:55,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=546954.0, ans=0.125 2023-06-19 20:30:11,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=547014.0, ans=0.125 2023-06-19 20:30:12,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-19 20:31:01,092 INFO [train.py:996] (2/4) Epoch 3, batch 30200, loss[loss=0.2574, simple_loss=0.3375, pruned_loss=0.08861, over 21626.00 frames. ], tot_loss[loss=0.2806, simple_loss=0.3485, pruned_loss=0.1064, over 4282626.98 frames. ], batch size: 263, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:31:11,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-06-19 20:31:20,349 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.884e+02 3.477e+02 4.360e+02 6.992e+02, threshold=6.953e+02, percent-clipped=1.0 2023-06-19 20:32:43,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=547374.0, ans=0.04949747468305833 2023-06-19 20:32:45,991 INFO [train.py:996] (2/4) Epoch 3, batch 30250, loss[loss=0.3661, simple_loss=0.4476, pruned_loss=0.1423, over 21872.00 frames. ], tot_loss[loss=0.2868, simple_loss=0.3557, pruned_loss=0.109, over 4281014.89 frames. ], batch size: 372, lr: 9.98e-03, grad_scale: 32.0 2023-06-19 20:33:49,761 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 20:34:08,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=15.0 2023-06-19 20:34:15,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.53 vs. limit=15.0 2023-06-19 20:34:22,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.76 vs. limit=15.0 2023-06-19 20:34:23,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=547674.0, ans=0.0 2023-06-19 20:34:28,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=547734.0, ans=0.125 2023-06-19 20:34:29,309 INFO [train.py:996] (2/4) Epoch 3, batch 30300, loss[loss=0.2379, simple_loss=0.2909, pruned_loss=0.09247, over 21464.00 frames. ], tot_loss[loss=0.2835, simple_loss=0.3509, pruned_loss=0.1081, over 4285126.18 frames. ], batch size: 195, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:34:52,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.189e+02 3.746e+02 4.977e+02 8.102e+02, threshold=7.493e+02, percent-clipped=4.0 2023-06-19 20:36:13,961 INFO [train.py:996] (2/4) Epoch 3, batch 30350, loss[loss=0.2039, simple_loss=0.2617, pruned_loss=0.07304, over 21296.00 frames. ], tot_loss[loss=0.2872, simple_loss=0.3539, pruned_loss=0.1103, over 4277961.80 frames. 
], batch size: 176, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:37:08,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-19 20:37:11,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=548214.0, ans=0.025 2023-06-19 20:37:20,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=548214.0, ans=0.04949747468305833 2023-06-19 20:37:37,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=548274.0, ans=0.0 2023-06-19 20:37:43,098 INFO [train.py:996] (2/4) Epoch 3, batch 30400, loss[loss=0.2824, simple_loss=0.3205, pruned_loss=0.1221, over 20367.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3441, pruned_loss=0.1063, over 4270508.75 frames. ], batch size: 703, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:37:48,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=548334.0, ans=0.125 2023-06-19 20:37:59,840 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 3.466e+02 4.166e+02 5.135e+02 9.055e+02, threshold=8.331e+02, percent-clipped=4.0 2023-06-19 20:38:01,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=548394.0, ans=0.2 2023-06-19 20:38:29,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=548514.0, ans=0.0 2023-06-19 20:38:56,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=548574.0, ans=0.2 2023-06-19 20:39:04,644 INFO [train.py:996] (2/4) Epoch 3, batch 30450, loss[loss=0.3335, simple_loss=0.4425, pruned_loss=0.1123, over 19850.00 frames. ], tot_loss[loss=0.2803, simple_loss=0.3466, pruned_loss=0.107, over 4209448.21 frames. ], batch size: 702, lr: 9.97e-03, grad_scale: 32.0 2023-06-19 20:39:38,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=548754.0, ans=0.2 2023-06-19 20:41:58,168 INFO [train.py:996] (2/4) Epoch 4, batch 0, loss[loss=0.2797, simple_loss=0.3278, pruned_loss=0.1158, over 21498.00 frames. ], tot_loss[loss=0.2797, simple_loss=0.3278, pruned_loss=0.1158, over 21498.00 frames. ], batch size: 212, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:41:58,169 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 20:42:15,980 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2612, simple_loss=0.3698, pruned_loss=0.07632, over 1796401.00 frames. 2023-06-19 20:42:15,981 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 20:42:45,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.032e+02 5.518e+02 8.293e+02 1.240e+03 3.012e+03, threshold=1.659e+03, percent-clipped=49.0 2023-06-19 20:42:49,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=548964.0, ans=0.1 2023-06-19 20:43:14,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=549024.0, ans=0.0 2023-06-19 20:43:52,596 INFO [train.py:996] (2/4) Epoch 4, batch 50, loss[loss=0.2906, simple_loss=0.3555, pruned_loss=0.1128, over 21430.00 frames. 
], tot_loss[loss=0.2792, simple_loss=0.3458, pruned_loss=0.1063, over 952269.44 frames. ], batch size: 211, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:43:53,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=22.5 2023-06-19 20:43:54,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=549204.0, ans=0.2 2023-06-19 20:44:11,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=549204.0, ans=0.0 2023-06-19 20:44:17,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-19 20:45:33,222 INFO [train.py:996] (2/4) Epoch 4, batch 100, loss[loss=0.3208, simple_loss=0.3928, pruned_loss=0.1244, over 21734.00 frames. ], tot_loss[loss=0.2907, simple_loss=0.3628, pruned_loss=0.1093, over 1691356.90 frames. ], batch size: 441, lr: 8.60e-03, grad_scale: 32.0 2023-06-19 20:45:33,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=549504.0, ans=0.125 2023-06-19 20:46:04,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=549564.0, ans=0.07 2023-06-19 20:46:08,884 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.893e+02 3.441e+02 3.943e+02 7.428e+02, threshold=6.883e+02, percent-clipped=0.0 2023-06-19 20:46:17,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=549624.0, ans=0.125 2023-06-19 20:46:50,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-19 20:46:58,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=549744.0, ans=0.0 2023-06-19 20:47:02,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=549744.0, ans=0.035 2023-06-19 20:47:07,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=549744.0, ans=0.0 2023-06-19 20:47:10,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.66 vs. limit=22.5 2023-06-19 20:47:13,452 INFO [train.py:996] (2/4) Epoch 4, batch 150, loss[loss=0.2727, simple_loss=0.3447, pruned_loss=0.1003, over 21804.00 frames. ], tot_loss[loss=0.2914, simple_loss=0.3645, pruned_loss=0.1091, over 2266167.12 frames. ], batch size: 282, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:47:36,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=549864.0, ans=0.2 2023-06-19 20:47:54,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=549924.0, ans=0.125 2023-06-19 20:48:29,252 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.29 vs. 
limit=6.0 2023-06-19 20:48:37,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=550044.0, ans=0.125 2023-06-19 20:48:47,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=550044.0, ans=0.125 2023-06-19 20:48:53,414 INFO [train.py:996] (2/4) Epoch 4, batch 200, loss[loss=0.2943, simple_loss=0.3516, pruned_loss=0.1185, over 21258.00 frames. ], tot_loss[loss=0.2902, simple_loss=0.3624, pruned_loss=0.109, over 2717981.08 frames. ], batch size: 143, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:49:29,402 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.787e+02 3.303e+02 4.395e+02 6.398e+02, threshold=6.606e+02, percent-clipped=0.0 2023-06-19 20:49:29,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=550164.0, ans=0.125 2023-06-19 20:49:56,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=550284.0, ans=0.125 2023-06-19 20:50:35,692 INFO [train.py:996] (2/4) Epoch 4, batch 250, loss[loss=0.2346, simple_loss=0.2977, pruned_loss=0.08571, over 21822.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3576, pruned_loss=0.1077, over 3060694.00 frames. ], batch size: 107, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:50:42,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=550404.0, ans=0.0 2023-06-19 20:50:57,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=550464.0, ans=0.125 2023-06-19 20:51:04,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=550464.0, ans=0.0 2023-06-19 20:51:53,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-19 20:52:19,248 INFO [train.py:996] (2/4) Epoch 4, batch 300, loss[loss=0.2517, simple_loss=0.315, pruned_loss=0.09419, over 20958.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3507, pruned_loss=0.1055, over 3329085.05 frames. ], batch size: 607, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:52:20,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=550704.0, ans=0.0 2023-06-19 20:52:33,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=550704.0, ans=0.1 2023-06-19 20:52:57,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.088e+02 3.665e+02 5.063e+02 1.079e+03, threshold=7.330e+02, percent-clipped=8.0 2023-06-19 20:53:06,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.57 vs. 
limit=15.0 2023-06-19 20:53:30,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=550884.0, ans=0.0 2023-06-19 20:53:45,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=550884.0, ans=0.125 2023-06-19 20:54:05,670 INFO [train.py:996] (2/4) Epoch 4, batch 350, loss[loss=0.2686, simple_loss=0.358, pruned_loss=0.08963, over 21782.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3428, pruned_loss=0.1048, over 3534675.73 frames. ], batch size: 351, lr: 8.59e-03, grad_scale: 32.0 2023-06-19 20:55:01,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=551124.0, ans=0.2 2023-06-19 20:55:33,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=551244.0, ans=0.125 2023-06-19 20:55:54,947 INFO [train.py:996] (2/4) Epoch 4, batch 400, loss[loss=0.2822, simple_loss=0.3376, pruned_loss=0.1134, over 21322.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3339, pruned_loss=0.1016, over 3695767.40 frames. ], batch size: 471, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:56:14,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-19 20:56:26,012 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.883e+02 3.575e+02 4.503e+02 7.615e+02, threshold=7.149e+02, percent-clipped=2.0 2023-06-19 20:56:58,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=551484.0, ans=0.2 2023-06-19 20:57:12,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-19 20:57:17,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.62 vs. limit=12.0 2023-06-19 20:57:19,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.01 vs. limit=10.0 2023-06-19 20:57:37,279 INFO [train.py:996] (2/4) Epoch 4, batch 450, loss[loss=0.2656, simple_loss=0.3617, pruned_loss=0.08472, over 21228.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3361, pruned_loss=0.1015, over 3830216.18 frames. ], batch size: 548, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:58:03,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=551664.0, ans=0.125 2023-06-19 20:58:07,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=551664.0, ans=0.1 2023-06-19 20:58:18,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=551724.0, ans=0.0 2023-06-19 20:59:19,195 INFO [train.py:996] (2/4) Epoch 4, batch 500, loss[loss=0.2817, simple_loss=0.3838, pruned_loss=0.08978, over 21771.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3357, pruned_loss=0.09823, over 3930720.94 frames. 
], batch size: 351, lr: 8.58e-03, grad_scale: 32.0 2023-06-19 20:59:53,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 2.948e+02 3.424e+02 4.506e+02 6.960e+02, threshold=6.848e+02, percent-clipped=0.0 2023-06-19 21:01:02,299 INFO [train.py:996] (2/4) Epoch 4, batch 550, loss[loss=0.2962, simple_loss=0.3525, pruned_loss=0.1199, over 21863.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3381, pruned_loss=0.09829, over 4013136.18 frames. ], batch size: 371, lr: 8.58e-03, grad_scale: 16.0 2023-06-19 21:01:26,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=552264.0, ans=0.0 2023-06-19 21:01:42,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-19 21:02:45,488 INFO [train.py:996] (2/4) Epoch 4, batch 600, loss[loss=0.2496, simple_loss=0.3183, pruned_loss=0.09044, over 21914.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3402, pruned_loss=0.09828, over 4071695.39 frames. ], batch size: 351, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:03:13,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=552564.0, ans=0.1 2023-06-19 21:03:17,437 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 3.276e+02 3.981e+02 4.951e+02 8.718e+02, threshold=7.962e+02, percent-clipped=3.0 2023-06-19 21:03:34,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=15.0 2023-06-19 21:03:36,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=552624.0, ans=0.0 2023-06-19 21:03:42,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=552624.0, ans=0.125 2023-06-19 21:03:47,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-19 21:04:28,061 INFO [train.py:996] (2/4) Epoch 4, batch 650, loss[loss=0.2645, simple_loss=0.3233, pruned_loss=0.1028, over 21854.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3403, pruned_loss=0.09839, over 4113127.63 frames. ], batch size: 98, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:04:34,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=552804.0, ans=0.2 2023-06-19 21:04:38,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=552804.0, ans=0.2 2023-06-19 21:04:58,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=552864.0, ans=10.0 2023-06-19 21:04:59,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=552864.0, ans=0.0 2023-06-19 21:05:46,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=552984.0, ans=0.0 2023-06-19 21:06:10,745 INFO [train.py:996] (2/4) Epoch 4, batch 700, loss[loss=0.2807, simple_loss=0.3292, pruned_loss=0.116, over 21206.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3406, pruned_loss=0.101, over 4148736.27 frames. 
], batch size: 159, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:06:42,417 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:06:43,343 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.524e+02 3.407e+02 4.015e+02 5.310e+02 1.031e+03, threshold=8.030e+02, percent-clipped=3.0 2023-06-19 21:07:29,095 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:07:52,979 INFO [train.py:996] (2/4) Epoch 4, batch 750, loss[loss=0.2519, simple_loss=0.3188, pruned_loss=0.09247, over 21531.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3376, pruned_loss=0.1009, over 4184826.35 frames. ], batch size: 263, lr: 8.57e-03, grad_scale: 16.0 2023-06-19 21:08:06,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=553404.0, ans=0.1 2023-06-19 21:08:07,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=553464.0, ans=0.025 2023-06-19 21:09:31,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=553644.0, ans=0.0 2023-06-19 21:09:34,468 INFO [train.py:996] (2/4) Epoch 4, batch 800, loss[loss=0.2895, simple_loss=0.3445, pruned_loss=0.1172, over 21877.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3354, pruned_loss=0.101, over 4193832.99 frames. ], batch size: 414, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:09:48,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. limit=10.0 2023-06-19 21:10:04,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=553764.0, ans=0.2 2023-06-19 21:10:07,072 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.257e+02 3.089e+02 3.541e+02 4.418e+02 8.046e+02, threshold=7.083e+02, percent-clipped=1.0 2023-06-19 21:10:56,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.74 vs. limit=6.0 2023-06-19 21:10:59,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=553944.0, ans=0.125 2023-06-19 21:11:18,274 INFO [train.py:996] (2/4) Epoch 4, batch 850, loss[loss=0.2877, simple_loss=0.3477, pruned_loss=0.1138, over 21926.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3394, pruned_loss=0.1026, over 4213524.78 frames. ], batch size: 124, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:11:22,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=554004.0, ans=0.125 2023-06-19 21:11:45,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.40 vs. 
limit=15.0 2023-06-19 21:11:57,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=554064.0, ans=0.1 2023-06-19 21:12:00,573 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:12:31,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=554184.0, ans=0.025 2023-06-19 21:13:02,653 INFO [train.py:996] (2/4) Epoch 4, batch 900, loss[loss=0.2729, simple_loss=0.3286, pruned_loss=0.1086, over 21816.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3367, pruned_loss=0.102, over 4231719.52 frames. ], batch size: 247, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:13:05,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=554304.0, ans=22.5 2023-06-19 21:13:28,199 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2023-06-19 21:13:29,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=554364.0, ans=0.125 2023-06-19 21:13:40,701 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 3.017e+02 3.559e+02 4.118e+02 8.031e+02, threshold=7.118e+02, percent-clipped=1.0 2023-06-19 21:13:54,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=554424.0, ans=0.125 2023-06-19 21:14:07,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=554484.0, ans=0.0 2023-06-19 21:14:11,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=554484.0, ans=0.125 2023-06-19 21:14:45,102 INFO [train.py:996] (2/4) Epoch 4, batch 950, loss[loss=0.235, simple_loss=0.2852, pruned_loss=0.09244, over 21175.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3329, pruned_loss=0.1011, over 4247132.77 frames. ], batch size: 548, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:14:47,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=554604.0, ans=0.2 2023-06-19 21:15:55,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=554784.0, ans=0.0 2023-06-19 21:16:13,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=554844.0, ans=0.2 2023-06-19 21:16:18,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=554844.0, ans=0.125 2023-06-19 21:16:23,150 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:16:27,606 INFO [train.py:996] (2/4) Epoch 4, batch 1000, loss[loss=0.2399, simple_loss=0.3202, pruned_loss=0.07983, over 21698.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3317, pruned_loss=0.1008, over 4261400.11 frames. 
], batch size: 263, lr: 8.56e-03, grad_scale: 32.0 2023-06-19 21:16:58,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=554964.0, ans=0.04949747468305833 2023-06-19 21:17:12,746 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.951e+02 3.502e+02 4.133e+02 7.133e+02, threshold=7.004e+02, percent-clipped=1.0 2023-06-19 21:17:50,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-19 21:18:15,446 INFO [train.py:996] (2/4) Epoch 4, batch 1050, loss[loss=0.3596, simple_loss=0.4532, pruned_loss=0.1331, over 20899.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3333, pruned_loss=0.1022, over 4269623.38 frames. ], batch size: 608, lr: 8.55e-03, grad_scale: 16.0 2023-06-19 21:18:24,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=555204.0, ans=0.2 2023-06-19 21:18:53,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=555264.0, ans=0.1 2023-06-19 21:19:19,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=555384.0, ans=0.125 2023-06-19 21:19:50,414 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-19 21:19:58,902 INFO [train.py:996] (2/4) Epoch 4, batch 1100, loss[loss=0.2182, simple_loss=0.2793, pruned_loss=0.0786, over 21245.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3349, pruned_loss=0.1023, over 4276041.05 frames. ], batch size: 548, lr: 8.55e-03, grad_scale: 16.0 2023-06-19 21:20:39,867 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 3.086e+02 3.737e+02 4.742e+02 7.537e+02, threshold=7.473e+02, percent-clipped=2.0 2023-06-19 21:20:48,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=555624.0, ans=0.125 2023-06-19 21:21:16,026 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-19 21:21:43,871 INFO [train.py:996] (2/4) Epoch 4, batch 1150, loss[loss=0.2654, simple_loss=0.3411, pruned_loss=0.0948, over 21758.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3348, pruned_loss=0.1015, over 4274248.64 frames. 
], batch size: 298, lr: 8.55e-03, grad_scale: 16.0 2023-06-19 21:21:44,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=555804.0, ans=0.0 2023-06-19 21:21:52,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=555804.0, ans=0.125 2023-06-19 21:22:14,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=555864.0, ans=0.0 2023-06-19 21:22:18,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=555864.0, ans=0.1 2023-06-19 21:22:36,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=555924.0, ans=0.07 2023-06-19 21:22:44,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=555984.0, ans=0.0 2023-06-19 21:22:51,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=555984.0, ans=0.125 2023-06-19 21:23:25,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=556044.0, ans=0.125 2023-06-19 21:23:29,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=556044.0, ans=0.125 2023-06-19 21:23:33,575 INFO [train.py:996] (2/4) Epoch 4, batch 1200, loss[loss=0.3208, simple_loss=0.3799, pruned_loss=0.1309, over 21884.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3361, pruned_loss=0.102, over 4278373.00 frames. ], batch size: 371, lr: 8.55e-03, grad_scale: 32.0 2023-06-19 21:23:56,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=556164.0, ans=0.0 2023-06-19 21:24:04,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=556164.0, ans=0.125 2023-06-19 21:24:08,667 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 2.755e+02 3.087e+02 3.854e+02 6.716e+02, threshold=6.173e+02, percent-clipped=0.0 2023-06-19 21:24:27,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.20 vs. limit=15.0 2023-06-19 21:25:01,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=556344.0, ans=0.125 2023-06-19 21:25:03,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-19 21:25:04,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=556344.0, ans=0.1 2023-06-19 21:25:15,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-19 21:25:17,521 INFO [train.py:996] (2/4) Epoch 4, batch 1250, loss[loss=0.2812, simple_loss=0.3505, pruned_loss=0.1059, over 21672.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3376, pruned_loss=0.1024, over 4278082.24 frames. 
], batch size: 351, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:25:18,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.43 vs. limit=22.5 2023-06-19 21:25:58,475 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:26:54,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=556644.0, ans=0.1 2023-06-19 21:26:54,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-19 21:27:02,103 INFO [train.py:996] (2/4) Epoch 4, batch 1300, loss[loss=0.2783, simple_loss=0.3434, pruned_loss=0.1066, over 21915.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3378, pruned_loss=0.1028, over 4277369.55 frames. ], batch size: 351, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:27:36,354 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.956e+02 2.941e+02 3.345e+02 4.151e+02 1.109e+03, threshold=6.689e+02, percent-clipped=6.0 2023-06-19 21:28:44,705 INFO [train.py:996] (2/4) Epoch 4, batch 1350, loss[loss=0.2773, simple_loss=0.3266, pruned_loss=0.114, over 21688.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3368, pruned_loss=0.1021, over 4279362.11 frames. ], batch size: 230, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:29:54,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.58 vs. limit=15.0 2023-06-19 21:30:10,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=557244.0, ans=0.0 2023-06-19 21:30:17,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=557244.0, ans=0.125 2023-06-19 21:30:25,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=557244.0, ans=0.125 2023-06-19 21:30:27,888 INFO [train.py:996] (2/4) Epoch 4, batch 1400, loss[loss=0.2681, simple_loss=0.3151, pruned_loss=0.1105, over 21652.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3355, pruned_loss=0.1024, over 4287828.15 frames. ], batch size: 282, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:30:29,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-06-19 21:30:52,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=557364.0, ans=0.2 2023-06-19 21:31:03,574 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.007e+02 3.409e+02 4.154e+02 6.851e+02, threshold=6.817e+02, percent-clipped=4.0 2023-06-19 21:31:06,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=557424.0, ans=0.2 2023-06-19 21:31:28,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=557424.0, ans=0.125 2023-06-19 21:32:18,695 INFO [train.py:996] (2/4) Epoch 4, batch 1450, loss[loss=0.282, simple_loss=0.3457, pruned_loss=0.1091, over 21817.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.334, pruned_loss=0.1025, over 4295860.98 frames. 
], batch size: 282, lr: 8.54e-03, grad_scale: 32.0 2023-06-19 21:32:30,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=557604.0, ans=0.05 2023-06-19 21:32:32,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=557604.0, ans=0.1 2023-06-19 21:32:40,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=557664.0, ans=0.125 2023-06-19 21:33:43,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=557844.0, ans=0.125 2023-06-19 21:33:48,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=557844.0, ans=0.125 2023-06-19 21:34:02,933 INFO [train.py:996] (2/4) Epoch 4, batch 1500, loss[loss=0.2989, simple_loss=0.3532, pruned_loss=0.1223, over 21900.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3372, pruned_loss=0.1043, over 4294607.17 frames. ], batch size: 333, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:34:25,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=557964.0, ans=0.125 2023-06-19 21:34:33,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.980e+02 3.543e+02 4.143e+02 6.339e+02, threshold=7.086e+02, percent-clipped=0.0 2023-06-19 21:34:51,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=558024.0, ans=0.07 2023-06-19 21:35:04,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=558084.0, ans=0.2 2023-06-19 21:35:23,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2023-06-19 21:35:49,421 INFO [train.py:996] (2/4) Epoch 4, batch 1550, loss[loss=0.2262, simple_loss=0.318, pruned_loss=0.0672, over 21756.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3355, pruned_loss=0.1028, over 4297520.52 frames. 
], batch size: 332, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:35:59,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=558204.0, ans=0.0 2023-06-19 21:36:14,687 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:36:31,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=558264.0, ans=0.125 2023-06-19 21:36:34,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=558324.0, ans=0.1 2023-06-19 21:36:42,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=558324.0, ans=10.0 2023-06-19 21:37:01,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=558384.0, ans=0.07 2023-06-19 21:37:23,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=558444.0, ans=0.125 2023-06-19 21:37:34,580 INFO [train.py:996] (2/4) Epoch 4, batch 1600, loss[loss=0.2487, simple_loss=0.3137, pruned_loss=0.0919, over 21714.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3352, pruned_loss=0.1022, over 4291149.29 frames. ], batch size: 332, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:38:15,097 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 2.993e+02 3.386e+02 4.443e+02 8.016e+02, threshold=6.773e+02, percent-clipped=2.0 2023-06-19 21:39:08,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=558744.0, ans=0.125 2023-06-19 21:39:19,173 INFO [train.py:996] (2/4) Epoch 4, batch 1650, loss[loss=0.2571, simple_loss=0.3052, pruned_loss=0.1045, over 21488.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3353, pruned_loss=0.1022, over 4291012.21 frames. ], batch size: 212, lr: 8.53e-03, grad_scale: 32.0 2023-06-19 21:39:23,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=558804.0, ans=0.2 2023-06-19 21:40:33,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=558984.0, ans=0.0 2023-06-19 21:40:41,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=558984.0, ans=0.125 2023-06-19 21:40:41,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=558984.0, ans=10.0 2023-06-19 21:40:43,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=558984.0, ans=0.0 2023-06-19 21:40:59,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=559044.0, ans=0.125 2023-06-19 21:41:05,521 INFO [train.py:996] (2/4) Epoch 4, batch 1700, loss[loss=0.2811, simple_loss=0.3231, pruned_loss=0.1195, over 21412.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3404, pruned_loss=0.1057, over 4284568.62 frames. 
], batch size: 473, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:41:53,214 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.180e+02 2.877e+02 3.357e+02 4.119e+02 6.244e+02, threshold=6.713e+02, percent-clipped=0.0 2023-06-19 21:42:13,718 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-06-19 21:42:14,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=559224.0, ans=0.125 2023-06-19 21:42:17,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-19 21:42:26,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=559284.0, ans=0.125 2023-06-19 21:42:42,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=559344.0, ans=0.125 2023-06-19 21:42:56,596 INFO [train.py:996] (2/4) Epoch 4, batch 1750, loss[loss=0.223, simple_loss=0.2984, pruned_loss=0.07385, over 21560.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3387, pruned_loss=0.1026, over 4279865.23 frames. ], batch size: 212, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:43:45,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=559524.0, ans=0.0 2023-06-19 21:44:44,132 INFO [train.py:996] (2/4) Epoch 4, batch 1800, loss[loss=0.2577, simple_loss=0.3485, pruned_loss=0.08349, over 21726.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3345, pruned_loss=0.09814, over 4273541.21 frames. ], batch size: 332, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:45:17,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=559764.0, ans=0.0 2023-06-19 21:45:27,961 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.187e+02 3.069e+02 3.500e+02 4.481e+02 7.550e+02, threshold=6.999e+02, percent-clipped=2.0 2023-06-19 21:45:30,609 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-06-19 21:45:34,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=559824.0, ans=0.2 2023-06-19 21:46:23,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-19 21:46:33,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=560004.0, ans=0.125 2023-06-19 21:46:34,188 INFO [train.py:996] (2/4) Epoch 4, batch 1850, loss[loss=0.3488, simple_loss=0.4333, pruned_loss=0.1322, over 20825.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.335, pruned_loss=0.09584, over 4265840.15 frames. ], batch size: 607, lr: 8.52e-03, grad_scale: 16.0 2023-06-19 21:47:13,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=560124.0, ans=0.1 2023-06-19 21:47:55,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.92 vs. 
limit=15.0 2023-06-19 21:48:17,786 INFO [train.py:996] (2/4) Epoch 4, batch 1900, loss[loss=0.2205, simple_loss=0.2837, pruned_loss=0.07867, over 21382.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3361, pruned_loss=0.09759, over 4272645.77 frames. ], batch size: 159, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:48:35,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=560304.0, ans=0.125 2023-06-19 21:48:44,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=560364.0, ans=0.0 2023-06-19 21:48:53,840 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.971e+02 3.385e+02 4.219e+02 8.098e+02, threshold=6.770e+02, percent-clipped=2.0 2023-06-19 21:49:37,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=22.5 2023-06-19 21:50:02,301 INFO [train.py:996] (2/4) Epoch 4, batch 1950, loss[loss=0.2636, simple_loss=0.3301, pruned_loss=0.09856, over 21259.00 frames. ], tot_loss[loss=0.263, simple_loss=0.332, pruned_loss=0.09701, over 4271644.69 frames. ], batch size: 549, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:51:22,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=560784.0, ans=0.1 2023-06-19 21:51:25,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=560844.0, ans=0.05 2023-06-19 21:51:46,771 INFO [train.py:996] (2/4) Epoch 4, batch 2000, loss[loss=0.2862, simple_loss=0.3603, pruned_loss=0.1061, over 21739.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3262, pruned_loss=0.09533, over 4268846.25 frames. ], batch size: 332, lr: 8.51e-03, grad_scale: 32.0 2023-06-19 21:52:02,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=560904.0, ans=0.125 2023-06-19 21:52:24,014 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.002e+02 3.642e+02 4.364e+02 7.369e+02, threshold=7.284e+02, percent-clipped=1.0 2023-06-19 21:52:50,104 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 21:53:30,408 INFO [train.py:996] (2/4) Epoch 4, batch 2050, loss[loss=0.2448, simple_loss=0.3115, pruned_loss=0.08901, over 21445.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3298, pruned_loss=0.09613, over 4276701.75 frames. ], batch size: 131, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:53:51,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=561264.0, ans=0.0 2023-06-19 21:55:20,890 INFO [train.py:996] (2/4) Epoch 4, batch 2100, loss[loss=0.3095, simple_loss=0.3523, pruned_loss=0.1334, over 19985.00 frames. ], tot_loss[loss=0.265, simple_loss=0.333, pruned_loss=0.09845, over 4276051.85 frames. 
], batch size: 702, lr: 8.51e-03, grad_scale: 16.0 2023-06-19 21:55:48,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=561564.0, ans=0.0 2023-06-19 21:55:51,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=561564.0, ans=0.0 2023-06-19 21:55:51,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=561564.0, ans=0.04949747468305833 2023-06-19 21:55:59,296 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 3.198e+02 3.847e+02 4.816e+02 7.420e+02, threshold=7.693e+02, percent-clipped=1.0 2023-06-19 21:56:39,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-19 21:57:05,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=561804.0, ans=0.0 2023-06-19 21:57:06,057 INFO [train.py:996] (2/4) Epoch 4, batch 2150, loss[loss=0.2605, simple_loss=0.3413, pruned_loss=0.08981, over 21636.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3334, pruned_loss=0.1007, over 4269812.47 frames. ], batch size: 298, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 21:57:31,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=561864.0, ans=0.2 2023-06-19 21:57:42,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=561864.0, ans=0.2 2023-06-19 21:57:55,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=561924.0, ans=0.125 2023-06-19 21:58:17,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=561984.0, ans=0.125 2023-06-19 21:58:46,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=562044.0, ans=0.0 2023-06-19 21:58:50,864 INFO [train.py:996] (2/4) Epoch 4, batch 2200, loss[loss=0.2883, simple_loss=0.3522, pruned_loss=0.1122, over 21800.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3335, pruned_loss=0.09868, over 4264554.17 frames. 
], batch size: 441, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 21:58:55,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=562104.0, ans=0.125 2023-06-19 21:59:11,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=562164.0, ans=0.125 2023-06-19 21:59:28,575 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.061e+02 3.534e+02 4.711e+02 8.653e+02, threshold=7.068e+02, percent-clipped=2.0 2023-06-19 21:59:35,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=562224.0, ans=0.125 2023-06-19 21:59:50,822 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:00:03,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=562284.0, ans=0.125 2023-06-19 22:00:29,273 INFO [train.py:996] (2/4) Epoch 4, batch 2250, loss[loss=0.2014, simple_loss=0.2645, pruned_loss=0.06916, over 21796.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3303, pruned_loss=0.09592, over 4264627.33 frames. ], batch size: 124, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 22:00:36,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-19 22:01:22,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=562524.0, ans=0.07 2023-06-19 22:02:08,518 INFO [train.py:996] (2/4) Epoch 4, batch 2300, loss[loss=0.2739, simple_loss=0.3099, pruned_loss=0.119, over 21332.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3269, pruned_loss=0.09625, over 4257821.25 frames. ], batch size: 507, lr: 8.50e-03, grad_scale: 16.0 2023-06-19 22:02:23,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=562764.0, ans=0.2 2023-06-19 22:02:51,590 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.293e+02 3.061e+02 3.548e+02 4.710e+02 1.046e+03, threshold=7.097e+02, percent-clipped=5.0 2023-06-19 22:03:55,584 INFO [train.py:996] (2/4) Epoch 4, batch 2350, loss[loss=0.2149, simple_loss=0.2822, pruned_loss=0.07378, over 21386.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3236, pruned_loss=0.09664, over 4265894.96 frames. ], batch size: 211, lr: 8.49e-03, grad_scale: 16.0 2023-06-19 22:04:02,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=563004.0, ans=0.125 2023-06-19 22:04:21,646 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:04:44,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=563124.0, ans=15.0 2023-06-19 22:05:39,374 INFO [train.py:996] (2/4) Epoch 4, batch 2400, loss[loss=0.3319, simple_loss=0.3886, pruned_loss=0.1376, over 21553.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3309, pruned_loss=0.1005, over 4268732.13 frames. 
], batch size: 389, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:05:53,938 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=22.5 2023-06-19 22:06:00,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=563364.0, ans=0.1 2023-06-19 22:06:03,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=563364.0, ans=0.125 2023-06-19 22:06:23,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.093e+02 3.486e+02 4.537e+02 7.539e+02, threshold=6.972e+02, percent-clipped=1.0 2023-06-19 22:07:23,983 INFO [train.py:996] (2/4) Epoch 4, batch 2450, loss[loss=0.2815, simple_loss=0.3337, pruned_loss=0.1147, over 21124.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3376, pruned_loss=0.1034, over 4274536.94 frames. ], batch size: 143, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:07:58,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=563664.0, ans=0.0 2023-06-19 22:08:09,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=563724.0, ans=0.1 2023-06-19 22:08:45,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-19 22:09:02,306 INFO [train.py:996] (2/4) Epoch 4, batch 2500, loss[loss=0.2598, simple_loss=0.3416, pruned_loss=0.08901, over 21392.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3347, pruned_loss=0.1014, over 4264359.27 frames. ], batch size: 194, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:09:43,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-19 22:09:45,353 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.863e+02 3.660e+02 4.293e+02 8.660e+02, threshold=7.321e+02, percent-clipped=2.0 2023-06-19 22:09:57,726 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.80 vs. limit=15.0 2023-06-19 22:10:00,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=564084.0, ans=0.0 2023-06-19 22:10:41,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=564144.0, ans=0.07 2023-06-19 22:10:45,496 INFO [train.py:996] (2/4) Epoch 4, batch 2550, loss[loss=0.2615, simple_loss=0.3166, pruned_loss=0.1032, over 21539.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3334, pruned_loss=0.09976, over 4265150.96 frames. 
], batch size: 414, lr: 8.49e-03, grad_scale: 32.0 2023-06-19 22:11:23,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=564264.0, ans=0.0 2023-06-19 22:11:44,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=564384.0, ans=0.0 2023-06-19 22:12:21,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=564444.0, ans=0.1 2023-06-19 22:12:29,216 INFO [train.py:996] (2/4) Epoch 4, batch 2600, loss[loss=0.2906, simple_loss=0.3551, pruned_loss=0.113, over 21641.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3364, pruned_loss=0.1025, over 4263545.51 frames. ], batch size: 389, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:12:30,429 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-19 22:12:46,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=564504.0, ans=0.05 2023-06-19 22:12:52,033 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.57 vs. limit=12.0 2023-06-19 22:13:12,183 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 3.048e+02 3.693e+02 4.515e+02 8.330e+02, threshold=7.386e+02, percent-clipped=1.0 2023-06-19 22:13:26,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=564624.0, ans=0.04949747468305833 2023-06-19 22:13:31,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-19 22:13:54,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=564744.0, ans=0.125 2023-06-19 22:13:57,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=564744.0, ans=0.125 2023-06-19 22:14:10,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=564804.0, ans=0.125 2023-06-19 22:14:11,626 INFO [train.py:996] (2/4) Epoch 4, batch 2650, loss[loss=0.2534, simple_loss=0.3228, pruned_loss=0.09204, over 21858.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3365, pruned_loss=0.1034, over 4270840.26 frames. ], batch size: 371, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:14:13,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=564804.0, ans=0.125 2023-06-19 22:15:44,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=565044.0, ans=0.2 2023-06-19 22:15:47,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=565044.0, ans=0.07 2023-06-19 22:15:56,997 INFO [train.py:996] (2/4) Epoch 4, batch 2700, loss[loss=0.2433, simple_loss=0.3092, pruned_loss=0.08867, over 21792.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3302, pruned_loss=0.1012, over 4278347.81 frames. 
], batch size: 316, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:16:23,691 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:16:39,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 3.006e+02 3.494e+02 4.497e+02 9.129e+02, threshold=6.988e+02, percent-clipped=4.0 2023-06-19 22:16:50,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=565224.0, ans=0.125 2023-06-19 22:17:13,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=22.5 2023-06-19 22:17:19,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=565284.0, ans=0.125 2023-06-19 22:17:40,835 INFO [train.py:996] (2/4) Epoch 4, batch 2750, loss[loss=0.2837, simple_loss=0.3484, pruned_loss=0.1095, over 17280.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3304, pruned_loss=0.1013, over 4278988.10 frames. ], batch size: 60, lr: 8.48e-03, grad_scale: 32.0 2023-06-19 22:17:41,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=565404.0, ans=0.0 2023-06-19 22:18:10,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=565464.0, ans=0.95 2023-06-19 22:19:03,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=565584.0, ans=0.125 2023-06-19 22:19:09,013 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:19:19,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=565644.0, ans=0.025 2023-06-19 22:19:32,221 INFO [train.py:996] (2/4) Epoch 4, batch 2800, loss[loss=0.2568, simple_loss=0.2665, pruned_loss=0.1236, over 16781.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3358, pruned_loss=0.1039, over 4275974.26 frames. ], batch size: 61, lr: 8.47e-03, grad_scale: 32.0 2023-06-19 22:20:17,430 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 3.042e+02 3.463e+02 4.341e+02 7.810e+02, threshold=6.926e+02, percent-clipped=4.0 2023-06-19 22:20:19,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=565824.0, ans=0.1 2023-06-19 22:20:42,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-19 22:21:13,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=565944.0, ans=0.5 2023-06-19 22:21:16,503 INFO [train.py:996] (2/4) Epoch 4, batch 2850, loss[loss=0.3059, simple_loss=0.3732, pruned_loss=0.1193, over 21402.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3346, pruned_loss=0.1044, over 4266686.20 frames. 
], batch size: 507, lr: 8.47e-03, grad_scale: 32.0 2023-06-19 22:21:35,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=566064.0, ans=0.015 2023-06-19 22:21:38,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=566064.0, ans=0.1 2023-06-19 22:22:17,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=566184.0, ans=0.125 2023-06-19 22:22:20,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=566184.0, ans=0.125 2023-06-19 22:22:30,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=566184.0, ans=0.1 2023-06-19 22:22:59,636 INFO [train.py:996] (2/4) Epoch 4, batch 2900, loss[loss=0.2291, simple_loss=0.3012, pruned_loss=0.07846, over 21673.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3337, pruned_loss=0.104, over 4272216.55 frames. ], batch size: 263, lr: 8.47e-03, grad_scale: 16.0 2023-06-19 22:23:42,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=566424.0, ans=0.125 2023-06-19 22:23:45,065 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 2.998e+02 3.695e+02 4.530e+02 8.664e+02, threshold=7.390e+02, percent-clipped=3.0 2023-06-19 22:23:45,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=566424.0, ans=0.125 2023-06-19 22:24:42,838 INFO [train.py:996] (2/4) Epoch 4, batch 2950, loss[loss=0.2688, simple_loss=0.3671, pruned_loss=0.08522, over 20857.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3372, pruned_loss=0.1042, over 4279842.80 frames. ], batch size: 607, lr: 8.47e-03, grad_scale: 16.0 2023-06-19 22:24:46,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=566604.0, ans=0.125 2023-06-19 22:25:04,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.37 vs. limit=15.0 2023-06-19 22:25:35,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=566724.0, ans=0.125 2023-06-19 22:25:37,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=566724.0, ans=0.05 2023-06-19 22:26:25,964 INFO [train.py:996] (2/4) Epoch 4, batch 3000, loss[loss=0.2618, simple_loss=0.3436, pruned_loss=0.09006, over 21439.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.341, pruned_loss=0.105, over 4280843.69 frames. ], batch size: 194, lr: 8.47e-03, grad_scale: 16.0 2023-06-19 22:26:25,965 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-19 22:26:43,403 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2637, simple_loss=0.3577, pruned_loss=0.08486, over 1796401.00 frames. 
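Note: the loss and clipping figures logged above are internally consistent in two ways. Each reported loss equals simple_loss * 0.5 + pruned_loss (e.g. the Epoch 4 validation entry: 0.5 * 0.3577 + 0.08486 = 0.2637), and each clipping threshold equals the logged Clipping_scale (2.0) times the median of the five grad-norm quartile values. The snippet below is a minimal consistency check using numbers copied from the entries above; the helper names are illustrative only and are not icefall or k2 APIs.

# Consistency check against values copied from this log.
# These helpers are illustrative; they are not part of icefall/k2.

def combined_loss(simple_loss, pruned_loss, simple_loss_scale=0.5):
    # Reported loss is consistent with simple_loss_scale * simple_loss + pruned_loss.
    return simple_loss_scale * simple_loss + pruned_loss

def clip_threshold(grad_norm_quartiles, clipping_scale=2.0):
    # Logged threshold is consistent with clipping_scale * median grad-norm
    # (the third of the five logged quartile values).
    return clipping_scale * grad_norm_quartiles[2]

# Epoch 4 validation entry: loss=0.2637, simple_loss=0.3577, pruned_loss=0.08486
print(round(combined_loss(0.3577, 0.08486), 4))             # 0.2637

# Grad-norm quartiles 2.248e+02 3.065e+02 3.685e+02 4.308e+02 7.209e+02, threshold=7.369e+02
print(clip_threshold([224.8, 306.5, 368.5, 430.8, 720.9]))  # 737.0, matching 7.369e+02 up to display rounding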
2023-06-19 22:26:43,403 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-19 22:27:08,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=566904.0, ans=0.1 2023-06-19 22:27:29,299 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.065e+02 3.685e+02 4.308e+02 7.209e+02, threshold=7.369e+02, percent-clipped=0.0 2023-06-19 22:28:20,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=567144.0, ans=0.125 2023-06-19 22:28:27,666 INFO [train.py:996] (2/4) Epoch 4, batch 3050, loss[loss=0.2247, simple_loss=0.3099, pruned_loss=0.06974, over 21840.00 frames. ], tot_loss[loss=0.2755, simple_loss=0.3427, pruned_loss=0.1042, over 4280401.92 frames. ], batch size: 282, lr: 8.46e-03, grad_scale: 16.0 2023-06-19 22:29:03,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=567264.0, ans=0.125 2023-06-19 22:30:12,634 INFO [train.py:996] (2/4) Epoch 4, batch 3100, loss[loss=0.3353, simple_loss=0.4271, pruned_loss=0.1217, over 21268.00 frames. ], tot_loss[loss=0.2746, simple_loss=0.3424, pruned_loss=0.1034, over 4286722.61 frames. ], batch size: 548, lr: 8.46e-03, grad_scale: 16.0 2023-06-19 22:30:19,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=567504.0, ans=0.125 2023-06-19 22:30:52,802 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 3.250e+02 3.985e+02 4.690e+02 7.522e+02, threshold=7.970e+02, percent-clipped=1.0 2023-06-19 22:31:03,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=567624.0, ans=0.0 2023-06-19 22:31:22,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-06-19 22:31:25,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=567684.0, ans=0.0 2023-06-19 22:31:59,528 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-19 22:32:00,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=567744.0, ans=0.1 2023-06-19 22:32:03,258 INFO [train.py:996] (2/4) Epoch 4, batch 3150, loss[loss=0.2809, simple_loss=0.3495, pruned_loss=0.1061, over 21826.00 frames. ], tot_loss[loss=0.2759, simple_loss=0.3439, pruned_loss=0.1039, over 4285448.63 frames. ], batch size: 282, lr: 8.46e-03, grad_scale: 16.0 2023-06-19 22:32:08,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=567804.0, ans=0.07 2023-06-19 22:32:57,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=567924.0, ans=0.0 2023-06-19 22:33:05,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=567984.0, ans=0.0 2023-06-19 22:33:48,472 INFO [train.py:996] (2/4) Epoch 4, batch 3200, loss[loss=0.1991, simple_loss=0.2721, pruned_loss=0.06305, over 21294.00 frames. 
], tot_loss[loss=0.277, simple_loss=0.3456, pruned_loss=0.1042, over 4291250.91 frames. ], batch size: 159, lr: 8.46e-03, grad_scale: 32.0 2023-06-19 22:34:34,214 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.110e+02 3.486e+02 4.566e+02 1.016e+03, threshold=6.972e+02, percent-clipped=1.0 2023-06-19 22:35:18,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=568344.0, ans=0.125 2023-06-19 22:35:27,858 INFO [train.py:996] (2/4) Epoch 4, batch 3250, loss[loss=0.2881, simple_loss=0.3428, pruned_loss=0.1167, over 21199.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.3469, pruned_loss=0.1063, over 4283256.48 frames. ], batch size: 176, lr: 8.45e-03, grad_scale: 32.0 2023-06-19 22:35:32,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=568404.0, ans=0.0 2023-06-19 22:36:05,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=568464.0, ans=0.0 2023-06-19 22:36:41,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=568584.0, ans=0.1 2023-06-19 22:36:53,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=568644.0, ans=0.125 2023-06-19 22:37:04,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-19 22:37:12,287 INFO [train.py:996] (2/4) Epoch 4, batch 3300, loss[loss=0.2586, simple_loss=0.3149, pruned_loss=0.1011, over 21319.00 frames. ], tot_loss[loss=0.2764, simple_loss=0.3414, pruned_loss=0.1057, over 4276665.62 frames. ], batch size: 144, lr: 8.45e-03, grad_scale: 32.0 2023-06-19 22:37:35,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-19 22:37:57,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 2.879e+02 3.455e+02 4.524e+02 7.307e+02, threshold=6.909e+02, percent-clipped=1.0 2023-06-19 22:38:29,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=568884.0, ans=0.1 2023-06-19 22:38:35,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-19 22:38:46,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=568944.0, ans=0.1 2023-06-19 22:38:53,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=568944.0, ans=0.0 2023-06-19 22:38:55,642 INFO [train.py:996] (2/4) Epoch 4, batch 3350, loss[loss=0.3539, simple_loss=0.4011, pruned_loss=0.1533, over 21478.00 frames. ], tot_loss[loss=0.2785, simple_loss=0.3447, pruned_loss=0.1062, over 4280681.64 frames. ], batch size: 507, lr: 8.45e-03, grad_scale: 32.0 2023-06-19 22:39:28,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. 
limit=12.0 2023-06-19 22:39:31,730 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:39:43,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=569124.0, ans=0.125 2023-06-19 22:40:06,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=569184.0, ans=0.04949747468305833 2023-06-19 22:40:10,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=569184.0, ans=0.0 2023-06-19 22:40:50,351 INFO [train.py:996] (2/4) Epoch 4, batch 3400, loss[loss=0.2544, simple_loss=0.3437, pruned_loss=0.08253, over 21921.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.343, pruned_loss=0.1059, over 4280515.44 frames. ], batch size: 372, lr: 8.45e-03, grad_scale: 16.0 2023-06-19 22:41:07,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=569364.0, ans=0.125 2023-06-19 22:41:36,999 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 3.071e+02 3.735e+02 4.641e+02 6.693e+02, threshold=7.470e+02, percent-clipped=0.0 2023-06-19 22:41:54,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=569484.0, ans=0.125 2023-06-19 22:41:56,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=569484.0, ans=0.125 2023-06-19 22:42:04,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=569484.0, ans=0.0 2023-06-19 22:42:15,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=22.5 2023-06-19 22:42:29,574 INFO [train.py:996] (2/4) Epoch 4, batch 3450, loss[loss=0.2559, simple_loss=0.322, pruned_loss=0.09486, over 21721.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3367, pruned_loss=0.1042, over 4269944.18 frames. ], batch size: 316, lr: 8.45e-03, grad_scale: 16.0 2023-06-19 22:42:30,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.30 vs. limit=15.0 2023-06-19 22:43:06,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=569664.0, ans=0.2 2023-06-19 22:43:14,212 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:43:34,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=569784.0, ans=0.125 2023-06-19 22:44:15,179 INFO [train.py:996] (2/4) Epoch 4, batch 3500, loss[loss=0.2847, simple_loss=0.3369, pruned_loss=0.1163, over 21175.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.3469, pruned_loss=0.1091, over 4270408.72 frames. 
], batch size: 608, lr: 8.44e-03, grad_scale: 16.0 2023-06-19 22:45:02,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=570024.0, ans=0.1 2023-06-19 22:45:03,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 3.084e+02 3.677e+02 4.361e+02 8.360e+02, threshold=7.354e+02, percent-clipped=5.0 2023-06-19 22:45:10,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=570024.0, ans=0.0 2023-06-19 22:45:42,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=570144.0, ans=0.125 2023-06-19 22:46:00,060 INFO [train.py:996] (2/4) Epoch 4, batch 3550, loss[loss=0.2864, simple_loss=0.3581, pruned_loss=0.1073, over 21458.00 frames. ], tot_loss[loss=0.2865, simple_loss=0.3518, pruned_loss=0.1106, over 4268772.76 frames. ], batch size: 194, lr: 8.44e-03, grad_scale: 16.0 2023-06-19 22:46:19,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=570204.0, ans=0.2 2023-06-19 22:46:23,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=570264.0, ans=0.0 2023-06-19 22:46:49,650 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:46:49,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=570324.0, ans=0.0 2023-06-19 22:47:40,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=570444.0, ans=0.125 2023-06-19 22:47:51,381 INFO [train.py:996] (2/4) Epoch 4, batch 3600, loss[loss=0.2486, simple_loss=0.3048, pruned_loss=0.09617, over 21853.00 frames. ], tot_loss[loss=0.2847, simple_loss=0.3476, pruned_loss=0.1109, over 4271829.85 frames. ], batch size: 317, lr: 8.44e-03, grad_scale: 32.0 2023-06-19 22:47:54,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=570504.0, ans=0.125 2023-06-19 22:47:55,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=570504.0, ans=0.0 2023-06-19 22:48:26,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=570624.0, ans=0.2 2023-06-19 22:48:29,404 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.447e+02 3.242e+02 3.839e+02 4.789e+02 9.292e+02, threshold=7.677e+02, percent-clipped=2.0 2023-06-19 22:48:43,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=570624.0, ans=0.0 2023-06-19 22:49:34,924 INFO [train.py:996] (2/4) Epoch 4, batch 3650, loss[loss=0.2196, simple_loss=0.3063, pruned_loss=0.06651, over 21763.00 frames. ], tot_loss[loss=0.2844, simple_loss=0.347, pruned_loss=0.1109, over 4269751.93 frames. 
], batch size: 247, lr: 8.44e-03, grad_scale: 16.0 2023-06-19 22:49:37,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=570804.0, ans=0.1 2023-06-19 22:50:47,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=571044.0, ans=0.0 2023-06-19 22:51:14,675 INFO [train.py:996] (2/4) Epoch 4, batch 3700, loss[loss=0.3185, simple_loss=0.3783, pruned_loss=0.1294, over 21608.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3464, pruned_loss=0.1098, over 4270602.99 frames. ], batch size: 471, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:51:16,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=571104.0, ans=0.125 2023-06-19 22:51:47,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2023-06-19 22:51:52,886 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.752e+02 3.200e+02 3.601e+02 6.077e+02, threshold=6.399e+02, percent-clipped=0.0 2023-06-19 22:52:17,910 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 22:52:24,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=571284.0, ans=0.125 2023-06-19 22:52:27,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=571284.0, ans=0.02 2023-06-19 22:52:57,351 INFO [train.py:996] (2/4) Epoch 4, batch 3750, loss[loss=0.1939, simple_loss=0.2646, pruned_loss=0.06161, over 21602.00 frames. ], tot_loss[loss=0.2794, simple_loss=0.343, pruned_loss=0.1079, over 4278791.11 frames. ], batch size: 230, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:54:40,578 INFO [train.py:996] (2/4) Epoch 4, batch 3800, loss[loss=0.2964, simple_loss=0.3505, pruned_loss=0.1211, over 21765.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3409, pruned_loss=0.1058, over 4284821.01 frames. ], batch size: 298, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:55:21,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=571824.0, ans=0.1 2023-06-19 22:55:25,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=571824.0, ans=0.2 2023-06-19 22:55:27,673 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.895e+02 2.805e+02 3.314e+02 3.828e+02 7.886e+02, threshold=6.628e+02, percent-clipped=5.0 2023-06-19 22:56:23,703 INFO [train.py:996] (2/4) Epoch 4, batch 3850, loss[loss=0.2639, simple_loss=0.316, pruned_loss=0.1059, over 21661.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3393, pruned_loss=0.1065, over 4284508.28 frames. ], batch size: 417, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:56:37,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.42 vs. 
limit=15.0 2023-06-19 22:56:47,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=572064.0, ans=0.04949747468305833 2023-06-19 22:57:22,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=572184.0, ans=0.0 2023-06-19 22:57:39,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=572184.0, ans=0.07 2023-06-19 22:58:02,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=572244.0, ans=0.0 2023-06-19 22:58:06,828 INFO [train.py:996] (2/4) Epoch 4, batch 3900, loss[loss=0.3004, simple_loss=0.3472, pruned_loss=0.1268, over 21773.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3337, pruned_loss=0.1054, over 4281462.32 frames. ], batch size: 441, lr: 8.43e-03, grad_scale: 16.0 2023-06-19 22:58:19,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=572304.0, ans=0.2 2023-06-19 22:58:43,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=8.0 2023-06-19 22:58:55,584 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.958e+02 3.677e+02 4.804e+02 9.279e+02, threshold=7.354e+02, percent-clipped=7.0 2023-06-19 22:58:56,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=572424.0, ans=0.125 2023-06-19 22:58:58,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=572424.0, ans=0.125 2023-06-19 22:59:15,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=572484.0, ans=0.1 2023-06-19 22:59:51,735 INFO [train.py:996] (2/4) Epoch 4, batch 3950, loss[loss=0.232, simple_loss=0.3386, pruned_loss=0.06276, over 19688.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3327, pruned_loss=0.1026, over 4275198.24 frames. ], batch size: 703, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:00:36,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=572724.0, ans=0.125 2023-06-19 23:00:41,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=572724.0, ans=0.5 2023-06-19 23:01:30,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=572844.0, ans=0.1 2023-06-19 23:01:34,294 INFO [train.py:996] (2/4) Epoch 4, batch 4000, loss[loss=0.2115, simple_loss=0.272, pruned_loss=0.07549, over 21577.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3263, pruned_loss=0.09866, over 4271606.41 frames. 
], batch size: 263, lr: 8.42e-03, grad_scale: 32.0 2023-06-19 23:01:38,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=572904.0, ans=0.125 2023-06-19 23:01:50,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=572964.0, ans=0.125 2023-06-19 23:01:52,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-06-19 23:02:21,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=573024.0, ans=0.125 2023-06-19 23:02:22,421 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.071e+02 2.603e+02 3.194e+02 3.964e+02 9.151e+02, threshold=6.387e+02, percent-clipped=1.0 2023-06-19 23:02:35,088 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:02:55,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=15.0 2023-06-19 23:03:08,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=573144.0, ans=0.95 2023-06-19 23:03:18,134 INFO [train.py:996] (2/4) Epoch 4, batch 4050, loss[loss=0.2406, simple_loss=0.3099, pruned_loss=0.08569, over 21825.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3247, pruned_loss=0.09685, over 4268867.26 frames. ], batch size: 118, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:03:32,257 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-19 23:03:46,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=573264.0, ans=0.125 2023-06-19 23:03:49,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=573264.0, ans=0.125 2023-06-19 23:04:00,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-19 23:04:57,151 INFO [train.py:996] (2/4) Epoch 4, batch 4100, loss[loss=0.2683, simple_loss=0.339, pruned_loss=0.09877, over 21855.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3265, pruned_loss=0.09723, over 4267198.05 frames. ], batch size: 316, lr: 8.42e-03, grad_scale: 16.0 2023-06-19 23:05:06,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=573504.0, ans=0.125 2023-06-19 23:05:17,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=573564.0, ans=0.1 2023-06-19 23:05:43,938 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.55 vs. 
limit=15.0 2023-06-19 23:05:46,194 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.060e+02 2.845e+02 3.334e+02 4.002e+02 7.963e+02, threshold=6.669e+02, percent-clipped=0.0 2023-06-19 23:06:22,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=573744.0, ans=0.0 2023-06-19 23:06:40,772 INFO [train.py:996] (2/4) Epoch 4, batch 4150, loss[loss=0.2225, simple_loss=0.3051, pruned_loss=0.06996, over 21876.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3255, pruned_loss=0.09397, over 4270548.93 frames. ], batch size: 373, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:06:56,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=573804.0, ans=0.2 2023-06-19 23:07:25,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=573924.0, ans=0.125 2023-06-19 23:07:38,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=573924.0, ans=0.125 2023-06-19 23:08:04,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. limit=10.0 2023-06-19 23:08:25,558 INFO [train.py:996] (2/4) Epoch 4, batch 4200, loss[loss=0.2588, simple_loss=0.3295, pruned_loss=0.09407, over 21594.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3276, pruned_loss=0.09435, over 4265523.13 frames. ], batch size: 414, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:08:26,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-19 23:09:22,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=574224.0, ans=0.1 2023-06-19 23:09:26,363 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.745e+02 2.689e+02 3.288e+02 4.795e+02 7.055e+02, threshold=6.577e+02, percent-clipped=3.0 2023-06-19 23:09:28,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=574224.0, ans=0.015 2023-06-19 23:09:46,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=574284.0, ans=0.2 2023-06-19 23:10:19,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. limit=6.0 2023-06-19 23:10:19,731 INFO [train.py:996] (2/4) Epoch 4, batch 4250, loss[loss=0.3042, simple_loss=0.3667, pruned_loss=0.1209, over 21318.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3329, pruned_loss=0.09608, over 4271103.55 frames. 
], batch size: 176, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:10:25,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=574404.0, ans=0.07 2023-06-19 23:10:36,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=574464.0, ans=0.0 2023-06-19 23:10:51,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=574464.0, ans=0.2 2023-06-19 23:11:48,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=574644.0, ans=0.0 2023-06-19 23:12:06,319 INFO [train.py:996] (2/4) Epoch 4, batch 4300, loss[loss=0.2655, simple_loss=0.2976, pruned_loss=0.1167, over 20053.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3407, pruned_loss=0.0995, over 4270837.76 frames. ], batch size: 704, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:12:51,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=574824.0, ans=0.2 2023-06-19 23:12:53,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=574824.0, ans=0.0 2023-06-19 23:12:57,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 2.886e+02 3.415e+02 4.755e+02 8.316e+02, threshold=6.829e+02, percent-clipped=3.0 2023-06-19 23:13:29,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.33 vs. limit=22.5 2023-06-19 23:13:41,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.85 vs. limit=15.0 2023-06-19 23:14:00,217 INFO [train.py:996] (2/4) Epoch 4, batch 4350, loss[loss=0.2644, simple_loss=0.3212, pruned_loss=0.1039, over 21796.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3376, pruned_loss=0.09772, over 4262790.50 frames. ], batch size: 371, lr: 8.41e-03, grad_scale: 16.0 2023-06-19 23:14:39,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=575124.0, ans=10.0 2023-06-19 23:15:22,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=575244.0, ans=0.125 2023-06-19 23:15:40,588 INFO [train.py:996] (2/4) Epoch 4, batch 4400, loss[loss=0.2477, simple_loss=0.3332, pruned_loss=0.08108, over 21594.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3331, pruned_loss=0.096, over 4266443.87 frames. ], batch size: 263, lr: 8.40e-03, grad_scale: 32.0 2023-06-19 23:15:57,062 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-19 23:16:22,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=575424.0, ans=0.0 2023-06-19 23:16:26,302 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.218e+02 2.820e+02 3.325e+02 4.010e+02 7.079e+02, threshold=6.649e+02, percent-clipped=1.0 2023-06-19 23:16:39,609 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.83 vs. 
limit=6.0 2023-06-19 23:16:50,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=575484.0, ans=0.125 2023-06-19 23:17:25,252 INFO [train.py:996] (2/4) Epoch 4, batch 4450, loss[loss=0.2701, simple_loss=0.3375, pruned_loss=0.1013, over 21773.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3441, pruned_loss=0.1001, over 4271653.92 frames. ], batch size: 124, lr: 8.40e-03, grad_scale: 32.0 2023-06-19 23:17:52,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.70 vs. limit=10.0 2023-06-19 23:18:53,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=575844.0, ans=0.1 2023-06-19 23:18:55,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=575844.0, ans=0.125 2023-06-19 23:18:57,075 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:18:57,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=575844.0, ans=0.125 2023-06-19 23:19:08,178 INFO [train.py:996] (2/4) Epoch 4, batch 4500, loss[loss=0.4067, simple_loss=0.5155, pruned_loss=0.1489, over 19706.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3464, pruned_loss=0.1025, over 4276861.52 frames. ], batch size: 702, lr: 8.40e-03, grad_scale: 16.0 2023-06-19 23:20:01,045 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.953e+02 3.681e+02 4.394e+02 8.500e+02, threshold=7.362e+02, percent-clipped=5.0 2023-06-19 23:20:13,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=576084.0, ans=0.125 2023-06-19 23:20:53,486 INFO [train.py:996] (2/4) Epoch 4, batch 4550, loss[loss=0.3416, simple_loss=0.402, pruned_loss=0.1406, over 21623.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3506, pruned_loss=0.1036, over 4276214.33 frames. ], batch size: 389, lr: 8.40e-03, grad_scale: 16.0 2023-06-19 23:21:18,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=576264.0, ans=0.0 2023-06-19 23:21:51,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=576324.0, ans=10.0 2023-06-19 23:22:38,834 INFO [train.py:996] (2/4) Epoch 4, batch 4600, loss[loss=0.2706, simple_loss=0.3272, pruned_loss=0.107, over 21280.00 frames. ], tot_loss[loss=0.282, simple_loss=0.3521, pruned_loss=0.1059, over 4280071.14 frames. 
], batch size: 176, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:23:21,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=576624.0, ans=0.0 2023-06-19 23:23:34,660 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 2.974e+02 3.353e+02 4.220e+02 8.842e+02, threshold=6.706e+02, percent-clipped=3.0 2023-06-19 23:23:35,454 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:23:58,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=576744.0, ans=0.125 2023-06-19 23:24:01,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=576744.0, ans=0.125 2023-06-19 23:24:21,984 INFO [train.py:996] (2/4) Epoch 4, batch 4650, loss[loss=0.1383, simple_loss=0.1979, pruned_loss=0.03928, over 16248.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3435, pruned_loss=0.1033, over 4282879.60 frames. ], batch size: 61, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:24:34,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=576804.0, ans=0.125 2023-06-19 23:24:42,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=576864.0, ans=0.0 2023-06-19 23:24:58,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=576924.0, ans=0.2 2023-06-19 23:25:57,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=577044.0, ans=0.125 2023-06-19 23:26:00,220 INFO [train.py:996] (2/4) Epoch 4, batch 4700, loss[loss=0.2317, simple_loss=0.2886, pruned_loss=0.08741, over 22010.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3331, pruned_loss=0.09975, over 4282846.86 frames. ], batch size: 103, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:26:00,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=577104.0, ans=0.035 2023-06-19 23:26:07,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=577104.0, ans=0.1 2023-06-19 23:26:07,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-19 23:26:50,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=577224.0, ans=0.125 2023-06-19 23:26:50,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=577224.0, ans=0.0 2023-06-19 23:26:56,731 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.084e+02 3.825e+02 4.515e+02 8.128e+02, threshold=7.651e+02, percent-clipped=5.0 2023-06-19 23:27:24,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=577344.0, ans=0.2 2023-06-19 23:27:42,054 INFO [train.py:996] (2/4) Epoch 4, batch 4750, loss[loss=0.2465, simple_loss=0.3085, pruned_loss=0.09231, over 21522.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.328, pruned_loss=0.1001, over 4288465.45 frames. 
], batch size: 548, lr: 8.39e-03, grad_scale: 16.0 2023-06-19 23:27:52,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=577404.0, ans=0.0 2023-06-19 23:28:15,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=577464.0, ans=0.025 2023-06-19 23:28:24,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=577464.0, ans=0.125 2023-06-19 23:28:51,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=577584.0, ans=10.0 2023-06-19 23:28:56,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=577584.0, ans=0.0 2023-06-19 23:29:27,876 INFO [train.py:996] (2/4) Epoch 4, batch 4800, loss[loss=0.2797, simple_loss=0.3226, pruned_loss=0.1184, over 20262.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3274, pruned_loss=0.1006, over 4283970.37 frames. ], batch size: 703, lr: 8.39e-03, grad_scale: 32.0 2023-06-19 23:29:47,255 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.34 vs. limit=10.0 2023-06-19 23:29:49,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=577764.0, ans=0.04949747468305833 2023-06-19 23:30:05,301 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:30:22,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=577824.0, ans=0.0 2023-06-19 23:30:23,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.43 vs. limit=15.0 2023-06-19 23:30:25,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 3.016e+02 3.604e+02 4.520e+02 9.140e+02, threshold=7.207e+02, percent-clipped=2.0 2023-06-19 23:31:03,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=577944.0, ans=0.125 2023-06-19 23:31:11,098 INFO [train.py:996] (2/4) Epoch 4, batch 4850, loss[loss=0.259, simple_loss=0.3254, pruned_loss=0.0963, over 21818.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.327, pruned_loss=0.1001, over 4283658.59 frames. ], batch size: 351, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:32:08,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-19 23:32:34,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=578244.0, ans=0.1 2023-06-19 23:32:53,796 INFO [train.py:996] (2/4) Epoch 4, batch 4900, loss[loss=0.3202, simple_loss=0.3797, pruned_loss=0.1304, over 21513.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3289, pruned_loss=0.1012, over 4279299.73 frames. 
], batch size: 471, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:33:14,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=578364.0, ans=0.0 2023-06-19 23:33:29,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=578364.0, ans=0.125 2023-06-19 23:33:40,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.whiten.whitening_limit, batch_count=578364.0, ans=15.0 2023-06-19 23:33:50,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.034e+02 3.075e+02 3.679e+02 4.552e+02 8.349e+02, threshold=7.359e+02, percent-clipped=3.0 2023-06-19 23:34:37,044 INFO [train.py:996] (2/4) Epoch 4, batch 4950, loss[loss=0.2202, simple_loss=0.3082, pruned_loss=0.06615, over 21200.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3328, pruned_loss=0.09968, over 4277915.16 frames. ], batch size: 159, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:34:54,924 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-19 23:35:27,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-19 23:35:28,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-19 23:36:05,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=578844.0, ans=0.1 2023-06-19 23:36:17,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=578904.0, ans=0.1 2023-06-19 23:36:19,067 INFO [train.py:996] (2/4) Epoch 4, batch 5000, loss[loss=0.1899, simple_loss=0.2552, pruned_loss=0.06227, over 15594.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3319, pruned_loss=0.09611, over 4266557.69 frames. ], batch size: 60, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:36:36,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=578904.0, ans=0.125 2023-06-19 23:37:13,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=579024.0, ans=0.0 2023-06-19 23:37:15,619 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.766e+02 3.352e+02 4.422e+02 7.725e+02, threshold=6.703e+02, percent-clipped=2.0 2023-06-19 23:37:55,414 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-19 23:38:01,100 INFO [train.py:996] (2/4) Epoch 4, batch 5050, loss[loss=0.2544, simple_loss=0.3152, pruned_loss=0.09679, over 21260.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3334, pruned_loss=0.09704, over 4267761.68 frames. ], batch size: 176, lr: 8.38e-03, grad_scale: 32.0 2023-06-19 23:39:42,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=579504.0, ans=0.0 2023-06-19 23:39:43,724 INFO [train.py:996] (2/4) Epoch 4, batch 5100, loss[loss=0.2808, simple_loss=0.3376, pruned_loss=0.1121, over 21697.00 frames. 
], tot_loss[loss=0.2647, simple_loss=0.3337, pruned_loss=0.09783, over 4271914.59 frames. ], batch size: 112, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:40:39,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.860e+02 3.323e+02 3.950e+02 6.797e+02, threshold=6.645e+02, percent-clipped=1.0 2023-06-19 23:41:24,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=579744.0, ans=0.125 2023-06-19 23:41:26,673 INFO [train.py:996] (2/4) Epoch 4, batch 5150, loss[loss=0.2665, simple_loss=0.3183, pruned_loss=0.1074, over 21322.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3335, pruned_loss=0.09962, over 4280513.62 frames. ], batch size: 143, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:41:31,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.00 vs. limit=15.0 2023-06-19 23:42:30,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=12.0 2023-06-19 23:43:11,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.02 vs. limit=5.0 2023-06-19 23:43:16,535 INFO [train.py:996] (2/4) Epoch 4, batch 5200, loss[loss=0.2524, simple_loss=0.3442, pruned_loss=0.08028, over 21714.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3357, pruned_loss=0.09998, over 4282016.62 frames. ], batch size: 247, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:44:10,734 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 2.847e+02 3.708e+02 4.367e+02 7.934e+02, threshold=7.417e+02, percent-clipped=2.0 2023-06-19 23:45:00,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=580404.0, ans=0.125 2023-06-19 23:45:01,067 INFO [train.py:996] (2/4) Epoch 4, batch 5250, loss[loss=0.2887, simple_loss=0.3667, pruned_loss=0.1053, over 21779.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3379, pruned_loss=0.09779, over 4280070.75 frames. ], batch size: 371, lr: 8.37e-03, grad_scale: 32.0 2023-06-19 23:45:03,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=580404.0, ans=0.0 2023-06-19 23:45:58,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=580584.0, ans=0.125 2023-06-19 23:46:08,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-06-19 23:46:39,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-06-19 23:46:41,737 INFO [train.py:996] (2/4) Epoch 4, batch 5300, loss[loss=0.2643, simple_loss=0.3293, pruned_loss=0.09964, over 21882.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3379, pruned_loss=0.09889, over 4290918.04 frames. 
], batch size: 332, lr: 8.36e-03, grad_scale: 32.0 2023-06-19 23:46:52,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=580704.0, ans=0.125 2023-06-19 23:46:56,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=580704.0, ans=0.125 2023-06-19 23:47:08,602 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.259e-03 2023-06-19 23:47:34,115 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.860e+02 3.383e+02 4.031e+02 8.552e+02, threshold=6.767e+02, percent-clipped=2.0 2023-06-19 23:47:36,259 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:48:03,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.12 vs. limit=15.0 2023-06-19 23:48:23,148 INFO [train.py:996] (2/4) Epoch 4, batch 5350, loss[loss=0.3202, simple_loss=0.3524, pruned_loss=0.144, over 21809.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3385, pruned_loss=0.09994, over 4281235.12 frames. ], batch size: 508, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:48:25,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=581004.0, ans=0.125 2023-06-19 23:49:25,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=581184.0, ans=0.125 2023-06-19 23:49:31,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=581184.0, ans=0.125 2023-06-19 23:49:46,111 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:50:10,528 INFO [train.py:996] (2/4) Epoch 4, batch 5400, loss[loss=0.2707, simple_loss=0.3359, pruned_loss=0.1027, over 21834.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.337, pruned_loss=0.1012, over 4283655.35 frames. ], batch size: 351, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:50:11,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=581304.0, ans=0.2 2023-06-19 23:50:12,698 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-19 23:50:47,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=581364.0, ans=0.125 2023-06-19 23:51:04,841 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.927e+02 3.139e+02 3.601e+02 4.345e+02 9.321e+02, threshold=7.202e+02, percent-clipped=3.0 2023-06-19 23:51:54,869 INFO [train.py:996] (2/4) Epoch 4, batch 5450, loss[loss=0.2977, simple_loss=0.3974, pruned_loss=0.09895, over 21759.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3365, pruned_loss=0.09925, over 4285346.68 frames. ], batch size: 298, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:51:57,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. 
limit=22.5 2023-06-19 23:52:12,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=581604.0, ans=0.125 2023-06-19 23:52:14,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-19 23:52:39,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=581724.0, ans=0.125 2023-06-19 23:53:25,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=581844.0, ans=0.125 2023-06-19 23:53:44,859 INFO [train.py:996] (2/4) Epoch 4, batch 5500, loss[loss=0.2111, simple_loss=0.3239, pruned_loss=0.04913, over 21164.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3397, pruned_loss=0.09461, over 4282404.66 frames. ], batch size: 548, lr: 8.36e-03, grad_scale: 16.0 2023-06-19 23:54:10,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=581964.0, ans=0.1 2023-06-19 23:54:17,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=581964.0, ans=0.125 2023-06-19 23:54:33,619 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.703e+02 3.148e+02 3.931e+02 6.952e+02, threshold=6.296e+02, percent-clipped=0.0 2023-06-19 23:55:03,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=582084.0, ans=0.0 2023-06-19 23:55:08,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.93 vs. limit=15.0 2023-06-19 23:55:20,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=582144.0, ans=0.125 2023-06-19 23:55:30,391 INFO [train.py:996] (2/4) Epoch 4, batch 5550, loss[loss=0.1903, simple_loss=0.2714, pruned_loss=0.05463, over 21094.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3382, pruned_loss=0.09203, over 4274785.07 frames. ], batch size: 159, lr: 8.35e-03, grad_scale: 16.0 2023-06-19 23:56:20,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=582324.0, ans=0.05 2023-06-19 23:56:22,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=582324.0, ans=0.125 2023-06-19 23:56:31,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=582324.0, ans=0.2 2023-06-19 23:57:00,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=582444.0, ans=0.1 2023-06-19 23:57:19,430 INFO [train.py:996] (2/4) Epoch 4, batch 5600, loss[loss=0.3303, simple_loss=0.4146, pruned_loss=0.123, over 21679.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3359, pruned_loss=0.08903, over 4267579.26 frames. 
], batch size: 389, lr: 8.35e-03, grad_scale: 32.0 2023-06-19 23:57:21,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=582504.0, ans=0.125 2023-06-19 23:58:12,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 2.737e+02 3.310e+02 4.006e+02 7.274e+02, threshold=6.621e+02, percent-clipped=1.0 2023-06-19 23:58:36,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=582684.0, ans=0.0 2023-06-19 23:59:01,256 INFO [train.py:996] (2/4) Epoch 4, batch 5650, loss[loss=0.247, simple_loss=0.3166, pruned_loss=0.08871, over 21788.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3395, pruned_loss=0.09177, over 4268397.25 frames. ], batch size: 298, lr: 8.35e-03, grad_scale: 32.0 2023-06-19 23:59:03,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=582804.0, ans=0.2 2023-06-19 23:59:21,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=582864.0, ans=0.125 2023-06-19 23:59:36,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=582864.0, ans=0.125 2023-06-19 23:59:52,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=582924.0, ans=0.125 2023-06-20 00:00:39,065 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=22.5 2023-06-20 00:00:40,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=583044.0, ans=0.5 2023-06-20 00:00:43,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=583104.0, ans=0.0 2023-06-20 00:00:44,950 INFO [train.py:996] (2/4) Epoch 4, batch 5700, loss[loss=0.2933, simple_loss=0.3753, pruned_loss=0.1057, over 21592.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3387, pruned_loss=0.09476, over 4277508.11 frames. ], batch size: 509, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 00:00:54,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=583104.0, ans=0.1 2023-06-20 00:01:02,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=583104.0, ans=0.125 2023-06-20 00:01:07,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=583164.0, ans=0.0 2023-06-20 00:01:38,509 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.072e+02 3.794e+02 4.480e+02 7.487e+02, threshold=7.588e+02, percent-clipped=5.0 2023-06-20 00:01:53,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=583284.0, ans=0.125 2023-06-20 00:02:29,595 INFO [train.py:996] (2/4) Epoch 4, batch 5750, loss[loss=0.2131, simple_loss=0.3018, pruned_loss=0.06218, over 21642.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3334, pruned_loss=0.09077, over 4267319.23 frames. 
], batch size: 247, lr: 8.35e-03, grad_scale: 32.0 2023-06-20 00:02:30,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-20 00:03:12,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=583524.0, ans=0.125 2023-06-20 00:04:13,606 INFO [train.py:996] (2/4) Epoch 4, batch 5800, loss[loss=0.2651, simple_loss=0.351, pruned_loss=0.08958, over 21772.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3293, pruned_loss=0.0878, over 4256481.45 frames. ], batch size: 282, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:04:54,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=583824.0, ans=0.0 2023-06-20 00:05:02,495 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.569e+02 2.603e+02 3.108e+02 3.966e+02 5.463e+02, threshold=6.216e+02, percent-clipped=0.0 2023-06-20 00:05:43,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=583944.0, ans=0.0 2023-06-20 00:05:53,804 INFO [train.py:996] (2/4) Epoch 4, batch 5850, loss[loss=0.2445, simple_loss=0.3299, pruned_loss=0.07952, over 21475.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3262, pruned_loss=0.08301, over 4269067.19 frames. ], batch size: 507, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:06:37,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=584124.0, ans=0.125 2023-06-20 00:06:43,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=584124.0, ans=0.125 2023-06-20 00:06:48,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=584124.0, ans=0.0 2023-06-20 00:07:00,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=584184.0, ans=0.07 2023-06-20 00:07:19,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=584244.0, ans=0.2 2023-06-20 00:07:21,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=584244.0, ans=0.0 2023-06-20 00:07:29,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=584244.0, ans=0.125 2023-06-20 00:07:36,931 INFO [train.py:996] (2/4) Epoch 4, batch 5900, loss[loss=0.2552, simple_loss=0.3044, pruned_loss=0.103, over 20234.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3183, pruned_loss=0.07769, over 4264787.33 frames. ], batch size: 703, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:08:23,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. 
limit=22.5 2023-06-20 00:08:29,715 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.358e+02 2.549e+02 3.049e+02 3.679e+02 6.495e+02, threshold=6.098e+02, percent-clipped=1.0 2023-06-20 00:08:59,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=584484.0, ans=0.1 2023-06-20 00:09:28,650 INFO [train.py:996] (2/4) Epoch 4, batch 5950, loss[loss=0.1993, simple_loss=0.2845, pruned_loss=0.05704, over 21763.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3193, pruned_loss=0.08085, over 4275894.92 frames. ], batch size: 247, lr: 8.34e-03, grad_scale: 32.0 2023-06-20 00:09:40,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=584604.0, ans=0.0 2023-06-20 00:09:52,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=584664.0, ans=0.0 2023-06-20 00:11:04,367 INFO [train.py:996] (2/4) Epoch 4, batch 6000, loss[loss=0.226, simple_loss=0.2868, pruned_loss=0.0826, over 21794.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3176, pruned_loss=0.08613, over 4273127.71 frames. ], batch size: 102, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:11:04,368 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 00:11:26,268 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2686, simple_loss=0.3646, pruned_loss=0.08628, over 1796401.00 frames. 2023-06-20 00:11:26,269 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 00:12:19,541 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.849e+02 3.273e+02 3.960e+02 7.085e+02, threshold=6.546e+02, percent-clipped=4.0 2023-06-20 00:12:19,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=585024.0, ans=0.125 2023-06-20 00:12:23,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=585024.0, ans=0.125 2023-06-20 00:12:36,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=585084.0, ans=0.1 2023-06-20 00:12:50,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=585144.0, ans=0.125 2023-06-20 00:13:09,993 INFO [train.py:996] (2/4) Epoch 4, batch 6050, loss[loss=0.2098, simple_loss=0.2843, pruned_loss=0.06764, over 21294.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3132, pruned_loss=0.08744, over 4264680.64 frames. ], batch size: 176, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:14:08,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=585384.0, ans=0.0 2023-06-20 00:14:12,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-06-20 00:14:50,534 INFO [train.py:996] (2/4) Epoch 4, batch 6100, loss[loss=0.2582, simple_loss=0.3245, pruned_loss=0.09595, over 21937.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3116, pruned_loss=0.0858, over 4276763.65 frames. 
], batch size: 316, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:14:55,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=585504.0, ans=0.125 2023-06-20 00:15:21,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=585624.0, ans=0.1 2023-06-20 00:15:43,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.539e+02 3.084e+02 3.751e+02 6.044e+02, threshold=6.168e+02, percent-clipped=0.0 2023-06-20 00:16:28,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=22.5 2023-06-20 00:16:32,413 INFO [train.py:996] (2/4) Epoch 4, batch 6150, loss[loss=0.2251, simple_loss=0.2897, pruned_loss=0.08022, over 21168.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3145, pruned_loss=0.0886, over 4283776.48 frames. ], batch size: 176, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:16:44,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=585804.0, ans=0.0 2023-06-20 00:17:21,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=585924.0, ans=0.1 2023-06-20 00:18:08,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=586044.0, ans=0.125 2023-06-20 00:18:14,447 INFO [train.py:996] (2/4) Epoch 4, batch 6200, loss[loss=0.2602, simple_loss=0.3109, pruned_loss=0.1048, over 21218.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3176, pruned_loss=0.08997, over 4276717.20 frames. ], batch size: 608, lr: 8.33e-03, grad_scale: 32.0 2023-06-20 00:18:17,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=586104.0, ans=0.0 2023-06-20 00:19:03,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=586224.0, ans=0.125 2023-06-20 00:19:08,246 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.711e+02 3.284e+02 3.994e+02 6.399e+02, threshold=6.568e+02, percent-clipped=2.0 2023-06-20 00:19:59,701 INFO [train.py:996] (2/4) Epoch 4, batch 6250, loss[loss=0.2402, simple_loss=0.3376, pruned_loss=0.07137, over 21617.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3234, pruned_loss=0.09061, over 4277791.85 frames. ], batch size: 230, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:20:02,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.31 vs. limit=15.0 2023-06-20 00:21:27,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=586644.0, ans=0.125 2023-06-20 00:21:43,513 INFO [train.py:996] (2/4) Epoch 4, batch 6300, loss[loss=0.2553, simple_loss=0.3178, pruned_loss=0.09639, over 21674.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3275, pruned_loss=0.08962, over 4282718.75 frames. 
], batch size: 263, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:21:52,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=586704.0, ans=0.125 2023-06-20 00:22:22,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.60 vs. limit=22.5 2023-06-20 00:22:44,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=586824.0, ans=0.0 2023-06-20 00:22:45,838 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 2.688e+02 3.149e+02 3.968e+02 6.842e+02, threshold=6.299e+02, percent-clipped=2.0 2023-06-20 00:22:47,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=586824.0, ans=0.1 2023-06-20 00:23:26,105 INFO [train.py:996] (2/4) Epoch 4, batch 6350, loss[loss=0.2151, simple_loss=0.3327, pruned_loss=0.04871, over 20834.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3326, pruned_loss=0.09386, over 4282933.47 frames. ], batch size: 607, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:23:48,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=587064.0, ans=0.0 2023-06-20 00:24:39,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=587184.0, ans=0.0 2023-06-20 00:24:51,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=587244.0, ans=0.0 2023-06-20 00:24:57,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.31 vs. limit=15.0 2023-06-20 00:25:16,758 INFO [train.py:996] (2/4) Epoch 4, batch 6400, loss[loss=0.3146, simple_loss=0.3724, pruned_loss=0.1284, over 21348.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3406, pruned_loss=0.09861, over 4274076.50 frames. 
], batch size: 176, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:25:24,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=587304.0, ans=0.125 2023-06-20 00:25:53,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587364.0, ans=0.1 2023-06-20 00:26:06,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=587424.0, ans=0.0 2023-06-20 00:26:10,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=587424.0, ans=0.04949747468305833 2023-06-20 00:26:11,106 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.295e+02 3.771e+02 4.525e+02 8.192e+02, threshold=7.543e+02, percent-clipped=2.0 2023-06-20 00:26:28,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=587484.0, ans=0.125 2023-06-20 00:26:51,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=587544.0, ans=0.125 2023-06-20 00:26:51,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=587544.0, ans=0.0 2023-06-20 00:27:05,330 INFO [train.py:996] (2/4) Epoch 4, batch 6450, loss[loss=0.2103, simple_loss=0.2923, pruned_loss=0.06418, over 21684.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3439, pruned_loss=0.0991, over 4269831.18 frames. ], batch size: 332, lr: 8.32e-03, grad_scale: 32.0 2023-06-20 00:27:15,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=587604.0, ans=0.07 2023-06-20 00:27:27,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=587664.0, ans=0.125 2023-06-20 00:27:31,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-20 00:27:33,849 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:27:33,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=587664.0, ans=0.0 2023-06-20 00:27:35,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=587664.0, ans=0.125 2023-06-20 00:27:39,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.40 vs. limit=22.5 2023-06-20 00:27:47,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=587724.0, ans=0.1 2023-06-20 00:27:50,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=587724.0, ans=0.125 2023-06-20 00:28:09,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=587784.0, ans=0.0 2023-06-20 00:28:48,575 INFO [train.py:996] (2/4) Epoch 4, batch 6500, loss[loss=0.3252, simple_loss=0.3886, pruned_loss=0.1309, over 21402.00 frames. 
], tot_loss[loss=0.2654, simple_loss=0.3352, pruned_loss=0.09779, over 4261219.34 frames. ], batch size: 471, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 00:29:06,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.36 vs. limit=10.0 2023-06-20 00:29:35,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=588024.0, ans=0.2 2023-06-20 00:29:35,992 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.671e+02 3.231e+02 3.777e+02 5.375e+02, threshold=6.462e+02, percent-clipped=0.0 2023-06-20 00:30:00,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=15.0 2023-06-20 00:30:27,763 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:30:30,248 INFO [train.py:996] (2/4) Epoch 4, batch 6550, loss[loss=0.2312, simple_loss=0.3015, pruned_loss=0.08049, over 21601.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3318, pruned_loss=0.09588, over 4266824.10 frames. ], batch size: 230, lr: 8.31e-03, grad_scale: 32.0 2023-06-20 00:30:55,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=588264.0, ans=0.0 2023-06-20 00:31:09,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=588324.0, ans=15.0 2023-06-20 00:31:28,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=588384.0, ans=0.2 2023-06-20 00:32:13,078 INFO [train.py:996] (2/4) Epoch 4, batch 6600, loss[loss=0.2652, simple_loss=0.3201, pruned_loss=0.1051, over 21633.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.327, pruned_loss=0.09596, over 4266088.95 frames. ], batch size: 298, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:32:47,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=588564.0, ans=0.125 2023-06-20 00:33:01,894 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.719e+02 3.222e+02 3.782e+02 6.837e+02, threshold=6.444e+02, percent-clipped=2.0 2023-06-20 00:33:29,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=588684.0, ans=0.2 2023-06-20 00:33:54,759 INFO [train.py:996] (2/4) Epoch 4, batch 6650, loss[loss=0.2493, simple_loss=0.3055, pruned_loss=0.09658, over 21780.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3213, pruned_loss=0.09264, over 4265149.03 frames. ], batch size: 317, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:34:14,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=588804.0, ans=0.0 2023-06-20 00:35:07,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.84 vs. 
limit=22.5 2023-06-20 00:35:27,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=589044.0, ans=0.125 2023-06-20 00:35:35,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-20 00:35:37,442 INFO [train.py:996] (2/4) Epoch 4, batch 6700, loss[loss=0.2172, simple_loss=0.2735, pruned_loss=0.08043, over 16745.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3169, pruned_loss=0.09204, over 4247131.63 frames. ], batch size: 63, lr: 8.31e-03, grad_scale: 16.0 2023-06-20 00:35:53,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.04 vs. limit=6.0 2023-06-20 00:36:21,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=589224.0, ans=0.125 2023-06-20 00:36:26,052 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.970e+02 2.795e+02 3.323e+02 4.034e+02 6.039e+02, threshold=6.647e+02, percent-clipped=0.0 2023-06-20 00:37:18,669 INFO [train.py:996] (2/4) Epoch 4, batch 6750, loss[loss=0.3033, simple_loss=0.3392, pruned_loss=0.1337, over 21385.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3152, pruned_loss=0.09299, over 4259781.92 frames. ], batch size: 473, lr: 8.30e-03, grad_scale: 16.0 2023-06-20 00:37:40,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=589464.0, ans=0.125 2023-06-20 00:37:52,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.73 vs. limit=10.0 2023-06-20 00:37:53,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=589464.0, ans=0.0 2023-06-20 00:38:04,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=589524.0, ans=0.1 2023-06-20 00:38:07,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=589524.0, ans=0.0 2023-06-20 00:38:15,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=589584.0, ans=0.125 2023-06-20 00:38:28,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=589584.0, ans=0.125 2023-06-20 00:38:51,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=589644.0, ans=0.1 2023-06-20 00:38:54,423 INFO [train.py:996] (2/4) Epoch 4, batch 6800, loss[loss=0.2715, simple_loss=0.3244, pruned_loss=0.1093, over 21353.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3167, pruned_loss=0.09501, over 4262216.68 frames. 
], batch size: 143, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 00:38:58,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=589704.0, ans=0.2 2023-06-20 00:39:40,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=589824.0, ans=0.1 2023-06-20 00:39:43,469 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.844e+02 3.168e+02 3.952e+02 7.008e+02, threshold=6.337e+02, percent-clipped=1.0 2023-06-20 00:40:35,919 INFO [train.py:996] (2/4) Epoch 4, batch 6850, loss[loss=0.2387, simple_loss=0.2878, pruned_loss=0.09482, over 21343.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3152, pruned_loss=0.09668, over 4268090.60 frames. ], batch size: 177, lr: 8.30e-03, grad_scale: 32.0 2023-06-20 00:41:18,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=590124.0, ans=0.0 2023-06-20 00:41:42,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=590184.0, ans=0.125 2023-06-20 00:42:06,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=590244.0, ans=0.125 2023-06-20 00:42:14,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=590244.0, ans=0.95 2023-06-20 00:42:19,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=590304.0, ans=0.125 2023-06-20 00:42:20,518 INFO [train.py:996] (2/4) Epoch 4, batch 6900, loss[loss=0.2195, simple_loss=0.3083, pruned_loss=0.06533, over 21617.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3175, pruned_loss=0.09689, over 4273077.69 frames. ], batch size: 263, lr: 8.30e-03, grad_scale: 16.0 2023-06-20 00:42:39,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.27 vs. limit=15.0 2023-06-20 00:42:49,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=590364.0, ans=0.125 2023-06-20 00:42:52,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.22 vs. limit=8.0 2023-06-20 00:43:22,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 3.145e+02 3.688e+02 5.056e+02 7.443e+02, threshold=7.376e+02, percent-clipped=5.0 2023-06-20 00:43:31,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-20 00:43:39,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=590484.0, ans=0.2 2023-06-20 00:44:02,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=590604.0, ans=0.1 2023-06-20 00:44:03,277 INFO [train.py:996] (2/4) Epoch 4, batch 6950, loss[loss=0.3052, simple_loss=0.368, pruned_loss=0.1212, over 21664.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3196, pruned_loss=0.09454, over 4270154.06 frames. 
], batch size: 351, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:44:24,628 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-20 00:45:28,105 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 00:45:50,583 INFO [train.py:996] (2/4) Epoch 4, batch 7000, loss[loss=0.29, simple_loss=0.3439, pruned_loss=0.1181, over 15778.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3228, pruned_loss=0.09721, over 4264437.22 frames. ], batch size: 61, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:45:54,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=590904.0, ans=0.07 2023-06-20 00:46:09,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=590964.0, ans=10.0 2023-06-20 00:46:46,593 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.182e+02 3.062e+02 3.465e+02 4.392e+02 8.171e+02, threshold=6.929e+02, percent-clipped=2.0 2023-06-20 00:47:10,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=591144.0, ans=0.0 2023-06-20 00:47:10,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=591144.0, ans=22.5 2023-06-20 00:47:32,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=591204.0, ans=0.5 2023-06-20 00:47:33,157 INFO [train.py:996] (2/4) Epoch 4, batch 7050, loss[loss=0.2372, simple_loss=0.2896, pruned_loss=0.0924, over 21793.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3195, pruned_loss=0.09553, over 4273041.80 frames. ], batch size: 102, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:47:42,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=591204.0, ans=0.125 2023-06-20 00:47:50,871 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-20 00:49:11,726 INFO [train.py:996] (2/4) Epoch 4, batch 7100, loss[loss=0.2549, simple_loss=0.3353, pruned_loss=0.08727, over 21804.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3247, pruned_loss=0.09735, over 4268636.66 frames. ], batch size: 371, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:49:17,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-20 00:50:07,542 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.896e+02 2.760e+02 3.326e+02 4.324e+02 6.991e+02, threshold=6.652e+02, percent-clipped=1.0 2023-06-20 00:50:10,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=591684.0, ans=0.125 2023-06-20 00:50:53,175 INFO [train.py:996] (2/4) Epoch 4, batch 7150, loss[loss=0.2979, simple_loss=0.3951, pruned_loss=0.1004, over 19781.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3203, pruned_loss=0.09395, over 4261890.89 frames. 
], batch size: 703, lr: 8.29e-03, grad_scale: 16.0 2023-06-20 00:51:53,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=591984.0, ans=0.125 2023-06-20 00:52:30,757 INFO [train.py:996] (2/4) Epoch 4, batch 7200, loss[loss=0.2709, simple_loss=0.3199, pruned_loss=0.1109, over 21568.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3269, pruned_loss=0.09876, over 4263789.14 frames. ], batch size: 415, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:53:31,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.719e+02 3.108e+02 3.931e+02 6.174e+02, threshold=6.217e+02, percent-clipped=0.0 2023-06-20 00:53:35,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=592284.0, ans=0.0 2023-06-20 00:54:12,867 INFO [train.py:996] (2/4) Epoch 4, batch 7250, loss[loss=0.239, simple_loss=0.2859, pruned_loss=0.09606, over 21282.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3209, pruned_loss=0.09828, over 4270051.23 frames. ], batch size: 551, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:54:26,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=592404.0, ans=0.1 2023-06-20 00:54:43,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=592464.0, ans=0.0 2023-06-20 00:55:38,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=592644.0, ans=0.0 2023-06-20 00:55:49,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=592644.0, ans=0.0 2023-06-20 00:55:55,958 INFO [train.py:996] (2/4) Epoch 4, batch 7300, loss[loss=0.2619, simple_loss=0.3104, pruned_loss=0.1067, over 21888.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3145, pruned_loss=0.09683, over 4272495.90 frames. 
], batch size: 373, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:56:05,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=592704.0, ans=0.0 2023-06-20 00:56:13,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=592704.0, ans=0.2 2023-06-20 00:56:42,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=592824.0, ans=0.0 2023-06-20 00:56:42,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=592824.0, ans=0.1 2023-06-20 00:56:53,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=592824.0, ans=0.125 2023-06-20 00:56:58,547 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.938e+02 3.597e+02 4.532e+02 8.618e+02, threshold=7.193e+02, percent-clipped=4.0 2023-06-20 00:57:23,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=592944.0, ans=0.125 2023-06-20 00:57:24,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=592944.0, ans=0.0 2023-06-20 00:57:24,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=592944.0, ans=0.125 2023-06-20 00:57:45,799 INFO [train.py:996] (2/4) Epoch 4, batch 7350, loss[loss=0.2285, simple_loss=0.3034, pruned_loss=0.07674, over 21101.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3126, pruned_loss=0.09754, over 4271819.77 frames. ], batch size: 607, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:57:55,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=593004.0, ans=0.0 2023-06-20 00:58:54,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=593184.0, ans=0.125 2023-06-20 00:59:27,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=593244.0, ans=0.2 2023-06-20 00:59:31,747 INFO [train.py:996] (2/4) Epoch 4, batch 7400, loss[loss=0.2422, simple_loss=0.322, pruned_loss=0.08117, over 21919.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3176, pruned_loss=0.09877, over 4264472.27 frames. ], batch size: 317, lr: 8.28e-03, grad_scale: 32.0 2023-06-20 00:59:32,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=593304.0, ans=0.125 2023-06-20 00:59:47,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-20 00:59:59,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-06-20 01:00:28,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.139e+02 3.006e+02 3.626e+02 4.126e+02 7.462e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-20 01:00:51,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. 
limit=15.0 2023-06-20 01:00:58,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-20 01:01:15,474 INFO [train.py:996] (2/4) Epoch 4, batch 7450, loss[loss=0.2413, simple_loss=0.3013, pruned_loss=0.09065, over 21420.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3155, pruned_loss=0.09719, over 4258272.50 frames. ], batch size: 131, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:01:23,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=593604.0, ans=0.125 2023-06-20 01:01:29,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=593604.0, ans=0.1 2023-06-20 01:02:12,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=593724.0, ans=0.1 2023-06-20 01:02:32,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=593784.0, ans=0.0 2023-06-20 01:02:42,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=593844.0, ans=0.125 2023-06-20 01:02:58,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-06-20 01:03:05,781 INFO [train.py:996] (2/4) Epoch 4, batch 7500, loss[loss=0.2541, simple_loss=0.3445, pruned_loss=0.08182, over 21242.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3213, pruned_loss=0.09904, over 4263198.96 frames. ], batch size: 176, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:03:39,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=593964.0, ans=0.125 2023-06-20 01:03:49,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=594024.0, ans=0.0 2023-06-20 01:04:09,791 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.163e+02 3.635e+02 4.766e+02 7.864e+02, threshold=7.270e+02, percent-clipped=2.0 2023-06-20 01:04:12,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.72 vs. limit=10.0 2023-06-20 01:04:40,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=594144.0, ans=0.1 2023-06-20 01:04:40,937 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-20 01:04:42,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=594144.0, ans=0.125 2023-06-20 01:04:51,506 INFO [train.py:996] (2/4) Epoch 4, batch 7550, loss[loss=0.2583, simple_loss=0.3469, pruned_loss=0.0848, over 21648.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3284, pruned_loss=0.09732, over 4272111.84 frames. 
], batch size: 414, lr: 8.27e-03, grad_scale: 16.0 2023-06-20 01:06:05,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=594384.0, ans=0.0 2023-06-20 01:06:33,682 INFO [train.py:996] (2/4) Epoch 4, batch 7600, loss[loss=0.2986, simple_loss=0.3518, pruned_loss=0.1227, over 21833.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3286, pruned_loss=0.09621, over 4279531.04 frames. ], batch size: 441, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:06:34,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=594504.0, ans=0.125 2023-06-20 01:06:48,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=594564.0, ans=0.0 2023-06-20 01:07:10,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-20 01:07:27,134 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.700e+02 3.107e+02 3.746e+02 5.626e+02, threshold=6.215e+02, percent-clipped=0.0 2023-06-20 01:07:52,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=594684.0, ans=0.1 2023-06-20 01:08:17,615 INFO [train.py:996] (2/4) Epoch 4, batch 7650, loss[loss=0.3072, simple_loss=0.3475, pruned_loss=0.1334, over 21621.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3284, pruned_loss=0.09895, over 4283362.75 frames. ], batch size: 471, lr: 8.27e-03, grad_scale: 32.0 2023-06-20 01:09:01,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=594924.0, ans=0.125 2023-06-20 01:09:08,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=594924.0, ans=0.125 2023-06-20 01:10:03,419 INFO [train.py:996] (2/4) Epoch 4, batch 7700, loss[loss=0.3447, simple_loss=0.394, pruned_loss=0.1477, over 21355.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3313, pruned_loss=0.1015, over 4288217.18 frames. ], batch size: 176, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:10:33,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=595164.0, ans=0.07 2023-06-20 01:10:52,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=595224.0, ans=0.1 2023-06-20 01:11:08,877 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.825e+02 3.572e+02 4.383e+02 7.085e+02, threshold=7.144e+02, percent-clipped=3.0 2023-06-20 01:11:11,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=595284.0, ans=0.0 2023-06-20 01:11:11,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=12.0 2023-06-20 01:11:54,628 INFO [train.py:996] (2/4) Epoch 4, batch 7750, loss[loss=0.2304, simple_loss=0.3164, pruned_loss=0.07222, over 21222.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3392, pruned_loss=0.102, over 4281323.93 frames. 
], batch size: 176, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:11:55,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=595404.0, ans=0.0 2023-06-20 01:11:58,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=595404.0, ans=0.0 2023-06-20 01:12:07,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=595404.0, ans=0.05 2023-06-20 01:12:27,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=595464.0, ans=0.2 2023-06-20 01:13:03,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=595584.0, ans=0.2 2023-06-20 01:13:38,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=595644.0, ans=0.125 2023-06-20 01:13:41,044 INFO [train.py:996] (2/4) Epoch 4, batch 7800, loss[loss=0.2107, simple_loss=0.2626, pruned_loss=0.07937, over 20888.00 frames. ], tot_loss[loss=0.2724, simple_loss=0.3397, pruned_loss=0.1025, over 4267231.53 frames. ], batch size: 612, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:14:44,358 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 3.079e+02 3.630e+02 4.586e+02 7.709e+02, threshold=7.261e+02, percent-clipped=1.0 2023-06-20 01:14:57,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-20 01:14:59,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=595884.0, ans=0.07 2023-06-20 01:15:24,384 INFO [train.py:996] (2/4) Epoch 4, batch 7850, loss[loss=0.2446, simple_loss=0.3002, pruned_loss=0.09443, over 21682.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3349, pruned_loss=0.1016, over 4269567.27 frames. ], batch size: 333, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:15:38,709 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-20 01:16:49,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=596244.0, ans=0.2 2023-06-20 01:17:10,735 INFO [train.py:996] (2/4) Epoch 4, batch 7900, loss[loss=0.28, simple_loss=0.3456, pruned_loss=0.1072, over 21425.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3301, pruned_loss=0.1002, over 4269879.34 frames. ], batch size: 471, lr: 8.26e-03, grad_scale: 32.0 2023-06-20 01:18:14,890 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.214e+02 3.696e+02 4.914e+02 8.338e+02, threshold=7.393e+02, percent-clipped=4.0 2023-06-20 01:18:19,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=596484.0, ans=0.125 2023-06-20 01:18:38,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=596544.0, ans=0.2 2023-06-20 01:18:55,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=596604.0, ans=0.0 2023-06-20 01:18:56,173 INFO [train.py:996] (2/4) Epoch 4, batch 7950, loss[loss=0.2587, simple_loss=0.3366, pruned_loss=0.09046, over 21777.00 frames. 
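The [scaling.py:182] lines report the current value ("ans") of ScheduledFloat hyper-parameters (dropout probabilities, skip rates, balancer probabilities, scale minima) at a given batch_count; the values change gradually as training progresses. A minimal sketch of a piecewise-linear schedule of this kind; the breakpoints below are illustrative placeholders, not the schedules configured for these modules:

    def scheduled_float(batch_count, schedule):
        """Piecewise-linear interpolation of a float hyper-parameter.

        schedule: list of (batch_count, value) pairs sorted by batch_count.
        Illustrative only; the real per-module schedules differ.
        """
        b0, v0 = schedule[0]
        if batch_count <= b0:
            return v0
        for b1, v1 in schedule[1:]:
            if batch_count <= b1:
                frac = (batch_count - b0) / (b1 - b0)
                return v0 + frac * (v1 - v0)
            b0, v0 = b1, v1
        return v0

    # Example: a skip-rate that decays from 0.5 to 0.0 over the first 20k batches.
    assert scheduled_float(0, [(0, 0.5), (20000, 0.0)]) == 0.5
    assert scheduled_float(20000, [(0, 0.5), (20000, 0.0)]) == 0.0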
], tot_loss[loss=0.2661, simple_loss=0.333, pruned_loss=0.09959, over 4269199.75 frames. ], batch size: 298, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:19:52,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=596724.0, ans=0.125 2023-06-20 01:20:27,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=596844.0, ans=0.025 2023-06-20 01:20:53,176 INFO [train.py:996] (2/4) Epoch 4, batch 8000, loss[loss=0.3404, simple_loss=0.3848, pruned_loss=0.148, over 21427.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3373, pruned_loss=0.1032, over 4273501.53 frames. ], batch size: 471, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:20:57,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.86 vs. limit=10.0 2023-06-20 01:21:31,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=597024.0, ans=0.2 2023-06-20 01:21:52,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=597024.0, ans=0.0 2023-06-20 01:21:55,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.970e+02 3.290e+02 4.047e+02 5.946e+02, threshold=6.580e+02, percent-clipped=0.0 2023-06-20 01:22:09,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=597084.0, ans=0.125 2023-06-20 01:22:40,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=597144.0, ans=0.125 2023-06-20 01:22:43,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=597144.0, ans=0.1 2023-06-20 01:22:46,659 INFO [train.py:996] (2/4) Epoch 4, batch 8050, loss[loss=0.2065, simple_loss=0.2557, pruned_loss=0.07869, over 21809.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3395, pruned_loss=0.1024, over 4274676.69 frames. ], batch size: 118, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:23:39,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=597324.0, ans=0.0 2023-06-20 01:23:40,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=597384.0, ans=0.2 2023-06-20 01:24:01,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=597384.0, ans=0.125 2023-06-20 01:24:32,722 INFO [train.py:996] (2/4) Epoch 4, batch 8100, loss[loss=0.2558, simple_loss=0.3183, pruned_loss=0.09671, over 21293.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.338, pruned_loss=0.103, over 4275746.54 frames. ], batch size: 143, lr: 8.25e-03, grad_scale: 32.0 2023-06-20 01:24:34,075 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. 
limit=15.0 2023-06-20 01:25:03,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=597564.0, ans=0.2 2023-06-20 01:25:21,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-20 01:25:39,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.366e+02 3.127e+02 3.761e+02 5.016e+02 1.103e+03, threshold=7.523e+02, percent-clipped=9.0 2023-06-20 01:25:55,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=597684.0, ans=0.1 2023-06-20 01:26:05,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=597744.0, ans=10.0 2023-06-20 01:26:20,073 INFO [train.py:996] (2/4) Epoch 4, batch 8150, loss[loss=0.2309, simple_loss=0.3179, pruned_loss=0.07199, over 21646.00 frames. ], tot_loss[loss=0.2778, simple_loss=0.3476, pruned_loss=0.104, over 4278625.11 frames. ], batch size: 247, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:27:15,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=597924.0, ans=0.1 2023-06-20 01:27:30,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=15.44 vs. limit=22.5 2023-06-20 01:27:47,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=598044.0, ans=0.1 2023-06-20 01:28:07,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=598044.0, ans=0.125 2023-06-20 01:28:10,474 INFO [train.py:996] (2/4) Epoch 4, batch 8200, loss[loss=0.2923, simple_loss=0.3446, pruned_loss=0.12, over 21780.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3384, pruned_loss=0.1008, over 4280224.69 frames. ], batch size: 98, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:28:11,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-20 01:28:19,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=598104.0, ans=0.125 2023-06-20 01:28:22,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=598104.0, ans=0.125 2023-06-20 01:28:25,883 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:28:33,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=598164.0, ans=0.125 2023-06-20 01:28:35,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=598164.0, ans=0.125 2023-06-20 01:28:45,953 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:29:05,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. 
limit=15.0 2023-06-20 01:29:13,768 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 2.961e+02 3.420e+02 4.432e+02 7.003e+02, threshold=6.840e+02, percent-clipped=0.0 2023-06-20 01:29:53,782 INFO [train.py:996] (2/4) Epoch 4, batch 8250, loss[loss=0.3094, simple_loss=0.4098, pruned_loss=0.1045, over 20812.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3365, pruned_loss=0.1005, over 4279236.87 frames. ], batch size: 607, lr: 8.24e-03, grad_scale: 32.0 2023-06-20 01:30:49,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=598524.0, ans=0.125 2023-06-20 01:31:37,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=598704.0, ans=0.0 2023-06-20 01:31:38,822 INFO [train.py:996] (2/4) Epoch 4, batch 8300, loss[loss=0.2191, simple_loss=0.2899, pruned_loss=0.07417, over 21545.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3322, pruned_loss=0.09692, over 4276697.76 frames. ], batch size: 195, lr: 8.24e-03, grad_scale: 16.0 2023-06-20 01:32:10,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=598764.0, ans=0.125 2023-06-20 01:32:45,081 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.814e+02 3.368e+02 3.938e+02 8.477e+02, threshold=6.736e+02, percent-clipped=1.0 2023-06-20 01:32:50,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=598884.0, ans=0.125 2023-06-20 01:33:00,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=598944.0, ans=0.2 2023-06-20 01:33:11,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=598944.0, ans=0.125 2023-06-20 01:33:23,492 INFO [train.py:996] (2/4) Epoch 4, batch 8350, loss[loss=0.2602, simple_loss=0.3299, pruned_loss=0.09521, over 21536.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3302, pruned_loss=0.09527, over 4278377.01 frames. 
], batch size: 230, lr: 8.24e-03, grad_scale: 16.0 2023-06-20 01:33:47,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=599004.0, ans=0.2 2023-06-20 01:33:54,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=599064.0, ans=0.1 2023-06-20 01:33:54,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=599064.0, ans=0.125 2023-06-20 01:34:07,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=599064.0, ans=0.0 2023-06-20 01:34:14,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=599124.0, ans=0.125 2023-06-20 01:34:30,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=599184.0, ans=0.125 2023-06-20 01:34:43,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=599184.0, ans=0.125 2023-06-20 01:34:43,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=599184.0, ans=0.0 2023-06-20 01:34:46,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-20 01:35:08,080 INFO [train.py:996] (2/4) Epoch 4, batch 8400, loss[loss=0.3029, simple_loss=0.3766, pruned_loss=0.1146, over 20745.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3272, pruned_loss=0.09246, over 4266852.59 frames. ], batch size: 607, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:35:12,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.31 vs. limit=22.5 2023-06-20 01:35:21,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=599304.0, ans=10.0 2023-06-20 01:35:39,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=599364.0, ans=0.125 2023-06-20 01:36:01,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=599424.0, ans=0.07 2023-06-20 01:36:03,365 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:36:14,378 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.979e+02 2.656e+02 3.140e+02 3.908e+02 6.671e+02, threshold=6.281e+02, percent-clipped=0.0 2023-06-20 01:36:20,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=599484.0, ans=0.2 2023-06-20 01:36:29,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=599544.0, ans=0.0 2023-06-20 01:36:50,937 INFO [train.py:996] (2/4) Epoch 4, batch 8450, loss[loss=0.2383, simple_loss=0.3086, pruned_loss=0.08397, over 21839.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3275, pruned_loss=0.09291, over 4274772.26 frames. 
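Each [train.py:996] summary shows the loss of the current batch ("loss[... over N frames]") next to a running average accumulated over the frames seen so far ("tot_loss[... over M frames]", with M in the millions by this point in the epoch). A minimal sketch of such a frame-weighted running average; the exact bookkeeping in train.py may differ (for example it may also decay old statistics periodically):

    class RunningLoss:
        """Frame-weighted running average, as printed in 'tot_loss[... over N frames.]'."""

        def __init__(self):
            self.loss_sum = 0.0
            self.frames = 0.0

        def update(self, batch_loss, batch_frames):
            self.loss_sum += batch_loss * batch_frames
            self.frames += batch_frames

        @property
        def value(self):
            return self.loss_sum / max(self.frames, 1.0)

    tot = RunningLoss()
    tot.update(0.2383, 21839.0)   # per-batch numbers of the same form as the lines above
    print(f"tot_loss[loss={tot.value:.4}, over {tot.frames:.2f} frames.]")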
], batch size: 298, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:37:25,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=599664.0, ans=0.2 2023-06-20 01:38:09,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=599784.0, ans=0.05 2023-06-20 01:38:34,748 INFO [train.py:996] (2/4) Epoch 4, batch 8500, loss[loss=0.3086, simple_loss=0.3522, pruned_loss=0.1325, over 21702.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3256, pruned_loss=0.09489, over 4274751.88 frames. ], batch size: 351, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:39:22,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=599964.0, ans=0.2 2023-06-20 01:39:39,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=600024.0, ans=0.125 2023-06-20 01:39:41,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=600084.0, ans=0.125 2023-06-20 01:39:44,110 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.032e+02 3.480e+02 4.088e+02 6.738e+02, threshold=6.960e+02, percent-clipped=1.0 2023-06-20 01:39:54,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=600084.0, ans=10.0 2023-06-20 01:39:58,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.85 vs. limit=12.0 2023-06-20 01:39:59,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=600144.0, ans=0.125 2023-06-20 01:40:20,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=600204.0, ans=0.0 2023-06-20 01:40:21,782 INFO [train.py:996] (2/4) Epoch 4, batch 8550, loss[loss=0.2606, simple_loss=0.3314, pruned_loss=0.09496, over 21261.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3295, pruned_loss=0.09745, over 4267149.77 frames. ], batch size: 159, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:40:40,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2023-06-20 01:41:36,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=600384.0, ans=0.125 2023-06-20 01:41:48,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=600384.0, ans=0.125 2023-06-20 01:42:17,800 INFO [train.py:996] (2/4) Epoch 4, batch 8600, loss[loss=0.2831, simple_loss=0.4083, pruned_loss=0.07897, over 20779.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3372, pruned_loss=0.09988, over 4276208.45 frames. ], batch size: 607, lr: 8.23e-03, grad_scale: 16.0 2023-06-20 01:42:32,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.14 vs. 
limit=22.5 2023-06-20 01:42:46,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=600564.0, ans=0.125 2023-06-20 01:42:50,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-20 01:43:09,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=600624.0, ans=0.125 2023-06-20 01:43:15,164 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.082e+02 3.829e+02 4.661e+02 1.059e+03, threshold=7.657e+02, percent-clipped=7.0 2023-06-20 01:43:15,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=600684.0, ans=0.2 2023-06-20 01:43:16,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=600684.0, ans=0.0 2023-06-20 01:43:26,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=600684.0, ans=0.1 2023-06-20 01:43:29,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=600684.0, ans=0.1 2023-06-20 01:43:30,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.52 vs. limit=22.5 2023-06-20 01:43:33,366 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-20 01:43:34,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=600744.0, ans=0.1 2023-06-20 01:44:06,485 INFO [train.py:996] (2/4) Epoch 4, batch 8650, loss[loss=0.2028, simple_loss=0.2787, pruned_loss=0.06343, over 21267.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.3442, pruned_loss=0.1006, over 4275570.38 frames. ], batch size: 131, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:44:19,499 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.30 vs. limit=10.0 2023-06-20 01:44:36,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=600864.0, ans=0.125 2023-06-20 01:44:38,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.99 vs. limit=15.0 2023-06-20 01:44:44,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=600924.0, ans=0.05 2023-06-20 01:45:44,871 INFO [train.py:996] (2/4) Epoch 4, batch 8700, loss[loss=0.2293, simple_loss=0.2894, pruned_loss=0.08463, over 15801.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3372, pruned_loss=0.09735, over 4262120.17 frames. 
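The [scaling.py:962] "Whitening" lines appear when a Whiten module measures how far a group's channel covariance is from a scaled identity and compares that value against its limit (for example "metric=12.14 vs. limit=22.5"). The function below is an illustrative proxy for such a metric, not the exact formula in scaling.py: it equals 1.0 for perfectly white features and grows as the covariance spectrum spreads out.

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        """Illustrative whitening metric for features x of shape (num_frames, num_channels)."""
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.T @ x) / x.shape[0]          # (C, C) channel covariance
        d = cov.shape[0]
        # d * trace(C @ C) / trace(C)^2 equals mean(eig^2) / mean(eig)^2,
        # which is 1.0 when C is a multiple of the identity and larger otherwise.
        return (d * torch.trace(cov @ cov) / torch.trace(cov) ** 2).item()

    x_white = torch.randn(1000, 256)
    print(whitening_metric(x_white))                                   # close to 1.0
    print(whitening_metric(x_white * torch.linspace(0.1, 3.0, 256)))   # noticeably larger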
], batch size: 64, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:46:42,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.831e+02 3.437e+02 4.356e+02 1.035e+03, threshold=6.874e+02, percent-clipped=3.0 2023-06-20 01:47:35,540 INFO [train.py:996] (2/4) Epoch 4, batch 8750, loss[loss=0.2992, simple_loss=0.3558, pruned_loss=0.1213, over 21898.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3311, pruned_loss=0.09813, over 4270415.36 frames. ], batch size: 351, lr: 8.22e-03, grad_scale: 16.0 2023-06-20 01:47:36,209 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:49:10,321 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=22.5 2023-06-20 01:49:15,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.35 vs. limit=15.0 2023-06-20 01:49:22,815 INFO [train.py:996] (2/4) Epoch 4, batch 8800, loss[loss=0.3003, simple_loss=0.3663, pruned_loss=0.1172, over 21246.00 frames. ], tot_loss[loss=0.2739, simple_loss=0.3422, pruned_loss=0.1028, over 4272890.97 frames. ], batch size: 176, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 01:50:11,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=601824.0, ans=0.125 2023-06-20 01:50:20,628 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 2.907e+02 3.416e+02 4.301e+02 7.142e+02, threshold=6.833e+02, percent-clipped=3.0 2023-06-20 01:50:30,095 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:51:03,480 INFO [train.py:996] (2/4) Epoch 4, batch 8850, loss[loss=0.3324, simple_loss=0.4207, pruned_loss=0.122, over 19841.00 frames. ], tot_loss[loss=0.28, simple_loss=0.3493, pruned_loss=0.1053, over 4268363.20 frames. ], batch size: 702, lr: 8.22e-03, grad_scale: 32.0 2023-06-20 01:51:17,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=602004.0, ans=0.125 2023-06-20 01:52:40,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=602244.0, ans=0.95 2023-06-20 01:52:44,282 INFO [train.py:996] (2/4) Epoch 4, batch 8900, loss[loss=0.3747, simple_loss=0.481, pruned_loss=0.1342, over 19749.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.3464, pruned_loss=0.1042, over 4263781.81 frames. 
], batch size: 702, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:52:53,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=602304.0, ans=0.1 2023-06-20 01:52:58,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=602304.0, ans=0.04949747468305833 2023-06-20 01:53:07,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=602364.0, ans=0.125 2023-06-20 01:53:09,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=602364.0, ans=0.125 2023-06-20 01:53:36,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=602424.0, ans=0.025 2023-06-20 01:54:00,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.914e+02 3.498e+02 4.067e+02 9.619e+02, threshold=6.997e+02, percent-clipped=2.0 2023-06-20 01:54:31,844 INFO [train.py:996] (2/4) Epoch 4, batch 8950, loss[loss=0.2342, simple_loss=0.2969, pruned_loss=0.08578, over 21418.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3465, pruned_loss=0.1026, over 4266449.17 frames. ], batch size: 194, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:54:33,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=602604.0, ans=0.125 2023-06-20 01:55:28,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-20 01:55:32,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=602724.0, ans=0.2 2023-06-20 01:56:15,347 INFO [train.py:996] (2/4) Epoch 4, batch 9000, loss[loss=0.2441, simple_loss=0.3127, pruned_loss=0.08775, over 21625.00 frames. ], tot_loss[loss=0.2715, simple_loss=0.3396, pruned_loss=0.1017, over 4272638.21 frames. ], batch size: 332, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:56:15,347 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 01:56:37,872 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2701, simple_loss=0.3695, pruned_loss=0.08531, over 1796401.00 frames. 2023-06-20 01:56:37,873 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 01:57:19,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=602964.0, ans=0.1 2023-06-20 01:57:26,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.42 vs. limit=10.0 2023-06-20 01:57:34,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=603024.0, ans=0.125 2023-06-20 01:57:38,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=603084.0, ans=0.0 2023-06-20 01:57:40,804 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.934e+02 3.477e+02 4.426e+02 7.521e+02, threshold=6.955e+02, percent-clipped=2.0 2023-06-20 01:58:24,281 INFO [train.py:996] (2/4) Epoch 4, batch 9050, loss[loss=0.2891, simple_loss=0.3457, pruned_loss=0.1163, over 21476.00 frames. 
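At batch 9000 the trainer pauses to compute a validation loss over the dev set ([train.py:1019] and [train.py:1028]) and then reports the peak GPU memory so far ([train.py:1029]). A minimal sketch of that pattern, assuming a standard PyTorch dev dataloader and a compute_loss helper that returns (loss, num_frames) per batch; both names are illustrative:

    import torch

    @torch.no_grad()
    def compute_validation_loss(model, dev_loader, compute_loss):
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
        model.train()
        return tot_loss / max(tot_frames, 1.0)

    # After validation the log also reports the high-water mark of GPU memory.
    if torch.cuda.is_available():
        max_mem_mb = torch.cuda.max_memory_allocated() // (1024 * 1024)
        print(f"Maximum memory allocated so far is {max_mem_mb}MB")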
], tot_loss[loss=0.2645, simple_loss=0.3346, pruned_loss=0.09722, over 4275529.22 frames. ], batch size: 194, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 01:58:33,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=603204.0, ans=0.125 2023-06-20 01:59:12,482 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 01:59:31,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=603384.0, ans=0.125 2023-06-20 01:59:41,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=603444.0, ans=0.125 2023-06-20 01:59:43,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-20 01:59:53,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=603444.0, ans=0.05 2023-06-20 01:59:55,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.00 vs. limit=12.0 2023-06-20 02:00:15,332 INFO [train.py:996] (2/4) Epoch 4, batch 9100, loss[loss=0.2575, simple_loss=0.3494, pruned_loss=0.0828, over 21566.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.3384, pruned_loss=0.09843, over 4277161.42 frames. ], batch size: 389, lr: 8.21e-03, grad_scale: 32.0 2023-06-20 02:00:31,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=603564.0, ans=0.0 2023-06-20 02:00:36,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=603564.0, ans=0.125 2023-06-20 02:01:08,279 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.683e+02 3.374e+02 4.242e+02 6.313e+02, threshold=6.748e+02, percent-clipped=0.0 2023-06-20 02:01:14,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=603684.0, ans=0.125 2023-06-20 02:01:43,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=603744.0, ans=0.125 2023-06-20 02:01:52,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=15.0 2023-06-20 02:01:56,037 INFO [train.py:996] (2/4) Epoch 4, batch 9150, loss[loss=0.2516, simple_loss=0.3332, pruned_loss=0.08507, over 21576.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3399, pruned_loss=0.09585, over 4281954.65 frames. 
], batch size: 230, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 02:02:43,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=603924.0, ans=0.04949747468305833 2023-06-20 02:02:50,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=603984.0, ans=0.2 2023-06-20 02:02:52,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=603984.0, ans=0.2 2023-06-20 02:03:08,227 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:03:31,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=604044.0, ans=0.125 2023-06-20 02:03:41,439 INFO [train.py:996] (2/4) Epoch 4, batch 9200, loss[loss=0.297, simple_loss=0.368, pruned_loss=0.1129, over 21747.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.343, pruned_loss=0.09528, over 4282009.54 frames. ], batch size: 332, lr: 8.20e-03, grad_scale: 32.0 2023-06-20 02:04:45,441 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.901e+02 3.630e+02 4.447e+02 7.984e+02, threshold=7.260e+02, percent-clipped=1.0 2023-06-20 02:05:17,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=604344.0, ans=0.0 2023-06-20 02:05:24,874 INFO [train.py:996] (2/4) Epoch 4, batch 9250, loss[loss=0.2769, simple_loss=0.3263, pruned_loss=0.1138, over 21662.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3473, pruned_loss=0.09942, over 4269268.89 frames. ], batch size: 282, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 02:05:29,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=604404.0, ans=0.125 2023-06-20 02:06:26,302 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=15.0 2023-06-20 02:06:53,540 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-20 02:07:06,131 INFO [train.py:996] (2/4) Epoch 4, batch 9300, loss[loss=0.2656, simple_loss=0.3358, pruned_loss=0.09764, over 21844.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3422, pruned_loss=0.09925, over 4270989.42 frames. ], batch size: 317, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 02:07:13,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=604704.0, ans=0.125 2023-06-20 02:08:18,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 3.021e+02 3.641e+02 4.393e+02 8.139e+02, threshold=7.281e+02, percent-clipped=1.0 2023-06-20 02:08:20,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=604884.0, ans=0.0 2023-06-20 02:08:25,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=604884.0, ans=0.125 2023-06-20 02:08:27,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.48 vs. 
limit=10.0 2023-06-20 02:08:39,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.67 vs. limit=10.0 2023-06-20 02:08:40,871 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-20 02:08:46,480 INFO [train.py:996] (2/4) Epoch 4, batch 9350, loss[loss=0.3141, simple_loss=0.3718, pruned_loss=0.1282, over 21606.00 frames. ], tot_loss[loss=0.2743, simple_loss=0.3465, pruned_loss=0.101, over 4269949.65 frames. ], batch size: 263, lr: 8.20e-03, grad_scale: 16.0 2023-06-20 02:10:13,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-20 02:10:17,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=605244.0, ans=0.0 2023-06-20 02:10:30,869 INFO [train.py:996] (2/4) Epoch 4, batch 9400, loss[loss=0.2317, simple_loss=0.2882, pruned_loss=0.08764, over 21596.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3481, pruned_loss=0.1013, over 4270793.82 frames. ], batch size: 247, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:10:46,106 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:11:20,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=605364.0, ans=0.2 2023-06-20 02:11:27,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=605424.0, ans=0.125 2023-06-20 02:11:46,101 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 3.026e+02 3.567e+02 4.359e+02 8.563e+02, threshold=7.134e+02, percent-clipped=2.0 2023-06-20 02:11:49,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=605484.0, ans=0.125 2023-06-20 02:12:13,949 INFO [train.py:996] (2/4) Epoch 4, batch 9450, loss[loss=0.2444, simple_loss=0.3, pruned_loss=0.09439, over 21803.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3392, pruned_loss=0.09979, over 4265966.69 frames. ], batch size: 317, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:12:17,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=605604.0, ans=0.125 2023-06-20 02:12:26,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-20 02:12:48,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=605664.0, ans=0.02 2023-06-20 02:13:15,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=605724.0, ans=0.0 2023-06-20 02:13:15,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.94 vs. 
limit=22.5 2023-06-20 02:13:16,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=605724.0, ans=0.1 2023-06-20 02:13:21,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=605784.0, ans=0.125 2023-06-20 02:13:35,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-06-20 02:13:52,798 INFO [train.py:996] (2/4) Epoch 4, batch 9500, loss[loss=0.2316, simple_loss=0.3125, pruned_loss=0.07532, over 21663.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.331, pruned_loss=0.09733, over 4271437.36 frames. ], batch size: 263, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:15:09,564 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.876e+02 3.483e+02 4.277e+02 8.627e+02, threshold=6.965e+02, percent-clipped=2.0 2023-06-20 02:15:25,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=606144.0, ans=0.2 2023-06-20 02:15:37,596 INFO [train.py:996] (2/4) Epoch 4, batch 9550, loss[loss=0.275, simple_loss=0.3546, pruned_loss=0.09769, over 21758.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3361, pruned_loss=0.1006, over 4271330.44 frames. ], batch size: 247, lr: 8.19e-03, grad_scale: 16.0 2023-06-20 02:15:53,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=606204.0, ans=0.125 2023-06-20 02:16:01,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=606264.0, ans=0.05 2023-06-20 02:16:21,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=606264.0, ans=0.125 2023-06-20 02:16:32,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=606324.0, ans=0.125 2023-06-20 02:16:32,874 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:16:53,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=606384.0, ans=0.0 2023-06-20 02:17:11,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=606444.0, ans=0.0 2023-06-20 02:17:21,051 INFO [train.py:996] (2/4) Epoch 4, batch 9600, loss[loss=0.2912, simple_loss=0.3432, pruned_loss=0.1196, over 21381.00 frames. ], tot_loss[loss=0.2726, simple_loss=0.3386, pruned_loss=0.1033, over 4276752.44 frames. ], batch size: 143, lr: 8.19e-03, grad_scale: 32.0 2023-06-20 02:17:41,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=606564.0, ans=0.125 2023-06-20 02:17:50,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=606564.0, ans=0.0 2023-06-20 02:18:36,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.43 vs. 
limit=22.5 2023-06-20 02:18:36,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.961e+02 3.442e+02 3.920e+02 7.478e+02, threshold=6.885e+02, percent-clipped=1.0 2023-06-20 02:18:55,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-20 02:19:09,539 INFO [train.py:996] (2/4) Epoch 4, batch 9650, loss[loss=0.3088, simple_loss=0.3591, pruned_loss=0.1292, over 21601.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3384, pruned_loss=0.1027, over 4281800.53 frames. ], batch size: 508, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:19:33,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=606804.0, ans=0.0 2023-06-20 02:19:58,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=606924.0, ans=0.0 2023-06-20 02:20:02,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=606924.0, ans=0.0 2023-06-20 02:20:27,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=606984.0, ans=0.05 2023-06-20 02:20:54,122 INFO [train.py:996] (2/4) Epoch 4, batch 9700, loss[loss=0.2416, simple_loss=0.3123, pruned_loss=0.08547, over 21712.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3413, pruned_loss=0.1031, over 4279454.26 frames. ], batch size: 247, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:21:53,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=607224.0, ans=0.125 2023-06-20 02:22:07,446 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 2.962e+02 3.413e+02 3.970e+02 9.096e+02, threshold=6.826e+02, percent-clipped=3.0 2023-06-20 02:22:38,394 INFO [train.py:996] (2/4) Epoch 4, batch 9750, loss[loss=0.2057, simple_loss=0.2638, pruned_loss=0.07379, over 21478.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3344, pruned_loss=0.1015, over 4278483.32 frames. ], batch size: 212, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:23:18,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=607464.0, ans=0.09899494936611666 2023-06-20 02:23:34,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=607524.0, ans=0.125 2023-06-20 02:24:13,846 INFO [train.py:996] (2/4) Epoch 4, batch 9800, loss[loss=0.2787, simple_loss=0.3537, pruned_loss=0.1018, over 21492.00 frames. ], tot_loss[loss=0.268, simple_loss=0.333, pruned_loss=0.1015, over 4271876.76 frames. ], batch size: 131, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:24:59,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=607764.0, ans=0.125 2023-06-20 02:25:06,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=607824.0, ans=0.015 2023-06-20 02:25:15,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.97 vs. limit=15.0 2023-06-20 02:25:26,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.71 vs. 
limit=15.0 2023-06-20 02:25:29,995 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.802e+02 3.178e+02 3.680e+02 5.783e+02, threshold=6.355e+02, percent-clipped=0.0 2023-06-20 02:25:30,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=607884.0, ans=0.125 2023-06-20 02:25:45,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-20 02:25:51,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=607944.0, ans=0.0 2023-06-20 02:25:56,221 INFO [train.py:996] (2/4) Epoch 4, batch 9850, loss[loss=0.2498, simple_loss=0.3059, pruned_loss=0.09687, over 21883.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3299, pruned_loss=0.1016, over 4276256.79 frames. ], batch size: 371, lr: 8.18e-03, grad_scale: 16.0 2023-06-20 02:26:48,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-20 02:27:03,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.05 vs. limit=15.0 2023-06-20 02:27:37,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=608244.0, ans=0.125 2023-06-20 02:27:41,058 INFO [train.py:996] (2/4) Epoch 4, batch 9900, loss[loss=0.2313, simple_loss=0.2928, pruned_loss=0.08488, over 21900.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3268, pruned_loss=0.1013, over 4273115.56 frames. ], batch size: 107, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 02:27:51,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=608304.0, ans=0.015 2023-06-20 02:28:50,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=608424.0, ans=0.125 2023-06-20 02:29:00,252 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.957e+02 3.478e+02 4.825e+02 8.249e+02, threshold=6.956e+02, percent-clipped=2.0 2023-06-20 02:29:09,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=608544.0, ans=0.0 2023-06-20 02:29:31,831 INFO [train.py:996] (2/4) Epoch 4, batch 9950, loss[loss=0.2554, simple_loss=0.3131, pruned_loss=0.09887, over 21092.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3324, pruned_loss=0.1049, over 4277414.54 frames. ], batch size: 143, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 02:29:33,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=608604.0, ans=0.125 2023-06-20 02:30:08,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=22.5 2023-06-20 02:30:43,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5 2023-06-20 02:31:00,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.01 vs. 
limit=15.0 2023-06-20 02:31:23,740 INFO [train.py:996] (2/4) Epoch 4, batch 10000, loss[loss=0.221, simple_loss=0.287, pruned_loss=0.07748, over 21297.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.3277, pruned_loss=0.1026, over 4276052.13 frames. ], batch size: 176, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 02:31:27,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=608904.0, ans=0.02 2023-06-20 02:32:32,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.674e+02 3.222e+02 3.735e+02 9.123e+02, threshold=6.443e+02, percent-clipped=2.0 2023-06-20 02:32:32,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=609084.0, ans=0.0 2023-06-20 02:32:51,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=609144.0, ans=10.0 2023-06-20 02:33:14,594 INFO [train.py:996] (2/4) Epoch 4, batch 10050, loss[loss=0.2045, simple_loss=0.2902, pruned_loss=0.05942, over 21873.00 frames. ], tot_loss[loss=0.2678, simple_loss=0.3302, pruned_loss=0.1027, over 4277699.52 frames. ], batch size: 317, lr: 8.17e-03, grad_scale: 32.0 2023-06-20 02:33:35,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=609204.0, ans=0.125 2023-06-20 02:34:03,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=609324.0, ans=0.5 2023-06-20 02:34:38,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=609444.0, ans=0.0 2023-06-20 02:34:53,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=609444.0, ans=0.0 2023-06-20 02:34:55,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=609444.0, ans=0.125 2023-06-20 02:35:07,530 INFO [train.py:996] (2/4) Epoch 4, batch 10100, loss[loss=0.2763, simple_loss=0.3428, pruned_loss=0.1048, over 21766.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3273, pruned_loss=0.09997, over 4276587.27 frames. ], batch size: 332, lr: 8.17e-03, grad_scale: 16.0 2023-06-20 02:35:08,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=609504.0, ans=0.0 2023-06-20 02:35:21,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=609504.0, ans=0.125 2023-06-20 02:35:22,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=609564.0, ans=0.125 2023-06-20 02:36:11,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.172e+02 3.027e+02 3.628e+02 4.360e+02 7.943e+02, threshold=7.256e+02, percent-clipped=2.0 2023-06-20 02:36:27,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=609744.0, ans=0.2 2023-06-20 02:36:53,350 INFO [train.py:996] (2/4) Epoch 4, batch 10150, loss[loss=0.231, simple_loss=0.3022, pruned_loss=0.07988, over 21023.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3331, pruned_loss=0.1027, over 4278335.31 frames. 
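The grad_scale value printed with every summary line (32.0 in some stretches of this epoch, 16.0 in others) is the loss-scaling factor for mixed-precision training: it is halved when a scaled step produces inf/nan gradients and grows back after a run of good steps. A minimal sketch with torch.cuda.amp; the real training step also combines the simple and pruned losses and applies the gradient clipping sketched earlier:

    import torch

    scaler = torch.cuda.amp.GradScaler(init_scale=32.0)

    def training_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss, _num_frames = compute_loss(model, batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)   # skipped internally if inf/nan gradients are found
        scaler.update()          # halves the scale after a bad step, grows it slowly otherwise
        return scaler.get_scale()   # the value logged as grad_scale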
], batch size: 608, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:37:17,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=609864.0, ans=0.1 2023-06-20 02:37:51,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=609984.0, ans=0.1 2023-06-20 02:38:01,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=609984.0, ans=0.0 2023-06-20 02:38:04,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=609984.0, ans=0.125 2023-06-20 02:38:06,654 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:38:07,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=22.5 2023-06-20 02:38:37,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=610104.0, ans=0.125 2023-06-20 02:38:38,450 INFO [train.py:996] (2/4) Epoch 4, batch 10200, loss[loss=0.2182, simple_loss=0.3072, pruned_loss=0.06462, over 21741.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3322, pruned_loss=0.1002, over 4276541.73 frames. ], batch size: 298, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:38:39,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-20 02:38:45,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=610104.0, ans=0.1 2023-06-20 02:39:02,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=610164.0, ans=0.125 2023-06-20 02:39:09,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=610164.0, ans=0.125 2023-06-20 02:39:09,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=610164.0, ans=0.0 2023-06-20 02:39:47,092 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.39 vs. limit=15.0 2023-06-20 02:39:52,550 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.626e+02 2.503e+02 3.181e+02 4.095e+02 8.895e+02, threshold=6.363e+02, percent-clipped=3.0 2023-06-20 02:40:22,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=610404.0, ans=0.125 2023-06-20 02:40:23,072 INFO [train.py:996] (2/4) Epoch 4, batch 10250, loss[loss=0.2676, simple_loss=0.3652, pruned_loss=0.08494, over 19963.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.325, pruned_loss=0.09301, over 4263133.43 frames. ], batch size: 702, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:40:52,872 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. 
limit=22.5 2023-06-20 02:40:55,769 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:41:08,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=610524.0, ans=0.1 2023-06-20 02:41:27,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=610524.0, ans=0.1 2023-06-20 02:41:30,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=610584.0, ans=0.125 2023-06-20 02:41:49,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=610584.0, ans=0.125 2023-06-20 02:41:57,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=610644.0, ans=0.0 2023-06-20 02:42:02,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=610644.0, ans=0.2 2023-06-20 02:42:09,920 INFO [train.py:996] (2/4) Epoch 4, batch 10300, loss[loss=0.2659, simple_loss=0.3491, pruned_loss=0.09132, over 21735.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3292, pruned_loss=0.09368, over 4272871.84 frames. ], batch size: 332, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:42:28,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=610764.0, ans=0.125 2023-06-20 02:42:47,132 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:43:06,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=12.0 2023-06-20 02:43:07,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=610824.0, ans=0.1 2023-06-20 02:43:11,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-20 02:43:25,985 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.712e+02 3.082e+02 3.756e+02 4.509e+02 8.129e+02, threshold=7.512e+02, percent-clipped=5.0 2023-06-20 02:43:43,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=610944.0, ans=0.125 2023-06-20 02:43:51,336 INFO [train.py:996] (2/4) Epoch 4, batch 10350, loss[loss=0.2916, simple_loss=0.3579, pruned_loss=0.1127, over 21192.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3322, pruned_loss=0.09439, over 4272242.14 frames. 
], batch size: 159, lr: 8.16e-03, grad_scale: 16.0 2023-06-20 02:43:53,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=611004.0, ans=0.125 2023-06-20 02:44:06,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=611064.0, ans=0.1 2023-06-20 02:44:32,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=611064.0, ans=0.125 2023-06-20 02:44:39,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=611124.0, ans=0.2 2023-06-20 02:45:35,490 INFO [train.py:996] (2/4) Epoch 4, batch 10400, loss[loss=0.2315, simple_loss=0.2766, pruned_loss=0.09324, over 20842.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3235, pruned_loss=0.09262, over 4271277.57 frames. ], batch size: 608, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:45:46,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=611304.0, ans=0.07 2023-06-20 02:46:16,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=611364.0, ans=0.0 2023-06-20 02:46:35,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=611424.0, ans=0.125 2023-06-20 02:46:56,620 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.060e+02 3.521e+02 4.304e+02 7.584e+02, threshold=7.042e+02, percent-clipped=1.0 2023-06-20 02:47:21,578 INFO [train.py:996] (2/4) Epoch 4, batch 10450, loss[loss=0.2621, simple_loss=0.3456, pruned_loss=0.08931, over 19978.00 frames. ], tot_loss[loss=0.261, simple_loss=0.329, pruned_loss=0.09649, over 4276368.91 frames. ], batch size: 704, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:47:45,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=611604.0, ans=0.125 2023-06-20 02:47:50,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=611664.0, ans=0.2 2023-06-20 02:48:12,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.98 vs. limit=22.5 2023-06-20 02:48:24,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=611724.0, ans=0.125 2023-06-20 02:49:17,124 INFO [train.py:996] (2/4) Epoch 4, batch 10500, loss[loss=0.2664, simple_loss=0.324, pruned_loss=0.1044, over 21765.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3284, pruned_loss=0.09585, over 4280221.81 frames. 
], batch size: 351, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:49:49,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=611964.0, ans=0.2 2023-06-20 02:50:06,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=612024.0, ans=0.025 2023-06-20 02:50:21,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=612084.0, ans=0.0 2023-06-20 02:50:24,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=612084.0, ans=0.0 2023-06-20 02:50:25,510 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.941e+02 2.864e+02 3.538e+02 4.749e+02 1.100e+03, threshold=7.075e+02, percent-clipped=4.0 2023-06-20 02:50:33,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=612144.0, ans=0.125 2023-06-20 02:51:01,183 INFO [train.py:996] (2/4) Epoch 4, batch 10550, loss[loss=0.2608, simple_loss=0.3163, pruned_loss=0.1027, over 21865.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3238, pruned_loss=0.09603, over 4275689.82 frames. ], batch size: 98, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:51:02,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.25 vs. limit=22.5 2023-06-20 02:51:41,888 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 02:51:51,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=612324.0, ans=0.0 2023-06-20 02:52:25,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-20 02:52:41,244 INFO [train.py:996] (2/4) Epoch 4, batch 10600, loss[loss=0.2348, simple_loss=0.2897, pruned_loss=0.08995, over 21859.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3168, pruned_loss=0.09344, over 4276078.51 frames. ], batch size: 107, lr: 8.15e-03, grad_scale: 32.0 2023-06-20 02:52:47,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=612504.0, ans=0.2 2023-06-20 02:53:14,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=612564.0, ans=0.1 2023-06-20 02:53:38,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=612624.0, ans=0.0 2023-06-20 02:53:45,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-20 02:53:53,408 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.105e+02 2.887e+02 3.753e+02 5.124e+02 8.898e+02, threshold=7.506e+02, percent-clipped=7.0 2023-06-20 02:54:28,421 INFO [train.py:996] (2/4) Epoch 4, batch 10650, loss[loss=0.2079, simple_loss=0.2774, pruned_loss=0.06921, over 21759.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3184, pruned_loss=0.09285, over 4266307.78 frames. 
], batch size: 282, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 02:54:44,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=612804.0, ans=0.125 2023-06-20 02:54:56,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=612864.0, ans=0.2 2023-06-20 02:55:13,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=612924.0, ans=0.125 2023-06-20 02:55:15,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=612924.0, ans=0.05 2023-06-20 02:55:42,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=612984.0, ans=0.0 2023-06-20 02:56:08,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-06-20 02:56:26,527 INFO [train.py:996] (2/4) Epoch 4, batch 10700, loss[loss=0.2396, simple_loss=0.3099, pruned_loss=0.0846, over 21627.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.32, pruned_loss=0.09313, over 4265997.77 frames. ], batch size: 263, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 02:56:28,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=613104.0, ans=0.2 2023-06-20 02:57:04,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-20 02:57:12,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=613224.0, ans=0.125 2023-06-20 02:57:36,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 3.228e+02 3.667e+02 4.550e+02 7.977e+02, threshold=7.334e+02, percent-clipped=1.0 2023-06-20 02:58:11,822 INFO [train.py:996] (2/4) Epoch 4, batch 10750, loss[loss=0.3478, simple_loss=0.426, pruned_loss=0.1348, over 21553.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3304, pruned_loss=0.09725, over 4264835.00 frames. ], batch size: 508, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 02:58:27,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=613464.0, ans=0.125 2023-06-20 02:58:52,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=613524.0, ans=0.0 2023-06-20 02:59:39,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=613584.0, ans=0.0 2023-06-20 02:59:56,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.03 vs. limit=8.0 2023-06-20 02:59:58,081 INFO [train.py:996] (2/4) Epoch 4, batch 10800, loss[loss=0.3784, simple_loss=0.4236, pruned_loss=0.1666, over 21428.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3358, pruned_loss=0.09796, over 4267268.33 frames. 
], batch size: 471, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 03:00:27,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=613764.0, ans=0.0 2023-06-20 03:00:48,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=613824.0, ans=0.125 2023-06-20 03:01:16,786 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 2.918e+02 3.193e+02 3.822e+02 6.360e+02, threshold=6.386e+02, percent-clipped=0.0 2023-06-20 03:01:23,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=613944.0, ans=0.0 2023-06-20 03:01:31,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=613944.0, ans=0.125 2023-06-20 03:01:41,184 INFO [train.py:996] (2/4) Epoch 4, batch 10850, loss[loss=0.2183, simple_loss=0.2859, pruned_loss=0.07541, over 21664.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3361, pruned_loss=0.09813, over 4262474.79 frames. ], batch size: 333, lr: 8.14e-03, grad_scale: 32.0 2023-06-20 03:01:43,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=614004.0, ans=0.125 2023-06-20 03:01:59,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=614064.0, ans=0.125 2023-06-20 03:02:03,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=614064.0, ans=0.125 2023-06-20 03:02:36,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=614124.0, ans=0.125 2023-06-20 03:02:46,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=614184.0, ans=0.125 2023-06-20 03:03:01,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0 2023-06-20 03:03:05,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-20 03:03:13,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=614244.0, ans=0.125 2023-06-20 03:03:24,904 INFO [train.py:996] (2/4) Epoch 4, batch 10900, loss[loss=0.2001, simple_loss=0.2582, pruned_loss=0.07101, over 21234.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3294, pruned_loss=0.09603, over 4260658.79 frames. ], batch size: 549, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:04:43,223 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.644e+02 3.102e+02 4.157e+02 6.653e+02, threshold=6.203e+02, percent-clipped=4.0 2023-06-20 03:05:07,365 INFO [train.py:996] (2/4) Epoch 4, batch 10950, loss[loss=0.2384, simple_loss=0.2945, pruned_loss=0.09112, over 21671.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3254, pruned_loss=0.0946, over 4255401.94 frames. 
], batch size: 248, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:05:13,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=614604.0, ans=0.0 2023-06-20 03:05:44,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=614664.0, ans=0.0 2023-06-20 03:06:37,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=614844.0, ans=0.2 2023-06-20 03:06:49,636 INFO [train.py:996] (2/4) Epoch 4, batch 11000, loss[loss=0.275, simple_loss=0.3336, pruned_loss=0.1082, over 21853.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3228, pruned_loss=0.09514, over 4256226.75 frames. ], batch size: 371, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:07:12,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-20 03:08:03,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=615084.0, ans=0.2 2023-06-20 03:08:07,866 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.675e+02 3.019e+02 3.528e+02 7.831e+02, threshold=6.038e+02, percent-clipped=1.0 2023-06-20 03:08:13,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=615084.0, ans=0.125 2023-06-20 03:08:24,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=615144.0, ans=10.0 2023-06-20 03:08:31,501 INFO [train.py:996] (2/4) Epoch 4, batch 11050, loss[loss=0.232, simple_loss=0.2872, pruned_loss=0.08835, over 21674.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3192, pruned_loss=0.09558, over 4265256.25 frames. ], batch size: 282, lr: 8.13e-03, grad_scale: 32.0 2023-06-20 03:08:32,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-20 03:10:13,743 INFO [train.py:996] (2/4) Epoch 4, batch 11100, loss[loss=0.2395, simple_loss=0.3005, pruned_loss=0.08921, over 21682.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3193, pruned_loss=0.09663, over 4272358.30 frames. ], batch size: 316, lr: 8.13e-03, grad_scale: 16.0 2023-06-20 03:10:22,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=615504.0, ans=0.025 2023-06-20 03:10:27,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-20 03:11:33,206 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.025e+02 3.574e+02 4.736e+02 7.982e+02, threshold=7.148e+02, percent-clipped=11.0 2023-06-20 03:11:47,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=12.0 2023-06-20 03:11:56,231 INFO [train.py:996] (2/4) Epoch 4, batch 11150, loss[loss=0.2691, simple_loss=0.3567, pruned_loss=0.09077, over 21543.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3179, pruned_loss=0.09561, over 4275549.88 frames. 
], batch size: 441, lr: 8.12e-03, grad_scale: 16.0 2023-06-20 03:12:47,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=615924.0, ans=10.0 2023-06-20 03:13:10,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-20 03:13:25,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=616044.0, ans=0.125 2023-06-20 03:13:33,720 INFO [train.py:996] (2/4) Epoch 4, batch 11200, loss[loss=0.2436, simple_loss=0.2922, pruned_loss=0.09748, over 21544.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.317, pruned_loss=0.095, over 4277374.20 frames. ], batch size: 231, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:14:09,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=616164.0, ans=0.0 2023-06-20 03:14:12,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=616164.0, ans=0.125 2023-06-20 03:14:16,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=616224.0, ans=0.1 2023-06-20 03:14:53,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-20 03:14:54,153 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.789e+02 3.395e+02 4.282e+02 6.875e+02, threshold=6.790e+02, percent-clipped=0.0 2023-06-20 03:15:17,025 INFO [train.py:996] (2/4) Epoch 4, batch 11250, loss[loss=0.2527, simple_loss=0.3105, pruned_loss=0.09746, over 21640.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3159, pruned_loss=0.09521, over 4272211.77 frames. ], batch size: 415, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:15:30,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=616404.0, ans=0.125 2023-06-20 03:16:12,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=616524.0, ans=0.125 2023-06-20 03:16:50,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=616644.0, ans=0.1 2023-06-20 03:17:00,627 INFO [train.py:996] (2/4) Epoch 4, batch 11300, loss[loss=0.2193, simple_loss=0.2929, pruned_loss=0.07289, over 21517.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3171, pruned_loss=0.09517, over 4285837.60 frames. 
], batch size: 195, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:17:04,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=616704.0, ans=0.125 2023-06-20 03:17:07,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=616704.0, ans=0.125 2023-06-20 03:17:16,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=616764.0, ans=0.125 2023-06-20 03:17:31,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=616764.0, ans=0.0 2023-06-20 03:17:42,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=616764.0, ans=0.125 2023-06-20 03:17:45,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=616824.0, ans=0.04949747468305833 2023-06-20 03:18:01,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=616824.0, ans=0.0 2023-06-20 03:18:21,775 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.889e+02 3.588e+02 4.478e+02 6.707e+02, threshold=7.176e+02, percent-clipped=0.0 2023-06-20 03:18:45,403 INFO [train.py:996] (2/4) Epoch 4, batch 11350, loss[loss=0.2622, simple_loss=0.3369, pruned_loss=0.09375, over 21288.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3195, pruned_loss=0.09511, over 4285293.73 frames. ], batch size: 159, lr: 8.12e-03, grad_scale: 32.0 2023-06-20 03:18:50,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.17 vs. limit=15.0 2023-06-20 03:20:12,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=617244.0, ans=0.05 2023-06-20 03:20:26,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=617244.0, ans=0.0 2023-06-20 03:20:41,069 INFO [train.py:996] (2/4) Epoch 4, batch 11400, loss[loss=0.2473, simple_loss=0.3343, pruned_loss=0.08017, over 21863.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.328, pruned_loss=0.09895, over 4278644.87 frames. 
], batch size: 317, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 03:21:01,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=617364.0, ans=0.0 2023-06-20 03:21:16,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=617364.0, ans=0.125 2023-06-20 03:21:20,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=617424.0, ans=0.1 2023-06-20 03:21:53,022 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.300e+02 4.007e+02 5.041e+02 8.202e+02, threshold=8.013e+02, percent-clipped=6.0 2023-06-20 03:22:00,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=617544.0, ans=0.2 2023-06-20 03:22:23,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=617544.0, ans=0.035 2023-06-20 03:22:26,664 INFO [train.py:996] (2/4) Epoch 4, batch 11450, loss[loss=0.2467, simple_loss=0.3214, pruned_loss=0.08594, over 20057.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3291, pruned_loss=0.09739, over 4272327.54 frames. ], batch size: 704, lr: 8.11e-03, grad_scale: 16.0 2023-06-20 03:22:36,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=617604.0, ans=0.05 2023-06-20 03:23:09,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=617724.0, ans=0.1 2023-06-20 03:23:18,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-20 03:23:20,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=22.5 2023-06-20 03:23:21,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=617724.0, ans=15.0 2023-06-20 03:23:57,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=617844.0, ans=0.125 2023-06-20 03:24:05,918 INFO [train.py:996] (2/4) Epoch 4, batch 11500, loss[loss=0.2132, simple_loss=0.2977, pruned_loss=0.06435, over 21478.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3322, pruned_loss=0.09866, over 4275734.96 frames. 
], batch size: 194, lr: 8.11e-03, grad_scale: 16.0 2023-06-20 03:24:56,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=618024.0, ans=0.125 2023-06-20 03:24:58,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=618024.0, ans=0.125 2023-06-20 03:25:11,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=618084.0, ans=0.2 2023-06-20 03:25:18,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=618084.0, ans=0.125 2023-06-20 03:25:19,772 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.106e+02 3.529e+02 4.777e+02 9.700e+02, threshold=7.057e+02, percent-clipped=3.0 2023-06-20 03:25:47,102 INFO [train.py:996] (2/4) Epoch 4, batch 11550, loss[loss=0.3637, simple_loss=0.4727, pruned_loss=0.1273, over 21205.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3389, pruned_loss=0.09917, over 4276511.84 frames. ], batch size: 548, lr: 8.11e-03, grad_scale: 16.0 2023-06-20 03:26:06,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=618204.0, ans=0.1 2023-06-20 03:26:08,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=618264.0, ans=0.04949747468305833 2023-06-20 03:26:38,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=618324.0, ans=0.125 2023-06-20 03:26:40,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=618324.0, ans=0.0 2023-06-20 03:26:43,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=618324.0, ans=0.125 2023-06-20 03:27:12,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-20 03:27:37,867 INFO [train.py:996] (2/4) Epoch 4, batch 11600, loss[loss=0.3497, simple_loss=0.4625, pruned_loss=0.1185, over 21613.00 frames. ], tot_loss[loss=0.277, simple_loss=0.3516, pruned_loss=0.1012, over 4277417.54 frames. 
], batch size: 441, lr: 8.11e-03, grad_scale: 32.0 2023-06-20 03:28:07,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=618564.0, ans=0.0 2023-06-20 03:28:14,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=618564.0, ans=0.0 2023-06-20 03:28:31,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=618624.0, ans=0.125 2023-06-20 03:28:55,730 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.978e+02 3.550e+02 4.282e+02 6.763e+02, threshold=7.099e+02, percent-clipped=1.0 2023-06-20 03:28:57,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=618684.0, ans=0.125 2023-06-20 03:29:17,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=618744.0, ans=0.0 2023-06-20 03:29:22,222 INFO [train.py:996] (2/4) Epoch 4, batch 11650, loss[loss=0.2635, simple_loss=0.3192, pruned_loss=0.1039, over 20120.00 frames. ], tot_loss[loss=0.2787, simple_loss=0.3556, pruned_loss=0.101, over 4267908.87 frames. ], batch size: 704, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:30:59,663 INFO [train.py:996] (2/4) Epoch 4, batch 11700, loss[loss=0.2425, simple_loss=0.2964, pruned_loss=0.09433, over 21426.00 frames. ], tot_loss[loss=0.2728, simple_loss=0.3456, pruned_loss=0.1, over 4269420.03 frames. ], batch size: 195, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:31:35,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=619164.0, ans=0.125 2023-06-20 03:31:37,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=619224.0, ans=0.0 2023-06-20 03:31:37,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=619224.0, ans=0.05 2023-06-20 03:32:14,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=619284.0, ans=0.0 2023-06-20 03:32:17,022 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.962e+02 3.618e+02 4.622e+02 7.851e+02, threshold=7.236e+02, percent-clipped=2.0 2023-06-20 03:32:49,963 INFO [train.py:996] (2/4) Epoch 4, batch 11750, loss[loss=0.249, simple_loss=0.3131, pruned_loss=0.09244, over 21719.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3361, pruned_loss=0.09855, over 4276052.46 frames. ], batch size: 282, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:33:43,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.62 vs. limit=6.0 2023-06-20 03:34:16,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.14 vs. limit=22.5 2023-06-20 03:34:18,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.77 vs. 
limit=22.5 2023-06-20 03:34:19,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=619644.0, ans=0.125 2023-06-20 03:34:35,089 INFO [train.py:996] (2/4) Epoch 4, batch 11800, loss[loss=0.2862, simple_loss=0.378, pruned_loss=0.09716, over 21271.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3397, pruned_loss=0.101, over 4270597.34 frames. ], batch size: 549, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:34:39,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=619704.0, ans=0.125 2023-06-20 03:34:51,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=619764.0, ans=0.125 2023-06-20 03:35:14,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=619824.0, ans=0.125 2023-06-20 03:35:29,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=619824.0, ans=22.5 2023-06-20 03:35:30,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-20 03:35:38,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=619884.0, ans=0.125 2023-06-20 03:35:40,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=619884.0, ans=0.125 2023-06-20 03:35:51,523 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.341e+02 3.018e+02 3.704e+02 4.604e+02 8.251e+02, threshold=7.407e+02, percent-clipped=4.0 2023-06-20 03:35:57,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=619944.0, ans=0.0 2023-06-20 03:36:11,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=619944.0, ans=0.1 2023-06-20 03:36:19,828 INFO [train.py:996] (2/4) Epoch 4, batch 11850, loss[loss=0.2617, simple_loss=0.3439, pruned_loss=0.08974, over 21648.00 frames. ], tot_loss[loss=0.2709, simple_loss=0.3408, pruned_loss=0.1005, over 4276673.60 frames. ], batch size: 263, lr: 8.10e-03, grad_scale: 32.0 2023-06-20 03:36:30,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=620004.0, ans=0.125 2023-06-20 03:37:03,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-20 03:37:59,364 INFO [train.py:996] (2/4) Epoch 4, batch 11900, loss[loss=0.2403, simple_loss=0.3377, pruned_loss=0.07146, over 21250.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3397, pruned_loss=0.09707, over 4272572.03 frames. 
], batch size: 548, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:38:32,367 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:38:39,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=620364.0, ans=0.125 2023-06-20 03:39:22,739 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 2.597e+02 3.084e+02 3.622e+02 5.304e+02, threshold=6.167e+02, percent-clipped=0.0 2023-06-20 03:39:30,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=620544.0, ans=0.1 2023-06-20 03:39:44,846 INFO [train.py:996] (2/4) Epoch 4, batch 11950, loss[loss=0.2242, simple_loss=0.3152, pruned_loss=0.06664, over 21741.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3391, pruned_loss=0.09357, over 4270704.26 frames. ], batch size: 316, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:40:20,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=620664.0, ans=0.0 2023-06-20 03:40:21,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=620664.0, ans=0.125 2023-06-20 03:40:40,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=620724.0, ans=0.1 2023-06-20 03:40:50,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=620724.0, ans=0.5 2023-06-20 03:40:52,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=620724.0, ans=0.0 2023-06-20 03:40:58,178 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-20 03:41:09,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=620784.0, ans=0.125 2023-06-20 03:41:30,017 INFO [train.py:996] (2/4) Epoch 4, batch 12000, loss[loss=0.2247, simple_loss=0.2765, pruned_loss=0.08646, over 21365.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3357, pruned_loss=0.09193, over 4265533.75 frames. ], batch size: 160, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:41:30,017 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 03:41:51,446 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2681, simple_loss=0.3653, pruned_loss=0.08549, over 1796401.00 frames. 2023-06-20 03:41:51,446 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 03:42:37,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=621024.0, ans=0.125 2023-06-20 03:42:39,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=621024.0, ans=0.0 2023-06-20 03:42:49,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=621084.0, ans=0.0 2023-06-20 03:43:04,289 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.836e+02 3.434e+02 3.961e+02 6.580e+02, threshold=6.867e+02, percent-clipped=2.0 2023-06-20 03:43:34,848 INFO [train.py:996] (2/4) Epoch 4, batch 12050, loss[loss=0.3055, simple_loss=0.3432, pruned_loss=0.1339, over 21717.00 frames. 
], tot_loss[loss=0.2616, simple_loss=0.3325, pruned_loss=0.09533, over 4271156.13 frames. ], batch size: 508, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:44:19,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=621324.0, ans=0.125 2023-06-20 03:45:21,138 INFO [train.py:996] (2/4) Epoch 4, batch 12100, loss[loss=0.31, simple_loss=0.3622, pruned_loss=0.1289, over 21505.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3351, pruned_loss=0.09954, over 4275419.64 frames. ], batch size: 194, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:46:08,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0 2023-06-20 03:46:17,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=621624.0, ans=0.0 2023-06-20 03:46:23,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=621684.0, ans=0.2 2023-06-20 03:46:47,632 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.023e+02 3.734e+02 4.594e+02 9.342e+02, threshold=7.469e+02, percent-clipped=3.0 2023-06-20 03:47:14,447 INFO [train.py:996] (2/4) Epoch 4, batch 12150, loss[loss=0.2273, simple_loss=0.3212, pruned_loss=0.06671, over 21719.00 frames. ], tot_loss[loss=0.2687, simple_loss=0.3395, pruned_loss=0.09891, over 4273660.41 frames. ], batch size: 298, lr: 8.09e-03, grad_scale: 32.0 2023-06-20 03:47:18,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=621804.0, ans=0.0 2023-06-20 03:48:03,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=621924.0, ans=0.07 2023-06-20 03:48:24,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=621984.0, ans=0.125 2023-06-20 03:48:31,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=621984.0, ans=0.125 2023-06-20 03:49:02,740 INFO [train.py:996] (2/4) Epoch 4, batch 12200, loss[loss=0.292, simple_loss=0.3341, pruned_loss=0.125, over 15076.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3357, pruned_loss=0.09788, over 4273112.43 frames. 
], batch size: 61, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:49:03,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=622104.0, ans=0.125 2023-06-20 03:49:16,499 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:50:04,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=622284.0, ans=0.125 2023-06-20 03:50:18,574 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.874e+02 3.548e+02 4.515e+02 8.617e+02, threshold=7.096e+02, percent-clipped=2.0 2023-06-20 03:50:19,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=622284.0, ans=0.125 2023-06-20 03:50:25,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=622344.0, ans=0.125 2023-06-20 03:50:43,053 INFO [train.py:996] (2/4) Epoch 4, batch 12250, loss[loss=0.2071, simple_loss=0.295, pruned_loss=0.05956, over 21605.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3273, pruned_loss=0.09476, over 4269848.98 frames. ], batch size: 414, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:50:57,121 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 03:51:20,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=622524.0, ans=0.0 2023-06-20 03:51:23,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=622524.0, ans=0.1 2023-06-20 03:51:53,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=622584.0, ans=0.0 2023-06-20 03:52:21,933 INFO [train.py:996] (2/4) Epoch 4, batch 12300, loss[loss=0.2209, simple_loss=0.3071, pruned_loss=0.06735, over 21462.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3216, pruned_loss=0.08836, over 4261188.38 frames. ], batch size: 471, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:52:42,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=622704.0, ans=0.0 2023-06-20 03:53:03,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. 
limit=15.0 2023-06-20 03:53:12,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=622824.0, ans=0.0 2023-06-20 03:53:33,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=622884.0, ans=0.125 2023-06-20 03:53:36,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=622884.0, ans=0.125 2023-06-20 03:53:36,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=622884.0, ans=0.1 2023-06-20 03:53:39,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=622884.0, ans=0.04949747468305833 2023-06-20 03:53:45,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.651e+02 2.451e+02 2.752e+02 3.675e+02 6.755e+02, threshold=5.504e+02, percent-clipped=0.0 2023-06-20 03:54:09,746 INFO [train.py:996] (2/4) Epoch 4, batch 12350, loss[loss=0.2472, simple_loss=0.3211, pruned_loss=0.08666, over 21656.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3232, pruned_loss=0.08635, over 4258120.58 frames. ], batch size: 263, lr: 8.08e-03, grad_scale: 16.0 2023-06-20 03:54:14,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=623004.0, ans=0.0 2023-06-20 03:54:23,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=623004.0, ans=0.0 2023-06-20 03:54:39,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. limit=10.0 2023-06-20 03:54:51,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=623124.0, ans=0.0 2023-06-20 03:54:53,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=623124.0, ans=0.0 2023-06-20 03:54:55,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-06-20 03:55:35,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=623244.0, ans=0.125 2023-06-20 03:55:44,971 INFO [train.py:996] (2/4) Epoch 4, batch 12400, loss[loss=0.2502, simple_loss=0.3107, pruned_loss=0.09485, over 21789.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3248, pruned_loss=0.09042, over 4269645.63 frames. ], batch size: 247, lr: 8.08e-03, grad_scale: 32.0 2023-06-20 03:56:41,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.09 vs. limit=6.0 2023-06-20 03:57:02,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=623484.0, ans=0.0 2023-06-20 03:57:06,679 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.226e+02 3.090e+02 3.958e+02 4.907e+02 8.874e+02, threshold=7.916e+02, percent-clipped=17.0 2023-06-20 03:57:33,460 INFO [train.py:996] (2/4) Epoch 4, batch 12450, loss[loss=0.3211, simple_loss=0.3822, pruned_loss=0.1299, over 21418.00 frames. 
], tot_loss[loss=0.259, simple_loss=0.3295, pruned_loss=0.09424, over 4279398.06 frames. ], batch size: 131, lr: 8.07e-03, grad_scale: 16.0 2023-06-20 03:58:00,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=623664.0, ans=0.0 2023-06-20 03:58:24,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=623724.0, ans=0.125 2023-06-20 03:58:29,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=623724.0, ans=0.035 2023-06-20 03:59:05,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=623844.0, ans=0.125 2023-06-20 03:59:18,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=623904.0, ans=0.025 2023-06-20 03:59:19,815 INFO [train.py:996] (2/4) Epoch 4, batch 12500, loss[loss=0.2949, simple_loss=0.3778, pruned_loss=0.1059, over 21271.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.341, pruned_loss=0.0989, over 4281197.87 frames. ], batch size: 176, lr: 8.07e-03, grad_scale: 16.0 2023-06-20 04:00:06,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=624024.0, ans=0.0 2023-06-20 04:00:53,423 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.989e+02 3.343e+02 3.829e+02 7.985e+02, threshold=6.687e+02, percent-clipped=1.0 2023-06-20 04:00:54,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-20 04:01:10,309 INFO [train.py:996] (2/4) Epoch 4, batch 12550, loss[loss=0.2528, simple_loss=0.3264, pruned_loss=0.08962, over 21172.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3469, pruned_loss=0.1023, over 4278614.55 frames. ], batch size: 143, lr: 8.07e-03, grad_scale: 16.0 2023-06-20 04:01:24,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=624204.0, ans=0.125 2023-06-20 04:01:48,398 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=15.0 2023-06-20 04:02:16,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=624324.0, ans=0.125 2023-06-20 04:02:41,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=624444.0, ans=0.0 2023-06-20 04:02:42,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=624444.0, ans=0.0 2023-06-20 04:03:04,266 INFO [train.py:996] (2/4) Epoch 4, batch 12600, loss[loss=0.2716, simple_loss=0.353, pruned_loss=0.09514, over 21637.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3444, pruned_loss=0.09902, over 4273249.53 frames. 
], batch size: 414, lr: 8.07e-03, grad_scale: 8.0 2023-06-20 04:03:06,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=624504.0, ans=0.1 2023-06-20 04:03:56,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-06-20 04:04:21,101 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.838e+02 3.340e+02 3.938e+02 7.249e+02, threshold=6.681e+02, percent-clipped=1.0 2023-06-20 04:04:21,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=624744.0, ans=0.1 2023-06-20 04:04:40,926 INFO [train.py:996] (2/4) Epoch 4, batch 12650, loss[loss=0.2213, simple_loss=0.2947, pruned_loss=0.07397, over 21817.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3357, pruned_loss=0.0945, over 4276628.94 frames. ], batch size: 282, lr: 8.07e-03, grad_scale: 8.0 2023-06-20 04:04:44,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=624804.0, ans=0.125 2023-06-20 04:04:49,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=624804.0, ans=0.0 2023-06-20 04:06:24,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=625104.0, ans=0.04949747468305833 2023-06-20 04:06:25,544 INFO [train.py:996] (2/4) Epoch 4, batch 12700, loss[loss=0.3044, simple_loss=0.3639, pruned_loss=0.1224, over 21950.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3356, pruned_loss=0.0972, over 4279511.94 frames. ], batch size: 316, lr: 8.06e-03, grad_scale: 8.0 2023-06-20 04:06:29,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=625104.0, ans=0.125 2023-06-20 04:06:41,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-20 04:06:52,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.28 vs. 
limit=12.0 2023-06-20 04:06:58,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=625164.0, ans=0.125 2023-06-20 04:06:58,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=625164.0, ans=0.125 2023-06-20 04:07:01,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=625164.0, ans=0.1 2023-06-20 04:07:06,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=625164.0, ans=0.1 2023-06-20 04:07:53,696 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 3.213e+02 3.839e+02 4.784e+02 8.311e+02, threshold=7.678e+02, percent-clipped=6.0 2023-06-20 04:07:54,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=625344.0, ans=0.125 2023-06-20 04:08:00,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=625344.0, ans=0.025 2023-06-20 04:08:07,950 INFO [train.py:996] (2/4) Epoch 4, batch 12750, loss[loss=0.3291, simple_loss=0.3877, pruned_loss=0.1353, over 21558.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.339, pruned_loss=0.099, over 4270758.78 frames. ], batch size: 471, lr: 8.06e-03, grad_scale: 8.0 2023-06-20 04:08:08,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=625404.0, ans=0.1 2023-06-20 04:08:41,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=625464.0, ans=0.125 2023-06-20 04:08:57,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=625524.0, ans=0.2 2023-06-20 04:09:48,361 INFO [train.py:996] (2/4) Epoch 4, batch 12800, loss[loss=0.2688, simple_loss=0.3355, pruned_loss=0.101, over 21899.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3391, pruned_loss=0.1005, over 4277609.55 frames. ], batch size: 316, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 04:09:58,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=625704.0, ans=0.125 2023-06-20 04:10:17,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=625764.0, ans=0.1 2023-06-20 04:11:00,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=625884.0, ans=0.0 2023-06-20 04:11:16,964 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.809e+02 3.205e+02 4.130e+02 6.634e+02, threshold=6.411e+02, percent-clipped=0.0 2023-06-20 04:11:37,282 INFO [train.py:996] (2/4) Epoch 4, batch 12850, loss[loss=0.2388, simple_loss=0.323, pruned_loss=0.07734, over 21628.00 frames. ], tot_loss[loss=0.2729, simple_loss=0.3419, pruned_loss=0.1019, over 4277310.67 frames. 
], batch size: 263, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 04:13:17,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=626244.0, ans=0.125 2023-06-20 04:13:27,095 INFO [train.py:996] (2/4) Epoch 4, batch 12900, loss[loss=0.2252, simple_loss=0.2956, pruned_loss=0.07741, over 21345.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3381, pruned_loss=0.09776, over 4272377.02 frames. ], batch size: 176, lr: 8.06e-03, grad_scale: 16.0 2023-06-20 04:14:33,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=626484.0, ans=15.0 2023-06-20 04:14:49,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=626544.0, ans=0.125 2023-06-20 04:14:50,915 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.844e+02 2.539e+02 2.943e+02 3.440e+02 5.830e+02, threshold=5.886e+02, percent-clipped=0.0 2023-06-20 04:15:09,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=626544.0, ans=0.025 2023-06-20 04:15:11,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=626604.0, ans=0.0 2023-06-20 04:15:12,145 INFO [train.py:996] (2/4) Epoch 4, batch 12950, loss[loss=0.3132, simple_loss=0.3682, pruned_loss=0.1291, over 21814.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.336, pruned_loss=0.09542, over 4272678.94 frames. ], batch size: 118, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:15:20,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=626604.0, ans=0.0 2023-06-20 04:15:53,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=626724.0, ans=0.125 2023-06-20 04:16:06,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=626784.0, ans=0.125 2023-06-20 04:16:50,328 INFO [train.py:996] (2/4) Epoch 4, batch 13000, loss[loss=0.2386, simple_loss=0.3127, pruned_loss=0.08221, over 21759.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3365, pruned_loss=0.09574, over 4273105.33 frames. ], batch size: 118, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:16:54,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=626904.0, ans=0.0 2023-06-20 04:17:05,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=626964.0, ans=0.2 2023-06-20 04:17:08,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. 
limit=15.0 2023-06-20 04:18:08,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=627084.0, ans=0.0 2023-06-20 04:18:13,297 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.950e+02 2.732e+02 3.360e+02 4.009e+02 6.698e+02, threshold=6.719e+02, percent-clipped=5.0 2023-06-20 04:18:13,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=627144.0, ans=0.07 2023-06-20 04:18:32,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=627204.0, ans=0.125 2023-06-20 04:18:33,517 INFO [train.py:996] (2/4) Epoch 4, batch 13050, loss[loss=0.3139, simple_loss=0.3633, pruned_loss=0.1322, over 21902.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.331, pruned_loss=0.09325, over 4276614.80 frames. ], batch size: 414, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:18:34,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=627204.0, ans=0.1 2023-06-20 04:19:55,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=627444.0, ans=0.0 2023-06-20 04:20:18,890 INFO [train.py:996] (2/4) Epoch 4, batch 13100, loss[loss=0.3297, simple_loss=0.3914, pruned_loss=0.134, over 21536.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3349, pruned_loss=0.09481, over 4277537.16 frames. ], batch size: 471, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:20:22,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=627504.0, ans=0.125 2023-06-20 04:21:23,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=627684.0, ans=0.125 2023-06-20 04:21:48,720 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 3.050e+02 3.631e+02 4.148e+02 7.105e+02, threshold=7.262e+02, percent-clipped=1.0 2023-06-20 04:22:03,665 INFO [train.py:996] (2/4) Epoch 4, batch 13150, loss[loss=0.2228, simple_loss=0.2518, pruned_loss=0.09691, over 19925.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3355, pruned_loss=0.09785, over 4279980.92 frames. ], batch size: 702, lr: 8.05e-03, grad_scale: 16.0 2023-06-20 04:23:02,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=627924.0, ans=0.0 2023-06-20 04:23:16,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=627984.0, ans=0.5 2023-06-20 04:23:18,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=627984.0, ans=0.125 2023-06-20 04:24:02,016 INFO [train.py:996] (2/4) Epoch 4, batch 13200, loss[loss=0.3178, simple_loss=0.3779, pruned_loss=0.1289, over 21430.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3339, pruned_loss=0.09826, over 4278587.01 frames. 
], batch size: 471, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 04:24:44,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=628224.0, ans=0.125 2023-06-20 04:24:44,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=628224.0, ans=0.125 2023-06-20 04:24:45,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=628224.0, ans=0.125 2023-06-20 04:25:01,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=628224.0, ans=10.0 2023-06-20 04:25:09,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=22.5 2023-06-20 04:25:27,052 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.207e+02 2.669e+02 3.031e+02 3.643e+02 6.014e+02, threshold=6.063e+02, percent-clipped=0.0 2023-06-20 04:25:47,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=628404.0, ans=0.125 2023-06-20 04:25:48,487 INFO [train.py:996] (2/4) Epoch 4, batch 13250, loss[loss=0.2656, simple_loss=0.3402, pruned_loss=0.09547, over 21824.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3344, pruned_loss=0.0994, over 4278186.44 frames. ], batch size: 351, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 04:26:06,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=628404.0, ans=0.2 2023-06-20 04:27:40,908 INFO [train.py:996] (2/4) Epoch 4, batch 13300, loss[loss=0.2503, simple_loss=0.3392, pruned_loss=0.08073, over 21694.00 frames. ], tot_loss[loss=0.2675, simple_loss=0.3377, pruned_loss=0.09868, over 4278510.28 frames. ], batch size: 351, lr: 8.04e-03, grad_scale: 32.0 2023-06-20 04:27:46,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=628704.0, ans=0.2 2023-06-20 04:27:51,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=628704.0, ans=0.0 2023-06-20 04:28:47,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.47 vs. limit=15.0 2023-06-20 04:28:58,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=628944.0, ans=0.0 2023-06-20 04:29:02,516 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 2.810e+02 3.267e+02 3.707e+02 6.798e+02, threshold=6.534e+02, percent-clipped=1.0 2023-06-20 04:29:13,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=628944.0, ans=10.0 2023-06-20 04:29:21,152 INFO [train.py:996] (2/4) Epoch 4, batch 13350, loss[loss=0.3133, simple_loss=0.3885, pruned_loss=0.119, over 21610.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3426, pruned_loss=0.1005, over 4267076.12 frames. 
], batch size: 414, lr: 8.04e-03, grad_scale: 16.0 2023-06-20 04:29:29,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=629004.0, ans=0.0 2023-06-20 04:29:29,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=629004.0, ans=0.1 2023-06-20 04:29:51,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=629064.0, ans=0.1 2023-06-20 04:29:58,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=629124.0, ans=15.0 2023-06-20 04:30:37,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=629184.0, ans=0.1 2023-06-20 04:31:05,734 INFO [train.py:996] (2/4) Epoch 4, batch 13400, loss[loss=0.2738, simple_loss=0.3379, pruned_loss=0.1048, over 21701.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3434, pruned_loss=0.1027, over 4276200.86 frames. ], batch size: 389, lr: 8.04e-03, grad_scale: 16.0 2023-06-20 04:31:20,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=629304.0, ans=0.0 2023-06-20 04:32:38,236 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.043e+02 3.589e+02 4.349e+02 7.690e+02, threshold=7.178e+02, percent-clipped=6.0 2023-06-20 04:32:51,551 INFO [train.py:996] (2/4) Epoch 4, batch 13450, loss[loss=0.3183, simple_loss=0.373, pruned_loss=0.1318, over 21684.00 frames. ], tot_loss[loss=0.2786, simple_loss=0.3459, pruned_loss=0.1056, over 4276761.70 frames. ], batch size: 441, lr: 8.04e-03, grad_scale: 16.0 2023-06-20 04:33:17,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=629664.0, ans=0.125 2023-06-20 04:33:17,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=629664.0, ans=0.0 2023-06-20 04:33:57,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-20 04:34:46,890 INFO [train.py:996] (2/4) Epoch 4, batch 13500, loss[loss=0.2655, simple_loss=0.3745, pruned_loss=0.07824, over 20832.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3358, pruned_loss=0.102, over 4269808.24 frames. ], batch size: 607, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:35:23,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=629964.0, ans=0.95 2023-06-20 04:35:35,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=630024.0, ans=0.125 2023-06-20 04:35:51,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. 
limit=15.0 2023-06-20 04:35:59,906 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:36:03,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=630084.0, ans=0.125 2023-06-20 04:36:21,747 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 3.218e+02 3.754e+02 4.451e+02 7.704e+02, threshold=7.508e+02, percent-clipped=1.0 2023-06-20 04:36:34,735 INFO [train.py:996] (2/4) Epoch 4, batch 13550, loss[loss=0.2661, simple_loss=0.3424, pruned_loss=0.09493, over 20072.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3392, pruned_loss=0.1013, over 4260636.14 frames. ], batch size: 702, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:37:52,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=630384.0, ans=0.0 2023-06-20 04:38:06,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=630444.0, ans=0.2 2023-06-20 04:38:07,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-20 04:38:18,742 INFO [train.py:996] (2/4) Epoch 4, batch 13600, loss[loss=0.3049, simple_loss=0.3599, pruned_loss=0.1249, over 21794.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3409, pruned_loss=0.1028, over 4271853.68 frames. ], batch size: 441, lr: 8.03e-03, grad_scale: 32.0 2023-06-20 04:38:47,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=630564.0, ans=0.1 2023-06-20 04:39:09,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=630624.0, ans=0.1 2023-06-20 04:39:09,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=630624.0, ans=0.07 2023-06-20 04:39:50,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 2.831e+02 3.255e+02 3.648e+02 6.704e+02, threshold=6.511e+02, percent-clipped=0.0 2023-06-20 04:39:58,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=630744.0, ans=0.95 2023-06-20 04:40:01,084 INFO [train.py:996] (2/4) Epoch 4, batch 13650, loss[loss=0.2343, simple_loss=0.2924, pruned_loss=0.08809, over 21201.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3349, pruned_loss=0.09904, over 4277081.56 frames. ], batch size: 159, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:40:11,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=630804.0, ans=0.125 2023-06-20 04:40:24,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=630804.0, ans=0.125 2023-06-20 04:40:42,614 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:41:45,824 INFO [train.py:996] (2/4) Epoch 4, batch 13700, loss[loss=0.2122, simple_loss=0.2601, pruned_loss=0.08211, over 21208.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3306, pruned_loss=0.09868, over 4269502.38 frames. 
], batch size: 159, lr: 8.03e-03, grad_scale: 16.0 2023-06-20 04:42:24,230 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 04:42:24,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=631164.0, ans=0.125 2023-06-20 04:43:24,878 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.956e+02 3.502e+02 4.364e+02 8.587e+02, threshold=7.005e+02, percent-clipped=5.0 2023-06-20 04:43:32,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=631344.0, ans=0.125 2023-06-20 04:43:42,177 INFO [train.py:996] (2/4) Epoch 4, batch 13750, loss[loss=0.2216, simple_loss=0.2842, pruned_loss=0.07952, over 21191.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3263, pruned_loss=0.0972, over 4260823.45 frames. ], batch size: 176, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:43:56,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=631404.0, ans=0.0 2023-06-20 04:44:46,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=631584.0, ans=0.125 2023-06-20 04:45:00,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2023-06-20 04:45:07,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=631584.0, ans=0.125 2023-06-20 04:45:12,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=631584.0, ans=0.125 2023-06-20 04:45:32,631 INFO [train.py:996] (2/4) Epoch 4, batch 13800, loss[loss=0.3184, simple_loss=0.4066, pruned_loss=0.1151, over 21891.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3343, pruned_loss=0.09589, over 4254126.44 frames. ], batch size: 372, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:47:06,847 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 3.073e+02 3.642e+02 4.560e+02 8.359e+02, threshold=7.284e+02, percent-clipped=3.0 2023-06-20 04:47:23,546 INFO [train.py:996] (2/4) Epoch 4, batch 13850, loss[loss=0.3785, simple_loss=0.4336, pruned_loss=0.1617, over 21468.00 frames. ], tot_loss[loss=0.2664, simple_loss=0.34, pruned_loss=0.09641, over 4259738.32 frames. ], batch size: 508, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:48:12,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=632124.0, ans=0.125 2023-06-20 04:48:21,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.55 vs. limit=6.0 2023-06-20 04:48:44,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=632184.0, ans=0.125 2023-06-20 04:48:45,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-20 04:49:09,030 INFO [train.py:996] (2/4) Epoch 4, batch 13900, loss[loss=0.3107, simple_loss=0.3667, pruned_loss=0.1273, over 21452.00 frames. 
], tot_loss[loss=0.2736, simple_loss=0.3445, pruned_loss=0.1014, over 4262222.21 frames. ], batch size: 548, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:49:57,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=632424.0, ans=0.125 2023-06-20 04:50:08,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-20 04:50:40,993 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.556e+02 4.265e+02 5.269e+02 7.474e+02, threshold=8.529e+02, percent-clipped=1.0 2023-06-20 04:50:52,510 INFO [train.py:996] (2/4) Epoch 4, batch 13950, loss[loss=0.2902, simple_loss=0.3457, pruned_loss=0.1174, over 21932.00 frames. ], tot_loss[loss=0.2775, simple_loss=0.3466, pruned_loss=0.1042, over 4270941.85 frames. ], batch size: 316, lr: 8.02e-03, grad_scale: 16.0 2023-06-20 04:51:34,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.14 vs. limit=6.0 2023-06-20 04:51:46,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.whiten.whitening_limit, batch_count=632724.0, ans=12.0 2023-06-20 04:51:46,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=632724.0, ans=0.125 2023-06-20 04:51:55,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-20 04:52:35,138 INFO [train.py:996] (2/4) Epoch 4, batch 14000, loss[loss=0.1748, simple_loss=0.236, pruned_loss=0.05677, over 21766.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3447, pruned_loss=0.1021, over 4276699.54 frames. ], batch size: 102, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 04:53:01,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=632964.0, ans=0.0 2023-06-20 04:53:04,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=632964.0, ans=0.95 2023-06-20 04:53:18,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=633024.0, ans=0.0 2023-06-20 04:53:27,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=633024.0, ans=0.125 2023-06-20 04:53:41,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=633084.0, ans=0.125 2023-06-20 04:54:05,709 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.790e+02 2.775e+02 3.339e+02 4.026e+02 8.444e+02, threshold=6.679e+02, percent-clipped=0.0 2023-06-20 04:54:17,333 INFO [train.py:996] (2/4) Epoch 4, batch 14050, loss[loss=0.1971, simple_loss=0.2852, pruned_loss=0.05451, over 21559.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3379, pruned_loss=0.09721, over 4278840.92 frames. 
], batch size: 230, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 04:54:34,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=633204.0, ans=0.0 2023-06-20 04:54:36,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=633204.0, ans=0.0 2023-06-20 04:54:37,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=633264.0, ans=0.2 2023-06-20 04:54:48,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=633264.0, ans=0.125 2023-06-20 04:54:55,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0 2023-06-20 04:56:00,966 INFO [train.py:996] (2/4) Epoch 4, batch 14100, loss[loss=0.281, simple_loss=0.3466, pruned_loss=0.1078, over 21510.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3306, pruned_loss=0.09628, over 4275059.88 frames. ], batch size: 389, lr: 8.01e-03, grad_scale: 32.0 2023-06-20 04:56:28,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=633564.0, ans=0.125 2023-06-20 04:57:12,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=633684.0, ans=0.125 2023-06-20 04:57:13,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=633684.0, ans=0.2 2023-06-20 04:57:15,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=1.93 vs. limit=12.0 2023-06-20 04:57:34,360 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.697e+02 3.241e+02 4.086e+02 6.665e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 04:57:43,888 INFO [train.py:996] (2/4) Epoch 4, batch 14150, loss[loss=0.2699, simple_loss=0.3494, pruned_loss=0.09519, over 21876.00 frames. ], tot_loss[loss=0.2647, simple_loss=0.3342, pruned_loss=0.0976, over 4276646.90 frames. ], batch size: 317, lr: 8.01e-03, grad_scale: 16.0 2023-06-20 04:58:05,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=633864.0, ans=0.015 2023-06-20 04:58:54,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=633984.0, ans=0.125 2023-06-20 04:58:59,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=633984.0, ans=0.125 2023-06-20 04:59:24,543 INFO [train.py:996] (2/4) Epoch 4, batch 14200, loss[loss=0.2478, simple_loss=0.3096, pruned_loss=0.09297, over 21515.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3306, pruned_loss=0.09451, over 4278908.83 frames. 
], batch size: 195, lr: 8.01e-03, grad_scale: 16.0 2023-06-20 04:59:33,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=634104.0, ans=0.0 2023-06-20 04:59:52,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=634164.0, ans=0.0 2023-06-20 05:00:01,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=634224.0, ans=0.015 2023-06-20 05:00:05,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=634224.0, ans=0.0 2023-06-20 05:00:52,668 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.805e+02 2.482e+02 2.802e+02 3.379e+02 6.129e+02, threshold=5.605e+02, percent-clipped=0.0 2023-06-20 05:01:07,626 INFO [train.py:996] (2/4) Epoch 4, batch 14250, loss[loss=0.2256, simple_loss=0.2778, pruned_loss=0.0867, over 21545.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3245, pruned_loss=0.09413, over 4264828.96 frames. ], batch size: 263, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 05:01:16,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=634404.0, ans=0.07 2023-06-20 05:01:30,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=634464.0, ans=0.2 2023-06-20 05:01:40,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=24.85 vs. limit=15.0 2023-06-20 05:01:59,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=634524.0, ans=0.0 2023-06-20 05:02:41,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=634644.0, ans=0.2 2023-06-20 05:02:43,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=634644.0, ans=0.125 2023-06-20 05:02:46,952 INFO [train.py:996] (2/4) Epoch 4, batch 14300, loss[loss=0.4056, simple_loss=0.4789, pruned_loss=0.1661, over 21579.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3275, pruned_loss=0.09432, over 4263229.12 frames. ], batch size: 441, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 05:03:07,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-20 05:04:21,886 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.821e+02 3.422e+02 4.230e+02 9.010e+02, threshold=6.844e+02, percent-clipped=9.0 2023-06-20 05:04:31,918 INFO [train.py:996] (2/4) Epoch 4, batch 14350, loss[loss=0.3253, simple_loss=0.379, pruned_loss=0.1358, over 21609.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3335, pruned_loss=0.09549, over 4252068.78 frames. 
], batch size: 471, lr: 8.00e-03, grad_scale: 16.0 2023-06-20 05:04:43,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=635004.0, ans=0.0 2023-06-20 05:04:46,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=635004.0, ans=0.125 2023-06-20 05:05:10,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=635064.0, ans=0.0 2023-06-20 05:05:25,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=635124.0, ans=0.0 2023-06-20 05:06:19,444 INFO [train.py:996] (2/4) Epoch 4, batch 14400, loss[loss=0.2784, simple_loss=0.3322, pruned_loss=0.1124, over 21716.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3296, pruned_loss=0.09566, over 4254101.77 frames. ], batch size: 332, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 05:07:07,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=635424.0, ans=0.05 2023-06-20 05:07:33,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=635484.0, ans=0.125 2023-06-20 05:07:42,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.165e+02 2.821e+02 3.350e+02 4.136e+02 6.839e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-20 05:07:57,250 INFO [train.py:996] (2/4) Epoch 4, batch 14450, loss[loss=0.2283, simple_loss=0.2893, pruned_loss=0.0837, over 21657.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3254, pruned_loss=0.09638, over 4244574.87 frames. ], batch size: 231, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 05:08:24,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=635664.0, ans=0.125 2023-06-20 05:09:39,342 INFO [train.py:996] (2/4) Epoch 4, batch 14500, loss[loss=0.2483, simple_loss=0.3343, pruned_loss=0.08114, over 21751.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3218, pruned_loss=0.09576, over 4248064.60 frames. ], batch size: 351, lr: 8.00e-03, grad_scale: 32.0 2023-06-20 05:09:52,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=635904.0, ans=0.125 2023-06-20 05:10:29,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=636024.0, ans=0.1 2023-06-20 05:11:13,698 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.861e+02 3.336e+02 4.612e+02 7.217e+02, threshold=6.672e+02, percent-clipped=2.0 2023-06-20 05:11:24,714 INFO [train.py:996] (2/4) Epoch 4, batch 14550, loss[loss=0.3009, simple_loss=0.3622, pruned_loss=0.1198, over 21848.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3301, pruned_loss=0.09928, over 4255789.91 frames. ], batch size: 247, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:12:41,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.76 vs. 
limit=15.0 2023-06-20 05:12:59,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=636444.0, ans=0.0 2023-06-20 05:13:03,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=636444.0, ans=0.1 2023-06-20 05:13:16,345 INFO [train.py:996] (2/4) Epoch 4, batch 14600, loss[loss=0.3015, simple_loss=0.3845, pruned_loss=0.1093, over 21625.00 frames. ], tot_loss[loss=0.272, simple_loss=0.3383, pruned_loss=0.1029, over 4255689.73 frames. ], batch size: 389, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:14:09,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=636624.0, ans=0.125 2023-06-20 05:14:43,558 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 3.039e+02 3.637e+02 4.481e+02 9.662e+02, threshold=7.275e+02, percent-clipped=5.0 2023-06-20 05:14:53,033 INFO [train.py:996] (2/4) Epoch 4, batch 14650, loss[loss=0.2602, simple_loss=0.3427, pruned_loss=0.08882, over 21736.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3397, pruned_loss=0.1016, over 4262340.89 frames. ], batch size: 332, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:15:42,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=636924.0, ans=0.0 2023-06-20 05:15:45,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=636924.0, ans=0.125 2023-06-20 05:16:40,857 INFO [train.py:996] (2/4) Epoch 4, batch 14700, loss[loss=0.2903, simple_loss=0.3645, pruned_loss=0.1081, over 21479.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3324, pruned_loss=0.09495, over 4257146.51 frames. ], batch size: 508, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:16:42,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-20 05:17:11,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-20 05:17:33,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=637224.0, ans=0.1 2023-06-20 05:17:34,607 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:17:49,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=637284.0, ans=0.125 2023-06-20 05:18:00,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-06-20 05:18:09,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-20 05:18:12,039 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.562e+02 2.351e+02 2.788e+02 3.519e+02 6.135e+02, threshold=5.577e+02, percent-clipped=0.0 2023-06-20 05:18:22,368 INFO [train.py:996] (2/4) Epoch 4, batch 14750, loss[loss=0.2729, simple_loss=0.3363, pruned_loss=0.1048, over 21489.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3383, pruned_loss=0.09873, over 4251431.73 frames. 
], batch size: 211, lr: 7.99e-03, grad_scale: 32.0 2023-06-20 05:18:28,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=637404.0, ans=0.125 2023-06-20 05:18:31,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=637404.0, ans=0.1 2023-06-20 05:18:57,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=637464.0, ans=0.125 2023-06-20 05:19:56,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=637644.0, ans=0.2 2023-06-20 05:20:03,230 INFO [train.py:996] (2/4) Epoch 4, batch 14800, loss[loss=0.298, simple_loss=0.354, pruned_loss=0.121, over 15319.00 frames. ], tot_loss[loss=0.2802, simple_loss=0.351, pruned_loss=0.1047, over 4251279.02 frames. ], batch size: 60, lr: 7.98e-03, grad_scale: 32.0 2023-06-20 05:20:36,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-20 05:21:40,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=637944.0, ans=0.1 2023-06-20 05:21:40,895 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-20 05:21:42,993 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.164e+02 3.889e+02 4.731e+02 8.129e+02, threshold=7.778e+02, percent-clipped=15.0 2023-06-20 05:21:52,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=637944.0, ans=0.125 2023-06-20 05:22:00,586 INFO [train.py:996] (2/4) Epoch 4, batch 14850, loss[loss=0.2437, simple_loss=0.3002, pruned_loss=0.09364, over 21571.00 frames. ], tot_loss[loss=0.2744, simple_loss=0.3429, pruned_loss=0.1029, over 4251685.98 frames. ], batch size: 247, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:22:52,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=638124.0, ans=0.125 2023-06-20 05:22:59,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=638124.0, ans=0.125 2023-06-20 05:23:08,080 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:23:23,397 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:23:35,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=638244.0, ans=0.0 2023-06-20 05:23:38,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=638244.0, ans=0.0 2023-06-20 05:23:46,629 INFO [train.py:996] (2/4) Epoch 4, batch 14900, loss[loss=0.2886, simple_loss=0.3579, pruned_loss=0.1096, over 21408.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3431, pruned_loss=0.1039, over 4249299.18 frames. ], batch size: 131, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:23:57,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. 
limit=6.0 2023-06-20 05:24:21,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=638364.0, ans=0.0 2023-06-20 05:24:27,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-20 05:25:02,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=638484.0, ans=0.125 2023-06-20 05:25:22,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=638544.0, ans=0.125 2023-06-20 05:25:23,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=638544.0, ans=0.0 2023-06-20 05:25:25,414 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.971e+02 3.790e+02 5.715e+02 1.373e+03, threshold=7.580e+02, percent-clipped=7.0 2023-06-20 05:25:32,246 INFO [train.py:996] (2/4) Epoch 4, batch 14950, loss[loss=0.3115, simple_loss=0.3748, pruned_loss=0.1241, over 21411.00 frames. ], tot_loss[loss=0.2779, simple_loss=0.3462, pruned_loss=0.1048, over 4253287.58 frames. ], batch size: 471, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:25:54,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-20 05:25:58,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=638664.0, ans=0.0 2023-06-20 05:26:07,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=638664.0, ans=0.125 2023-06-20 05:26:14,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=638724.0, ans=0.1 2023-06-20 05:26:14,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=638724.0, ans=15.0 2023-06-20 05:27:16,861 INFO [train.py:996] (2/4) Epoch 4, batch 15000, loss[loss=0.2859, simple_loss=0.3549, pruned_loss=0.1085, over 21755.00 frames. ], tot_loss[loss=0.2809, simple_loss=0.3485, pruned_loss=0.1066, over 4256232.80 frames. ], batch size: 441, lr: 7.98e-03, grad_scale: 16.0 2023-06-20 05:27:16,861 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 05:27:33,904 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2743, simple_loss=0.3665, pruned_loss=0.09108, over 1796401.00 frames. 2023-06-20 05:27:33,905 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 05:27:43,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-20 05:28:11,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.24 vs. 
limit=22.5 2023-06-20 05:28:49,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=639084.0, ans=0.125 2023-06-20 05:28:49,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=639084.0, ans=0.0 2023-06-20 05:29:02,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=639144.0, ans=0.2 2023-06-20 05:29:12,214 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.360e+02 3.927e+02 4.837e+02 8.029e+02, threshold=7.853e+02, percent-clipped=2.0 2023-06-20 05:29:24,481 INFO [train.py:996] (2/4) Epoch 4, batch 15050, loss[loss=0.312, simple_loss=0.3865, pruned_loss=0.1188, over 21663.00 frames. ], tot_loss[loss=0.2808, simple_loss=0.3474, pruned_loss=0.1071, over 4252096.87 frames. ], batch size: 441, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:29:59,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=639264.0, ans=0.0 2023-06-20 05:30:01,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=639264.0, ans=0.125 2023-06-20 05:30:40,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=639384.0, ans=0.0 2023-06-20 05:31:09,511 INFO [train.py:996] (2/4) Epoch 4, batch 15100, loss[loss=0.3392, simple_loss=0.4014, pruned_loss=0.1385, over 21548.00 frames. ], tot_loss[loss=0.2805, simple_loss=0.3479, pruned_loss=0.1065, over 4243776.43 frames. ], batch size: 414, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:32:04,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=639624.0, ans=0.125 2023-06-20 05:32:45,900 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.063e+02 3.378e+02 3.992e+02 7.623e+02, threshold=6.756e+02, percent-clipped=0.0 2023-06-20 05:32:49,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-20 05:32:52,587 INFO [train.py:996] (2/4) Epoch 4, batch 15150, loss[loss=0.2905, simple_loss=0.3369, pruned_loss=0.1221, over 21426.00 frames. ], tot_loss[loss=0.2796, simple_loss=0.3449, pruned_loss=0.1071, over 4250488.66 frames. ], batch size: 389, lr: 7.97e-03, grad_scale: 16.0 2023-06-20 05:33:08,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=639804.0, ans=0.125 2023-06-20 05:33:36,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=639864.0, ans=0.0 2023-06-20 05:33:42,501 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-20 05:33:44,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=639924.0, ans=15.0 2023-06-20 05:34:41,874 INFO [train.py:996] (2/4) Epoch 4, batch 15200, loss[loss=0.2295, simple_loss=0.2892, pruned_loss=0.08487, over 21771.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3359, pruned_loss=0.1023, over 4251487.12 frames. 
], batch size: 112, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:35:00,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=640104.0, ans=0.125 2023-06-20 05:35:40,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=640284.0, ans=0.0 2023-06-20 05:35:40,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=640284.0, ans=0.1 2023-06-20 05:35:42,357 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:36:14,281 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.035e+02 3.960e+02 4.645e+02 7.650e+02, threshold=7.920e+02, percent-clipped=3.0 2023-06-20 05:36:25,892 INFO [train.py:996] (2/4) Epoch 4, batch 15250, loss[loss=0.2621, simple_loss=0.322, pruned_loss=0.1011, over 21792.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.33, pruned_loss=0.1004, over 4254141.03 frames. ], batch size: 112, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:36:32,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-20 05:36:40,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=640404.0, ans=0.0 2023-06-20 05:36:42,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-20 05:37:38,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=640584.0, ans=0.2 2023-06-20 05:38:05,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=640644.0, ans=0.125 2023-06-20 05:38:17,107 INFO [train.py:996] (2/4) Epoch 4, batch 15300, loss[loss=0.3142, simple_loss=0.3647, pruned_loss=0.1318, over 21762.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.333, pruned_loss=0.104, over 4261776.28 frames. ], batch size: 441, lr: 7.97e-03, grad_scale: 32.0 2023-06-20 05:38:19,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.31 vs. 
limit=15.0 2023-06-20 05:38:50,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=640764.0, ans=0.2 2023-06-20 05:39:00,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=640824.0, ans=0.2 2023-06-20 05:39:05,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=640824.0, ans=0.2 2023-06-20 05:39:08,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=640824.0, ans=0.125 2023-06-20 05:39:54,775 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.484e+02 2.881e+02 3.296e+02 3.984e+02 9.139e+02, threshold=6.591e+02, percent-clipped=2.0 2023-06-20 05:39:55,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=640944.0, ans=0.95 2023-06-20 05:39:59,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=640944.0, ans=0.125 2023-06-20 05:39:59,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-20 05:40:01,930 INFO [train.py:996] (2/4) Epoch 4, batch 15350, loss[loss=0.2454, simple_loss=0.3499, pruned_loss=0.07043, over 21795.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3398, pruned_loss=0.1066, over 4266369.28 frames. ], batch size: 298, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:40:33,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=12.0 2023-06-20 05:40:34,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-20 05:40:46,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.54 vs. limit=10.0 2023-06-20 05:41:16,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=641244.0, ans=0.0 2023-06-20 05:41:27,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=641244.0, ans=0.0 2023-06-20 05:41:36,558 INFO [train.py:996] (2/4) Epoch 4, batch 15400, loss[loss=0.2806, simple_loss=0.3429, pruned_loss=0.1092, over 21889.00 frames. ], tot_loss[loss=0.274, simple_loss=0.339, pruned_loss=0.1045, over 4262540.26 frames. 
], batch size: 371, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:42:02,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=641364.0, ans=0.0 2023-06-20 05:42:21,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=641424.0, ans=0.125 2023-06-20 05:43:07,584 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.755e+02 3.304e+02 3.947e+02 7.271e+02, threshold=6.607e+02, percent-clipped=2.0 2023-06-20 05:43:13,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=641544.0, ans=0.125 2023-06-20 05:43:19,995 INFO [train.py:996] (2/4) Epoch 4, batch 15450, loss[loss=0.2758, simple_loss=0.3399, pruned_loss=0.1059, over 21861.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3361, pruned_loss=0.1037, over 4273317.52 frames. ], batch size: 118, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:43:33,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=641604.0, ans=0.125 2023-06-20 05:44:06,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=641724.0, ans=0.0 2023-06-20 05:44:08,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=641724.0, ans=0.125 2023-06-20 05:44:41,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=641844.0, ans=0.1 2023-06-20 05:45:10,542 INFO [train.py:996] (2/4) Epoch 4, batch 15500, loss[loss=0.2901, simple_loss=0.3624, pruned_loss=0.1089, over 21300.00 frames. ], tot_loss[loss=0.2727, simple_loss=0.339, pruned_loss=0.1032, over 4262706.58 frames. ], batch size: 176, lr: 7.96e-03, grad_scale: 32.0 2023-06-20 05:45:37,457 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.61 vs. limit=22.5 2023-06-20 05:46:07,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=642084.0, ans=0.0 2023-06-20 05:46:30,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.84 vs. limit=15.0 2023-06-20 05:46:42,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=15.0 2023-06-20 05:46:56,142 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.827e+02 3.458e+02 4.345e+02 6.798e+02, threshold=6.916e+02, percent-clipped=2.0 2023-06-20 05:46:57,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-20 05:47:00,856 INFO [train.py:996] (2/4) Epoch 4, batch 15550, loss[loss=0.2458, simple_loss=0.3297, pruned_loss=0.08095, over 21689.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3379, pruned_loss=0.1007, over 4265854.71 frames. 
], batch size: 298, lr: 7.96e-03, grad_scale: 16.0 2023-06-20 05:47:20,844 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:47:44,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=642324.0, ans=0.125 2023-06-20 05:48:39,097 INFO [train.py:996] (2/4) Epoch 4, batch 15600, loss[loss=0.268, simple_loss=0.3377, pruned_loss=0.09914, over 21620.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.331, pruned_loss=0.09858, over 4261669.68 frames. ], batch size: 247, lr: 7.95e-03, grad_scale: 32.0 2023-06-20 05:48:44,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=642504.0, ans=0.125 2023-06-20 05:49:11,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.51 vs. limit=12.0 2023-06-20 05:49:49,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=642684.0, ans=0.125 2023-06-20 05:49:52,524 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 05:50:00,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=642744.0, ans=0.0 2023-06-20 05:50:17,883 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.141e+02 2.694e+02 3.221e+02 3.969e+02 6.566e+02, threshold=6.442e+02, percent-clipped=0.0 2023-06-20 05:50:21,217 INFO [train.py:996] (2/4) Epoch 4, batch 15650, loss[loss=0.2424, simple_loss=0.3068, pruned_loss=0.08897, over 21866.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3322, pruned_loss=0.09786, over 4256495.34 frames. ], batch size: 107, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:50:33,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.84 vs. limit=15.0 2023-06-20 05:50:35,429 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.13 vs. limit=6.0 2023-06-20 05:50:58,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=642924.0, ans=0.0 2023-06-20 05:51:02,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-20 05:51:19,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=642924.0, ans=0.2 2023-06-20 05:51:21,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=642924.0, ans=0.2 2023-06-20 05:51:57,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-20 05:52:06,102 INFO [train.py:996] (2/4) Epoch 4, batch 15700, loss[loss=0.2676, simple_loss=0.3358, pruned_loss=0.09967, over 21538.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3289, pruned_loss=0.09767, over 4251318.72 frames. 
], batch size: 389, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:52:16,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=643104.0, ans=0.2 2023-06-20 05:52:20,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=643104.0, ans=0.125 2023-06-20 05:52:27,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=643164.0, ans=0.1 2023-06-20 05:53:30,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=643284.0, ans=0.0 2023-06-20 05:53:46,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.101e+02 2.685e+02 3.204e+02 3.710e+02 6.179e+02, threshold=6.407e+02, percent-clipped=0.0 2023-06-20 05:53:47,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=643344.0, ans=0.125 2023-06-20 05:53:49,723 INFO [train.py:996] (2/4) Epoch 4, batch 15750, loss[loss=0.2324, simple_loss=0.2972, pruned_loss=0.0838, over 21213.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3235, pruned_loss=0.09632, over 4252505.26 frames. ], batch size: 143, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:54:52,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.37 vs. limit=22.5 2023-06-20 05:54:57,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.25 vs. limit=22.5 2023-06-20 05:55:31,458 INFO [train.py:996] (2/4) Epoch 4, batch 15800, loss[loss=0.2115, simple_loss=0.2794, pruned_loss=0.07183, over 21526.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3194, pruned_loss=0.09668, over 4253959.29 frames. ], batch size: 230, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:56:16,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=643824.0, ans=0.125 2023-06-20 05:56:45,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=643884.0, ans=0.0 2023-06-20 05:57:11,281 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.033e+02 3.511e+02 4.104e+02 6.063e+02, threshold=7.023e+02, percent-clipped=0.0 2023-06-20 05:57:14,457 INFO [train.py:996] (2/4) Epoch 4, batch 15850, loss[loss=0.3136, simple_loss=0.3661, pruned_loss=0.1306, over 21347.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3214, pruned_loss=0.09867, over 4252948.40 frames. ], batch size: 471, lr: 7.95e-03, grad_scale: 16.0 2023-06-20 05:57:16,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=644004.0, ans=0.125 2023-06-20 05:57:52,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=644124.0, ans=0.0 2023-06-20 05:58:48,848 INFO [train.py:996] (2/4) Epoch 4, batch 15900, loss[loss=0.2695, simple_loss=0.3478, pruned_loss=0.09561, over 21523.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3211, pruned_loss=0.1, over 4251539.61 frames. 
], batch size: 389, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 05:59:14,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=644364.0, ans=0.2 2023-06-20 05:59:48,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=644484.0, ans=0.1 2023-06-20 05:59:50,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-20 06:00:28,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.103e+02 2.691e+02 3.018e+02 3.866e+02 6.282e+02, threshold=6.037e+02, percent-clipped=0.0 2023-06-20 06:00:32,144 INFO [train.py:996] (2/4) Epoch 4, batch 15950, loss[loss=0.2343, simple_loss=0.295, pruned_loss=0.0868, over 15870.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3202, pruned_loss=0.0973, over 4255408.01 frames. ], batch size: 62, lr: 7.94e-03, grad_scale: 16.0 2023-06-20 06:00:46,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-20 06:00:55,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=644664.0, ans=0.125 2023-06-20 06:00:57,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=644664.0, ans=0.0 2023-06-20 06:01:03,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=644724.0, ans=0.1 2023-06-20 06:01:37,962 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=12.0 2023-06-20 06:02:04,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=644844.0, ans=0.125 2023-06-20 06:02:14,386 INFO [train.py:996] (2/4) Epoch 4, batch 16000, loss[loss=0.2005, simple_loss=0.2835, pruned_loss=0.05879, over 21366.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3219, pruned_loss=0.0944, over 4246551.73 frames. ], batch size: 176, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:03:11,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=645084.0, ans=0.125 2023-06-20 06:03:12,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=645084.0, ans=0.1 2023-06-20 06:03:53,926 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.739e+02 2.704e+02 3.132e+02 3.952e+02 7.192e+02, threshold=6.264e+02, percent-clipped=3.0 2023-06-20 06:03:57,360 INFO [train.py:996] (2/4) Epoch 4, batch 16050, loss[loss=0.3241, simple_loss=0.4176, pruned_loss=0.1153, over 21661.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3246, pruned_loss=0.09201, over 4249453.72 frames. 
], batch size: 441, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:04:15,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=645264.0, ans=0.05 2023-06-20 06:04:32,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=645324.0, ans=0.0 2023-06-20 06:04:36,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=645324.0, ans=0.2 2023-06-20 06:04:46,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=645384.0, ans=0.0 2023-06-20 06:05:17,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=645444.0, ans=0.125 2023-06-20 06:05:26,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=645444.0, ans=0.125 2023-06-20 06:05:40,828 INFO [train.py:996] (2/4) Epoch 4, batch 16100, loss[loss=0.2726, simple_loss=0.3319, pruned_loss=0.1066, over 21828.00 frames. ], tot_loss[loss=0.2579, simple_loss=0.3282, pruned_loss=0.0938, over 4259366.71 frames. ], batch size: 441, lr: 7.94e-03, grad_scale: 32.0 2023-06-20 06:06:32,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=645684.0, ans=0.0 2023-06-20 06:07:03,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=645744.0, ans=10.0 2023-06-20 06:07:20,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 3.023e+02 3.501e+02 4.351e+02 8.172e+02, threshold=7.003e+02, percent-clipped=2.0 2023-06-20 06:07:23,554 INFO [train.py:996] (2/4) Epoch 4, batch 16150, loss[loss=0.2492, simple_loss=0.321, pruned_loss=0.08876, over 21886.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3295, pruned_loss=0.09653, over 4266241.19 frames. ], batch size: 332, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:07:34,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=645804.0, ans=0.125 2023-06-20 06:08:48,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.84 vs. limit=6.0 2023-06-20 06:09:06,401 INFO [train.py:996] (2/4) Epoch 4, batch 16200, loss[loss=0.3107, simple_loss=0.3758, pruned_loss=0.1228, over 21220.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3359, pruned_loss=0.09882, over 4274342.77 frames. ], batch size: 143, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:10:07,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=646284.0, ans=0.125 2023-06-20 06:10:28,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-20 06:10:40,631 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.687e+02 3.085e+02 4.076e+02 6.886e+02, threshold=6.170e+02, percent-clipped=1.0 2023-06-20 06:10:44,042 INFO [train.py:996] (2/4) Epoch 4, batch 16250, loss[loss=0.2034, simple_loss=0.2691, pruned_loss=0.06886, over 21174.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3353, pruned_loss=0.09801, over 4273172.18 frames. 
], batch size: 143, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:10:57,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=646404.0, ans=0.1 2023-06-20 06:10:59,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=646464.0, ans=0.0 2023-06-20 06:11:01,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-20 06:12:18,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=22.5 2023-06-20 06:12:19,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=646644.0, ans=0.05 2023-06-20 06:12:23,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-20 06:12:25,369 INFO [train.py:996] (2/4) Epoch 4, batch 16300, loss[loss=0.1709, simple_loss=0.22, pruned_loss=0.06094, over 17034.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3273, pruned_loss=0.09365, over 4253508.78 frames. ], batch size: 62, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:12:52,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=646764.0, ans=0.125 2023-06-20 06:12:59,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=646824.0, ans=0.2 2023-06-20 06:13:04,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=646824.0, ans=0.0 2023-06-20 06:14:05,366 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.535e+02 3.018e+02 3.417e+02 5.954e+02, threshold=6.036e+02, percent-clipped=0.0 2023-06-20 06:14:08,707 INFO [train.py:996] (2/4) Epoch 4, batch 16350, loss[loss=0.2626, simple_loss=0.3261, pruned_loss=0.09957, over 21703.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3276, pruned_loss=0.09505, over 4256199.27 frames. ], batch size: 298, lr: 7.93e-03, grad_scale: 32.0 2023-06-20 06:14:13,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.68 vs. limit=15.0 2023-06-20 06:14:55,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=647124.0, ans=0.2 2023-06-20 06:15:52,750 INFO [train.py:996] (2/4) Epoch 4, batch 16400, loss[loss=0.2518, simple_loss=0.3107, pruned_loss=0.09648, over 21806.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3306, pruned_loss=0.0967, over 4256463.00 frames. ], batch size: 247, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 06:16:14,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=647364.0, ans=0.0 2023-06-20 06:16:35,859 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. 
limit=15.0 2023-06-20 06:16:41,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=647424.0, ans=0.0 2023-06-20 06:17:12,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=647484.0, ans=0.125 2023-06-20 06:17:31,940 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.209e+02 2.888e+02 3.304e+02 3.936e+02 7.106e+02, threshold=6.607e+02, percent-clipped=3.0 2023-06-20 06:17:34,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-20 06:17:35,187 INFO [train.py:996] (2/4) Epoch 4, batch 16450, loss[loss=0.2517, simple_loss=0.3154, pruned_loss=0.09403, over 21923.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3309, pruned_loss=0.09862, over 4265423.38 frames. ], batch size: 124, lr: 7.92e-03, grad_scale: 32.0 2023-06-20 06:17:39,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=647604.0, ans=0.125 2023-06-20 06:19:03,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=647844.0, ans=0.2 2023-06-20 06:19:17,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-20 06:19:19,935 INFO [train.py:996] (2/4) Epoch 4, batch 16500, loss[loss=0.2637, simple_loss=0.3245, pruned_loss=0.1014, over 21818.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3282, pruned_loss=0.09817, over 4275031.83 frames. ], batch size: 316, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:20:32,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-20 06:20:36,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=648084.0, ans=0.0 2023-06-20 06:20:39,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=15.0 2023-06-20 06:20:42,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=648084.0, ans=0.125 2023-06-20 06:20:52,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=648144.0, ans=0.0 2023-06-20 06:20:57,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=648144.0, ans=0.0 2023-06-20 06:21:03,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.948e+02 3.472e+02 4.236e+02 9.691e+02, threshold=6.943e+02, percent-clipped=9.0 2023-06-20 06:21:05,584 INFO [train.py:996] (2/4) Epoch 4, batch 16550, loss[loss=0.2832, simple_loss=0.3671, pruned_loss=0.09968, over 19913.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.325, pruned_loss=0.09417, over 4274097.46 frames. 
], batch size: 702, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:21:10,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=648204.0, ans=0.125 2023-06-20 06:21:17,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=648204.0, ans=0.0 2023-06-20 06:21:29,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=648204.0, ans=0.125 2023-06-20 06:22:12,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=648324.0, ans=0.1 2023-06-20 06:22:22,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-20 06:22:22,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.65 vs. limit=6.0 2023-06-20 06:22:28,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=648384.0, ans=0.125 2023-06-20 06:22:42,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=648444.0, ans=22.5 2023-06-20 06:22:59,782 INFO [train.py:996] (2/4) Epoch 4, batch 16600, loss[loss=0.2919, simple_loss=0.3692, pruned_loss=0.1073, over 21156.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3356, pruned_loss=0.09866, over 4277884.43 frames. ], batch size: 143, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:23:26,321 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-20 06:23:27,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=648564.0, ans=0.125 2023-06-20 06:23:44,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=648624.0, ans=0.2 2023-06-20 06:23:54,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=648624.0, ans=0.1 2023-06-20 06:24:45,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=648744.0, ans=0.125 2023-06-20 06:24:48,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.275e+02 4.120e+02 5.174e+02 8.172e+02, threshold=8.240e+02, percent-clipped=2.0 2023-06-20 06:24:49,947 INFO [train.py:996] (2/4) Epoch 4, batch 16650, loss[loss=0.2791, simple_loss=0.3498, pruned_loss=0.1043, over 21995.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3482, pruned_loss=0.1031, over 4277972.47 frames. 
], batch size: 317, lr: 7.92e-03, grad_scale: 16.0 2023-06-20 06:25:28,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=648924.0, ans=0.1 2023-06-20 06:25:42,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=648924.0, ans=0.0 2023-06-20 06:25:52,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=648984.0, ans=0.0 2023-06-20 06:25:58,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=648984.0, ans=0.0 2023-06-20 06:26:25,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.45 vs. limit=15.0 2023-06-20 06:26:41,408 INFO [train.py:996] (2/4) Epoch 4, batch 16700, loss[loss=0.2186, simple_loss=0.283, pruned_loss=0.07708, over 21453.00 frames. ], tot_loss[loss=0.2791, simple_loss=0.3495, pruned_loss=0.1044, over 4273237.40 frames. ], batch size: 194, lr: 7.91e-03, grad_scale: 16.0 2023-06-20 06:26:58,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=649164.0, ans=0.0 2023-06-20 06:27:02,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=649164.0, ans=0.125 2023-06-20 06:27:26,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=649224.0, ans=0.0 2023-06-20 06:28:26,848 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.053e+02 3.657e+02 4.357e+02 8.504e+02, threshold=7.314e+02, percent-clipped=1.0 2023-06-20 06:28:28,517 INFO [train.py:996] (2/4) Epoch 4, batch 16750, loss[loss=0.3038, simple_loss=0.3755, pruned_loss=0.116, over 21931.00 frames. ], tot_loss[loss=0.2819, simple_loss=0.3515, pruned_loss=0.1061, over 4275776.59 frames. ], batch size: 317, lr: 7.91e-03, grad_scale: 16.0 2023-06-20 06:28:29,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=649404.0, ans=0.125 2023-06-20 06:28:31,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=649404.0, ans=0.1 2023-06-20 06:28:54,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=649464.0, ans=0.125 2023-06-20 06:29:12,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=649464.0, ans=0.125 2023-06-20 06:29:59,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=649644.0, ans=0.0 2023-06-20 06:30:12,946 INFO [train.py:996] (2/4) Epoch 4, batch 16800, loss[loss=0.3269, simple_loss=0.4121, pruned_loss=0.1209, over 19874.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3552, pruned_loss=0.1054, over 4274243.78 frames. 
], batch size: 703, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:31:03,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=649764.0, ans=0.125 2023-06-20 06:31:05,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=649824.0, ans=0.125 2023-06-20 06:31:14,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=649824.0, ans=0.1 2023-06-20 06:31:34,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=649884.0, ans=0.09899494936611666 2023-06-20 06:31:58,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.418e+02 3.215e+02 3.899e+02 5.075e+02 9.613e+02, threshold=7.798e+02, percent-clipped=9.0 2023-06-20 06:32:00,541 INFO [train.py:996] (2/4) Epoch 4, batch 16850, loss[loss=0.241, simple_loss=0.302, pruned_loss=0.09003, over 21809.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3516, pruned_loss=0.1056, over 4273251.20 frames. ], batch size: 247, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:33:17,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=15.0 2023-06-20 06:33:37,219 INFO [train.py:996] (2/4) Epoch 4, batch 16900, loss[loss=0.2266, simple_loss=0.2944, pruned_loss=0.07935, over 21568.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.346, pruned_loss=0.104, over 4280872.98 frames. ], batch size: 230, lr: 7.91e-03, grad_scale: 32.0 2023-06-20 06:33:38,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.91 vs. limit=10.0 2023-06-20 06:34:55,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=650484.0, ans=0.2 2023-06-20 06:35:10,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=650544.0, ans=0.1 2023-06-20 06:35:16,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.89 vs. limit=22.5 2023-06-20 06:35:17,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.495e+02 2.882e+02 3.347e+02 4.730e+02, threshold=5.764e+02, percent-clipped=0.0 2023-06-20 06:35:18,596 INFO [train.py:996] (2/4) Epoch 4, batch 16950, loss[loss=0.231, simple_loss=0.2998, pruned_loss=0.08105, over 21907.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3382, pruned_loss=0.1021, over 4278844.45 frames. ], batch size: 316, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:37:00,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=650904.0, ans=0.125 2023-06-20 06:37:00,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=650904.0, ans=0.2 2023-06-20 06:37:01,053 INFO [train.py:996] (2/4) Epoch 4, batch 17000, loss[loss=0.2695, simple_loss=0.3277, pruned_loss=0.1057, over 21737.00 frames. ], tot_loss[loss=0.2695, simple_loss=0.3348, pruned_loss=0.1021, over 4287291.89 frames. 
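
The [scaling.py:962] Whitening entries above compare a per-module "metric" against a schedule-controlled limit (e.g. metric=5.33 vs. limit=15.0); the metric measures how far the activation covariance is from being white. Below is a rough sketch of one such measure, assuming the ratio E[lambda^2] / E[lambda]^2 over covariance eigenvalues, which is 1.0 for perfectly whitened features; this illustrates the idea and is not necessarily the exact scaling.py formula.

```python
import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Illustrative 'how non-white is x' score, >= 1.0, equal to 1.0 when the
    (uncentered) within-group covariance is a multiple of the identity.

    x: (num_frames, num_channels) activations; num_channels must be divisible by num_groups.
    """
    num_frames, num_channels = x.shape
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    metric = 0.0
    for g in range(num_groups):
        cov = x[:, g, :].T @ x[:, g, :] / num_frames  # (c, c) uncentered covariance
        eigs = torch.linalg.eigvalsh(cov)
        metric += ((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20)).item()
    return metric / num_groups

# whitening_metric(torch.randn(1000, 256)) stays close to 1 for independent channels,
# while strongly correlated channels push it well above the limits logged here.
```
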
], batch size: 230, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:37:46,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=650964.0, ans=15.0 2023-06-20 06:38:28,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=651144.0, ans=0.1 2023-06-20 06:38:36,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=651144.0, ans=0.09899494936611666 2023-06-20 06:38:57,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.930e+02 3.368e+02 4.146e+02 7.848e+02, threshold=6.737e+02, percent-clipped=5.0 2023-06-20 06:38:57,206 INFO [train.py:996] (2/4) Epoch 4, batch 17050, loss[loss=0.3033, simple_loss=0.3876, pruned_loss=0.1095, over 21802.00 frames. ], tot_loss[loss=0.2769, simple_loss=0.3429, pruned_loss=0.1055, over 4288762.88 frames. ], batch size: 332, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:40:03,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=651384.0, ans=0.1 2023-06-20 06:40:09,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=651444.0, ans=0.0 2023-06-20 06:40:12,033 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.12 vs. limit=15.0 2023-06-20 06:40:20,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=651444.0, ans=0.125 2023-06-20 06:40:37,635 INFO [train.py:996] (2/4) Epoch 4, batch 17100, loss[loss=0.2963, simple_loss=0.3581, pruned_loss=0.1173, over 21776.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3429, pruned_loss=0.1066, over 4293078.19 frames. ], batch size: 112, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:40:38,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=651504.0, ans=0.04949747468305833 2023-06-20 06:41:21,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=651624.0, ans=0.125 2023-06-20 06:42:19,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.907e+02 3.321e+02 3.692e+02 6.035e+02, threshold=6.643e+02, percent-clipped=0.0 2023-06-20 06:42:19,637 INFO [train.py:996] (2/4) Epoch 4, batch 17150, loss[loss=0.2286, simple_loss=0.3022, pruned_loss=0.07753, over 21667.00 frames. ], tot_loss[loss=0.2731, simple_loss=0.3369, pruned_loss=0.1047, over 4295962.80 frames. ], batch size: 389, lr: 7.90e-03, grad_scale: 16.0 2023-06-20 06:43:24,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=651984.0, ans=0.125 2023-06-20 06:44:07,980 INFO [train.py:996] (2/4) Epoch 4, batch 17200, loss[loss=0.2719, simple_loss=0.3404, pruned_loss=0.1017, over 21768.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3361, pruned_loss=0.1034, over 4296440.57 frames. 
], batch size: 441, lr: 7.90e-03, grad_scale: 32.0 2023-06-20 06:44:23,764 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 06:44:48,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=22.5 2023-06-20 06:45:41,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=652344.0, ans=0.125 2023-06-20 06:45:57,641 INFO [train.py:996] (2/4) Epoch 4, batch 17250, loss[loss=0.2932, simple_loss=0.3554, pruned_loss=0.1155, over 21466.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3401, pruned_loss=0.1059, over 4295373.99 frames. ], batch size: 211, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:45:59,494 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.975e+02 3.249e+02 4.201e+02 6.802e+02, threshold=6.498e+02, percent-clipped=2.0 2023-06-20 06:46:29,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=652464.0, ans=15.0 2023-06-20 06:47:07,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=652584.0, ans=0.0 2023-06-20 06:47:12,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=652584.0, ans=0.125 2023-06-20 06:47:36,658 INFO [train.py:996] (2/4) Epoch 4, batch 17300, loss[loss=0.3311, simple_loss=0.4143, pruned_loss=0.124, over 20801.00 frames. ], tot_loss[loss=0.2853, simple_loss=0.35, pruned_loss=0.1103, over 4296814.33 frames. ], batch size: 607, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:48:08,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=652824.0, ans=0.2 2023-06-20 06:49:17,064 INFO [train.py:996] (2/4) Epoch 4, batch 17350, loss[loss=0.2951, simple_loss=0.3802, pruned_loss=0.105, over 21486.00 frames. ], tot_loss[loss=0.2841, simple_loss=0.3499, pruned_loss=0.1091, over 4288797.73 frames. ], batch size: 471, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:49:18,636 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 3.182e+02 3.780e+02 4.285e+02 5.975e+02, threshold=7.560e+02, percent-clipped=0.0 2023-06-20 06:49:19,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=653004.0, ans=0.04949747468305833 2023-06-20 06:49:27,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-20 06:49:32,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.20 vs. limit=15.0 2023-06-20 06:49:37,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=653064.0, ans=0.2 2023-06-20 06:50:55,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=653244.0, ans=0.125 2023-06-20 06:50:59,202 INFO [train.py:996] (2/4) Epoch 4, batch 17400, loss[loss=0.2813, simple_loss=0.3831, pruned_loss=0.08972, over 20699.00 frames. ], tot_loss[loss=0.2781, simple_loss=0.3473, pruned_loss=0.1044, over 4287828.00 frames. 
], batch size: 608, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:51:03,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=653304.0, ans=0.0 2023-06-20 06:52:42,173 INFO [train.py:996] (2/4) Epoch 4, batch 17450, loss[loss=0.24, simple_loss=0.3305, pruned_loss=0.07471, over 21588.00 frames. ], tot_loss[loss=0.2706, simple_loss=0.3402, pruned_loss=0.1005, over 4278364.19 frames. ], batch size: 441, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:52:43,980 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.974e+02 3.600e+02 4.231e+02 7.262e+02, threshold=7.200e+02, percent-clipped=0.0 2023-06-20 06:52:58,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-20 06:53:34,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. limit=12.0 2023-06-20 06:53:37,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=653724.0, ans=0.125 2023-06-20 06:54:06,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=653844.0, ans=0.07 2023-06-20 06:54:20,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=653844.0, ans=0.125 2023-06-20 06:54:28,288 INFO [train.py:996] (2/4) Epoch 4, batch 17500, loss[loss=0.2634, simple_loss=0.3146, pruned_loss=0.1061, over 21570.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.3346, pruned_loss=0.0976, over 4282306.74 frames. ], batch size: 212, lr: 7.89e-03, grad_scale: 16.0 2023-06-20 06:54:57,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=653964.0, ans=15.0 2023-06-20 06:55:04,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=653964.0, ans=0.125 2023-06-20 06:55:05,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=653964.0, ans=0.04949747468305833 2023-06-20 06:55:06,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=653964.0, ans=0.1 2023-06-20 06:55:21,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=12.0 2023-06-20 06:55:51,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=654144.0, ans=0.07 2023-06-20 06:56:02,411 INFO [train.py:996] (2/4) Epoch 4, batch 17550, loss[loss=0.2354, simple_loss=0.3196, pruned_loss=0.07556, over 21440.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3335, pruned_loss=0.09588, over 4291434.33 frames. 
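
The [scaling.py:182] ScheduledFloat entries record, for each named hyper-parameter, the value ("ans") in effect at the current batch_count; the skip rates, dropout probabilities and whitening limits appear to be annealed on such schedules rather than held fixed. Below is a minimal sketch of a piecewise-linear batch-count schedule that would produce this kind of log line; the class name and the breakpoints are made-up examples, not the ones used in this run.

```python
import bisect

class PiecewiseLinearSchedule:
    """Value interpolated linearly between (batch_count, value) breakpoints
    and clamped to the end values outside them."""

    def __init__(self, *points):
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count) - 1
        frac = (batch_count - self.xs[i]) / (self.xs[i + 1] - self.xs[i])
        return self.ys[i] + frac * (self.ys[i + 1] - self.ys[i])

# Hypothetical skip-rate schedule: 0.2 at the start of training, decayed to 0.0.
skip_rate = PiecewiseLinearSchedule((0.0, 0.2), (20000.0, 0.0))
print(skip_rate(653304.0))  # -> 0.0, matching the fully decayed skip-rate entries above
```
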
], batch size: 211, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 06:56:04,009 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.582e+02 2.427e+02 2.904e+02 3.611e+02 6.733e+02, threshold=5.808e+02, percent-clipped=0.0 2023-06-20 06:56:26,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=654204.0, ans=0.125 2023-06-20 06:56:29,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=654264.0, ans=0.2 2023-06-20 06:56:52,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=654324.0, ans=0.0 2023-06-20 06:56:56,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=654324.0, ans=15.0 2023-06-20 06:57:43,307 INFO [train.py:996] (2/4) Epoch 4, batch 17600, loss[loss=0.2895, simple_loss=0.353, pruned_loss=0.113, over 21403.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3366, pruned_loss=0.09648, over 4270463.23 frames. ], batch size: 176, lr: 7.88e-03, grad_scale: 32.0 2023-06-20 06:58:03,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=654504.0, ans=0.125 2023-06-20 06:58:27,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=654564.0, ans=0.025 2023-06-20 06:58:40,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=654624.0, ans=0.0 2023-06-20 06:58:43,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-20 06:59:14,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=654744.0, ans=0.2 2023-06-20 06:59:32,095 INFO [train.py:996] (2/4) Epoch 4, batch 17650, loss[loss=0.2228, simple_loss=0.2956, pruned_loss=0.07499, over 21860.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3344, pruned_loss=0.09658, over 4262570.79 frames. ], batch size: 317, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 06:59:40,729 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 3.037e+02 3.780e+02 4.490e+02 8.251e+02, threshold=7.559e+02, percent-clipped=12.0 2023-06-20 07:01:20,972 INFO [train.py:996] (2/4) Epoch 4, batch 17700, loss[loss=0.2895, simple_loss=0.3678, pruned_loss=0.1056, over 21677.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3291, pruned_loss=0.09389, over 4267563.53 frames. ], batch size: 332, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 07:01:29,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.47 vs. 
limit=15.0 2023-06-20 07:01:40,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=655104.0, ans=0.125 2023-06-20 07:01:48,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=655164.0, ans=0.125 2023-06-20 07:01:48,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=655164.0, ans=0.0 2023-06-20 07:02:15,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=655284.0, ans=0.2 2023-06-20 07:02:43,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=655344.0, ans=0.0 2023-06-20 07:03:06,211 INFO [train.py:996] (2/4) Epoch 4, batch 17750, loss[loss=0.2738, simple_loss=0.3483, pruned_loss=0.09965, over 21759.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3354, pruned_loss=0.09689, over 4263181.37 frames. ], batch size: 332, lr: 7.88e-03, grad_scale: 16.0 2023-06-20 07:03:09,450 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.874e+02 3.454e+02 4.122e+02 5.655e+02, threshold=6.909e+02, percent-clipped=0.0 2023-06-20 07:04:50,180 INFO [train.py:996] (2/4) Epoch 4, batch 17800, loss[loss=0.2144, simple_loss=0.2895, pruned_loss=0.06961, over 21350.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3353, pruned_loss=0.09602, over 4261980.53 frames. ], batch size: 211, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:06:09,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=655884.0, ans=0.0 2023-06-20 07:06:11,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=655884.0, ans=0.125 2023-06-20 07:06:12,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=655884.0, ans=0.125 2023-06-20 07:06:24,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.70 vs. limit=12.0 2023-06-20 07:06:33,130 INFO [train.py:996] (2/4) Epoch 4, batch 17850, loss[loss=0.2524, simple_loss=0.3235, pruned_loss=0.09069, over 21625.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3354, pruned_loss=0.09579, over 4269796.65 frames. ], batch size: 230, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:06:36,780 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.792e+02 3.277e+02 4.116e+02 6.981e+02, threshold=6.554e+02, percent-clipped=1.0 2023-06-20 07:07:24,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=656124.0, ans=0.0 2023-06-20 07:07:53,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=656184.0, ans=0.125 2023-06-20 07:07:57,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.51 vs. limit=10.0 2023-06-20 07:08:12,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=656244.0, ans=0.125 2023-06-20 07:08:17,936 INFO [train.py:996] (2/4) Epoch 4, batch 17900, loss[loss=0.3348, simple_loss=0.3972, pruned_loss=0.1362, over 21595.00 frames. 
], tot_loss[loss=0.2701, simple_loss=0.3421, pruned_loss=0.09907, over 4269252.14 frames. ], batch size: 389, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:09:20,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=656424.0, ans=0.125 2023-06-20 07:09:30,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=656484.0, ans=0.0 2023-06-20 07:09:36,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-20 07:09:41,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=656484.0, ans=0.0 2023-06-20 07:09:57,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=656544.0, ans=0.1 2023-06-20 07:10:01,334 INFO [train.py:996] (2/4) Epoch 4, batch 17950, loss[loss=0.2254, simple_loss=0.302, pruned_loss=0.07442, over 21497.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3413, pruned_loss=0.09561, over 4267473.31 frames. ], batch size: 195, lr: 7.87e-03, grad_scale: 16.0 2023-06-20 07:10:04,347 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.801e+02 2.785e+02 3.219e+02 3.835e+02 8.514e+02, threshold=6.438e+02, percent-clipped=3.0 2023-06-20 07:10:51,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-20 07:10:57,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=656724.0, ans=0.2 2023-06-20 07:11:44,104 INFO [train.py:996] (2/4) Epoch 4, batch 18000, loss[loss=0.2493, simple_loss=0.2957, pruned_loss=0.1014, over 21320.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3335, pruned_loss=0.09363, over 4263590.97 frames. ], batch size: 160, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 07:11:44,105 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 07:12:05,686 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2767, simple_loss=0.3741, pruned_loss=0.08966, over 1796401.00 frames. 2023-06-20 07:12:05,687 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 07:12:17,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.44 vs. limit=15.0 2023-06-20 07:12:40,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=656964.0, ans=0.125 2023-06-20 07:13:00,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=657024.0, ans=0.035 2023-06-20 07:13:28,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=657144.0, ans=0.125 2023-06-20 07:13:55,483 INFO [train.py:996] (2/4) Epoch 4, batch 18050, loss[loss=0.2959, simple_loss=0.3583, pruned_loss=0.1168, over 21835.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3292, pruned_loss=0.09357, over 4255466.02 frames. 
], batch size: 124, lr: 7.87e-03, grad_scale: 32.0 2023-06-20 07:14:03,901 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.050e+02 3.056e+02 3.833e+02 4.791e+02 8.139e+02, threshold=7.666e+02, percent-clipped=8.0 2023-06-20 07:14:27,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=657264.0, ans=0.125 2023-06-20 07:14:36,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=657264.0, ans=0.125 2023-06-20 07:14:49,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=657324.0, ans=0.125 2023-06-20 07:14:54,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-20 07:15:41,244 INFO [train.py:996] (2/4) Epoch 4, batch 18100, loss[loss=0.2579, simple_loss=0.3535, pruned_loss=0.08109, over 21679.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3346, pruned_loss=0.09701, over 4262064.94 frames. ], batch size: 351, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:15:50,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=657504.0, ans=0.125 2023-06-20 07:16:28,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=657624.0, ans=0.125 2023-06-20 07:16:36,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=657684.0, ans=0.0 2023-06-20 07:17:10,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=657744.0, ans=0.125 2023-06-20 07:17:26,437 INFO [train.py:996] (2/4) Epoch 4, batch 18150, loss[loss=0.279, simple_loss=0.3385, pruned_loss=0.1097, over 21498.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3361, pruned_loss=0.09672, over 4268187.98 frames. ], batch size: 441, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:17:31,383 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.784e+02 3.397e+02 4.397e+02 8.554e+02, threshold=6.794e+02, percent-clipped=1.0 2023-06-20 07:17:57,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=657924.0, ans=0.5 2023-06-20 07:18:05,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=657924.0, ans=0.125 2023-06-20 07:18:10,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=657924.0, ans=0.05 2023-06-20 07:18:47,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.70 vs. limit=15.0 2023-06-20 07:19:00,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=658044.0, ans=0.125 2023-06-20 07:19:02,658 INFO [train.py:996] (2/4) Epoch 4, batch 18200, loss[loss=0.2746, simple_loss=0.3264, pruned_loss=0.1114, over 21412.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3297, pruned_loss=0.09632, over 4271154.83 frames. 
], batch size: 473, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:19:59,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=658284.0, ans=0.125 2023-06-20 07:20:00,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-20 07:20:03,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-20 07:20:09,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=658344.0, ans=0.0 2023-06-20 07:20:15,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=658344.0, ans=0.125 2023-06-20 07:20:21,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=658344.0, ans=0.125 2023-06-20 07:20:31,256 INFO [train.py:996] (2/4) Epoch 4, batch 18250, loss[loss=0.1962, simple_loss=0.266, pruned_loss=0.06318, over 21759.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3196, pruned_loss=0.0917, over 4276097.79 frames. ], batch size: 124, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:20:41,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.911e+02 2.684e+02 3.179e+02 3.917e+02 6.277e+02, threshold=6.359e+02, percent-clipped=0.0 2023-06-20 07:20:44,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=658404.0, ans=15.0 2023-06-20 07:21:11,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=658464.0, ans=0.05 2023-06-20 07:21:34,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=658584.0, ans=0.1 2023-06-20 07:21:34,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=658584.0, ans=0.2 2023-06-20 07:21:35,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=658584.0, ans=0.125 2023-06-20 07:21:44,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-20 07:22:07,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.07 vs. limit=10.0 2023-06-20 07:22:08,080 INFO [train.py:996] (2/4) Epoch 4, batch 18300, loss[loss=0.2899, simple_loss=0.3908, pruned_loss=0.09452, over 21876.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.321, pruned_loss=0.09259, over 4282004.63 frames. ], batch size: 316, lr: 7.86e-03, grad_scale: 16.0 2023-06-20 07:22:26,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. 
limit=15.0 2023-06-20 07:22:52,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=658764.0, ans=0.125 2023-06-20 07:22:54,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=658764.0, ans=0.07 2023-06-20 07:23:02,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=658824.0, ans=0.0 2023-06-20 07:23:27,919 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:23:49,859 INFO [train.py:996] (2/4) Epoch 4, batch 18350, loss[loss=0.2778, simple_loss=0.3834, pruned_loss=0.08611, over 20958.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3256, pruned_loss=0.09283, over 4269332.33 frames. ], batch size: 607, lr: 7.85e-03, grad_scale: 16.0 2023-06-20 07:24:00,023 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.835e+02 3.516e+02 4.714e+02 7.993e+02, threshold=7.032e+02, percent-clipped=4.0 2023-06-20 07:24:17,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=659004.0, ans=0.1 2023-06-20 07:24:35,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=659064.0, ans=0.125 2023-06-20 07:24:50,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=659124.0, ans=0.1 2023-06-20 07:25:37,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=659304.0, ans=0.025 2023-06-20 07:25:38,246 INFO [train.py:996] (2/4) Epoch 4, batch 18400, loss[loss=0.2467, simple_loss=0.3106, pruned_loss=0.09141, over 21470.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3215, pruned_loss=0.09165, over 4274260.41 frames. ], batch size: 441, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:25:48,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=659304.0, ans=0.125 2023-06-20 07:26:15,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.08 vs. limit=12.0 2023-06-20 07:26:27,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=659424.0, ans=15.0 2023-06-20 07:26:56,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-20 07:27:27,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.56 vs. limit=10.0 2023-06-20 07:27:27,675 INFO [train.py:996] (2/4) Epoch 4, batch 18450, loss[loss=0.2543, simple_loss=0.3226, pruned_loss=0.09297, over 21597.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3168, pruned_loss=0.08681, over 4274241.21 frames. 
], batch size: 391, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:27:37,443 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.607e+02 3.111e+02 4.080e+02 7.142e+02, threshold=6.222e+02, percent-clipped=1.0 2023-06-20 07:27:41,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=659604.0, ans=0.0 2023-06-20 07:28:10,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=659724.0, ans=0.0 2023-06-20 07:28:25,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-20 07:28:48,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=659844.0, ans=0.125 2023-06-20 07:29:02,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=659904.0, ans=0.0 2023-06-20 07:29:09,406 INFO [train.py:996] (2/4) Epoch 4, batch 18500, loss[loss=0.2197, simple_loss=0.3014, pruned_loss=0.06905, over 21180.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3113, pruned_loss=0.08575, over 4267122.36 frames. ], batch size: 548, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:29:17,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-20 07:29:58,241 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-20 07:30:11,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=660084.0, ans=0.125 2023-06-20 07:30:28,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.89 vs. limit=15.0 2023-06-20 07:30:39,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=660144.0, ans=0.125 2023-06-20 07:30:51,282 INFO [train.py:996] (2/4) Epoch 4, batch 18550, loss[loss=0.2661, simple_loss=0.3115, pruned_loss=0.1103, over 21758.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3117, pruned_loss=0.08552, over 4264637.05 frames. 
], batch size: 102, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:31:01,303 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.666e+02 3.227e+02 3.857e+02 6.093e+02, threshold=6.453e+02, percent-clipped=0.0 2023-06-20 07:31:13,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=660204.0, ans=0.0 2023-06-20 07:31:20,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=660264.0, ans=15.0 2023-06-20 07:31:28,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=660264.0, ans=0.125 2023-06-20 07:31:31,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=660264.0, ans=0.0 2023-06-20 07:31:33,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=15.0 2023-06-20 07:31:54,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=660384.0, ans=0.125 2023-06-20 07:31:58,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=660384.0, ans=0.125 2023-06-20 07:32:03,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=660384.0, ans=0.125 2023-06-20 07:32:16,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=660444.0, ans=0.1 2023-06-20 07:32:37,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=660444.0, ans=0.0 2023-06-20 07:32:39,757 INFO [train.py:996] (2/4) Epoch 4, batch 18600, loss[loss=0.2262, simple_loss=0.2887, pruned_loss=0.08183, over 21217.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3113, pruned_loss=0.08744, over 4263686.02 frames. ], batch size: 143, lr: 7.85e-03, grad_scale: 32.0 2023-06-20 07:33:07,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=660564.0, ans=0.125 2023-06-20 07:33:22,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=660624.0, ans=0.125 2023-06-20 07:33:29,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-20 07:33:44,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-20 07:34:03,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=660744.0, ans=0.125 2023-06-20 07:34:15,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=660804.0, ans=0.035 2023-06-20 07:34:16,721 INFO [train.py:996] (2/4) Epoch 4, batch 18650, loss[loss=0.1997, simple_loss=0.2707, pruned_loss=0.06434, over 21588.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3085, pruned_loss=0.0868, over 4262056.32 frames. 
], batch size: 263, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:34:26,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.903e+02 3.307e+02 4.145e+02 6.218e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-20 07:35:04,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=660924.0, ans=0.0 2023-06-20 07:35:45,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=661044.0, ans=0.125 2023-06-20 07:35:47,927 INFO [train.py:996] (2/4) Epoch 4, batch 18700, loss[loss=0.2905, simple_loss=0.3455, pruned_loss=0.1178, over 21865.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.309, pruned_loss=0.08944, over 4271840.87 frames. ], batch size: 118, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:36:35,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=661224.0, ans=0.125 2023-06-20 07:36:56,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-20 07:37:02,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=661284.0, ans=0.1 2023-06-20 07:37:30,163 INFO [train.py:996] (2/4) Epoch 4, batch 18750, loss[loss=0.2584, simple_loss=0.3288, pruned_loss=0.09396, over 21394.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3136, pruned_loss=0.09381, over 4264437.58 frames. ], batch size: 211, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:37:45,241 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.659e+02 3.125e+02 3.916e+02 7.035e+02, threshold=6.249e+02, percent-clipped=1.0 2023-06-20 07:37:51,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=661404.0, ans=0.0 2023-06-20 07:37:53,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=12.0 2023-06-20 07:38:49,007 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.28 vs. limit=6.0 2023-06-20 07:39:02,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=661644.0, ans=0.2 2023-06-20 07:39:06,611 INFO [train.py:996] (2/4) Epoch 4, batch 18800, loss[loss=0.2462, simple_loss=0.3445, pruned_loss=0.07399, over 21339.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3167, pruned_loss=0.09409, over 4251175.09 frames. 
], batch size: 548, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:39:09,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=661704.0, ans=0.1 2023-06-20 07:39:17,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=661704.0, ans=0.0 2023-06-20 07:39:34,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=661704.0, ans=0.125 2023-06-20 07:39:49,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=661764.0, ans=0.125 2023-06-20 07:40:55,491 INFO [train.py:996] (2/4) Epoch 4, batch 18850, loss[loss=0.1819, simple_loss=0.2659, pruned_loss=0.04894, over 21356.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3125, pruned_loss=0.08841, over 4255985.11 frames. ], batch size: 211, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:41:00,440 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.806e+02 2.538e+02 3.009e+02 3.652e+02 6.341e+02, threshold=6.019e+02, percent-clipped=1.0 2023-06-20 07:42:32,265 INFO [train.py:996] (2/4) Epoch 4, batch 18900, loss[loss=0.2249, simple_loss=0.2829, pruned_loss=0.08342, over 14751.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3084, pruned_loss=0.08828, over 4245019.16 frames. ], batch size: 61, lr: 7.84e-03, grad_scale: 32.0 2023-06-20 07:42:33,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.64 vs. limit=6.0 2023-06-20 07:43:23,617 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:44:02,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=662544.0, ans=0.0 2023-06-20 07:44:09,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=662604.0, ans=0.0 2023-06-20 07:44:10,287 INFO [train.py:996] (2/4) Epoch 4, batch 18950, loss[loss=0.2622, simple_loss=0.3546, pruned_loss=0.08495, over 21646.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3102, pruned_loss=0.09137, over 4248883.28 frames. ], batch size: 263, lr: 7.83e-03, grad_scale: 32.0 2023-06-20 07:44:25,541 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 2.815e+02 3.147e+02 3.726e+02 6.285e+02, threshold=6.294e+02, percent-clipped=0.0 2023-06-20 07:45:14,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=662784.0, ans=0.035 2023-06-20 07:45:14,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=662784.0, ans=0.1 2023-06-20 07:46:03,993 INFO [train.py:996] (2/4) Epoch 4, batch 19000, loss[loss=0.2947, simple_loss=0.3564, pruned_loss=0.1165, over 21728.00 frames. ], tot_loss[loss=0.255, simple_loss=0.321, pruned_loss=0.09454, over 4254237.09 frames. 
], batch size: 298, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:46:11,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=662904.0, ans=0.0 2023-06-20 07:46:28,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=662964.0, ans=0.07 2023-06-20 07:47:09,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=663144.0, ans=0.125 2023-06-20 07:47:47,404 INFO [train.py:996] (2/4) Epoch 4, batch 19050, loss[loss=0.3463, simple_loss=0.3773, pruned_loss=0.1576, over 21633.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3268, pruned_loss=0.0991, over 4267567.55 frames. ], batch size: 507, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:47:53,693 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.224e+02 3.773e+02 4.391e+02 1.056e+03, threshold=7.547e+02, percent-clipped=6.0 2023-06-20 07:48:08,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=663264.0, ans=0.04949747468305833 2023-06-20 07:48:33,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.75 vs. limit=6.0 2023-06-20 07:48:47,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=663384.0, ans=0.0 2023-06-20 07:48:57,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-20 07:49:28,675 INFO [train.py:996] (2/4) Epoch 4, batch 19100, loss[loss=0.237, simple_loss=0.2957, pruned_loss=0.08915, over 21586.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3254, pruned_loss=0.09923, over 4278564.19 frames. ], batch size: 263, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:51:14,834 INFO [train.py:996] (2/4) Epoch 4, batch 19150, loss[loss=0.4359, simple_loss=0.4914, pruned_loss=0.1902, over 21417.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3299, pruned_loss=0.101, over 4277076.52 frames. ], batch size: 507, lr: 7.83e-03, grad_scale: 16.0 2023-06-20 07:51:21,647 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.031e+02 3.382e+02 4.089e+02 6.377e+02, threshold=6.763e+02, percent-clipped=0.0 2023-06-20 07:51:42,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=663864.0, ans=0.125 2023-06-20 07:52:21,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-06-20 07:52:26,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-06-20 07:52:50,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=664044.0, ans=0.0 2023-06-20 07:52:57,827 INFO [train.py:996] (2/4) Epoch 4, batch 19200, loss[loss=0.2775, simple_loss=0.3817, pruned_loss=0.08668, over 21653.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3374, pruned_loss=0.1001, over 4278372.21 frames. 
], batch size: 389, lr: 7.82e-03, grad_scale: 32.0 2023-06-20 07:53:00,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=664104.0, ans=0.125 2023-06-20 07:53:16,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=664164.0, ans=6.0 2023-06-20 07:53:16,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=664164.0, ans=0.1 2023-06-20 07:53:34,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-20 07:54:09,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=664284.0, ans=0.1 2023-06-20 07:54:37,029 INFO [train.py:996] (2/4) Epoch 4, batch 19250, loss[loss=0.3041, simple_loss=0.4199, pruned_loss=0.09413, over 20758.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.338, pruned_loss=0.09451, over 4281541.25 frames. ], batch size: 607, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:54:39,021 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 07:54:42,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=664404.0, ans=0.125 2023-06-20 07:54:43,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=664404.0, ans=0.125 2023-06-20 07:54:44,732 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.666e+02 2.530e+02 3.089e+02 4.105e+02 6.871e+02, threshold=6.178e+02, percent-clipped=1.0 2023-06-20 07:54:51,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=664464.0, ans=0.125 2023-06-20 07:55:04,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-20 07:56:13,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=664644.0, ans=22.5 2023-06-20 07:56:19,722 INFO [train.py:996] (2/4) Epoch 4, batch 19300, loss[loss=0.2019, simple_loss=0.2867, pruned_loss=0.05861, over 21797.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3349, pruned_loss=0.09403, over 4292472.96 frames. ], batch size: 102, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:57:23,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=664824.0, ans=10.0 2023-06-20 07:58:04,214 INFO [train.py:996] (2/4) Epoch 4, batch 19350, loss[loss=0.3223, simple_loss=0.3849, pruned_loss=0.1299, over 21571.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3292, pruned_loss=0.09095, over 4288774.56 frames. ], batch size: 509, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 07:58:12,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.563e+02 3.117e+02 3.921e+02 9.439e+02, threshold=6.235e+02, percent-clipped=6.0 2023-06-20 07:58:25,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.83 vs. 
limit=22.5 2023-06-20 07:58:41,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=665124.0, ans=0.0 2023-06-20 07:59:39,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-20 07:59:43,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=665244.0, ans=0.125 2023-06-20 07:59:46,550 INFO [train.py:996] (2/4) Epoch 4, batch 19400, loss[loss=0.2602, simple_loss=0.3202, pruned_loss=0.1002, over 21843.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.325, pruned_loss=0.08923, over 4286927.55 frames. ], batch size: 247, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 08:00:09,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=665364.0, ans=0.2 2023-06-20 08:00:42,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.96 vs. limit=10.0 2023-06-20 08:01:23,632 INFO [train.py:996] (2/4) Epoch 4, batch 19450, loss[loss=0.2611, simple_loss=0.3074, pruned_loss=0.1074, over 21397.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3233, pruned_loss=0.09172, over 4287804.88 frames. ], batch size: 473, lr: 7.82e-03, grad_scale: 16.0 2023-06-20 08:01:25,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=665604.0, ans=0.125 2023-06-20 08:01:31,163 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.618e+02 3.250e+02 3.891e+02 5.569e+02, threshold=6.499e+02, percent-clipped=0.0 2023-06-20 08:02:42,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=665784.0, ans=0.125 2023-06-20 08:03:06,373 INFO [train.py:996] (2/4) Epoch 4, batch 19500, loss[loss=0.2319, simple_loss=0.2995, pruned_loss=0.08216, over 21667.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3185, pruned_loss=0.09332, over 4280353.02 frames. ], batch size: 333, lr: 7.81e-03, grad_scale: 16.0 2023-06-20 08:03:10,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=665904.0, ans=0.035 2023-06-20 08:03:56,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=666024.0, ans=0.2 2023-06-20 08:04:21,248 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:04:32,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=666144.0, ans=0.125 2023-06-20 08:04:45,129 INFO [train.py:996] (2/4) Epoch 4, batch 19550, loss[loss=0.1945, simple_loss=0.2497, pruned_loss=0.06968, over 21855.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3133, pruned_loss=0.09055, over 4272244.75 frames. 
], batch size: 107, lr: 7.81e-03, grad_scale: 16.0 2023-06-20 08:04:53,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.307e+02 2.884e+02 3.306e+02 4.120e+02 1.024e+03, threshold=6.612e+02, percent-clipped=7.0 2023-06-20 08:04:59,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=666204.0, ans=0.125 2023-06-20 08:05:26,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=666264.0, ans=0.0 2023-06-20 08:05:58,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=666384.0, ans=0.125 2023-06-20 08:06:00,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=666384.0, ans=0.0 2023-06-20 08:06:21,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=666444.0, ans=0.0 2023-06-20 08:06:29,096 INFO [train.py:996] (2/4) Epoch 4, batch 19600, loss[loss=0.2822, simple_loss=0.3342, pruned_loss=0.1151, over 21937.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3171, pruned_loss=0.09215, over 4276732.98 frames. ], batch size: 316, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:07:30,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=666624.0, ans=0.0 2023-06-20 08:08:07,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=666804.0, ans=0.125 2023-06-20 08:08:08,687 INFO [train.py:996] (2/4) Epoch 4, batch 19650, loss[loss=0.2932, simple_loss=0.3624, pruned_loss=0.1119, over 21783.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3247, pruned_loss=0.09831, over 4285419.75 frames. ], batch size: 124, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:08:17,369 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.087e+02 3.531e+02 4.155e+02 7.951e+02, threshold=7.063e+02, percent-clipped=1.0 2023-06-20 08:08:43,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=666864.0, ans=0.125 2023-06-20 08:09:07,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=666924.0, ans=0.125 2023-06-20 08:09:07,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=666924.0, ans=0.125 2023-06-20 08:10:00,478 INFO [train.py:996] (2/4) Epoch 4, batch 19700, loss[loss=0.2257, simple_loss=0.2817, pruned_loss=0.0849, over 21097.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3287, pruned_loss=0.09855, over 4287858.20 frames. 
], batch size: 143, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:10:14,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=667104.0, ans=0.0 2023-06-20 08:10:16,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=667104.0, ans=0.0 2023-06-20 08:10:17,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=667104.0, ans=0.125 2023-06-20 08:11:16,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=667284.0, ans=0.0 2023-06-20 08:11:17,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=667344.0, ans=0.05 2023-06-20 08:11:22,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=667344.0, ans=0.0 2023-06-20 08:11:39,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=667344.0, ans=0.0 2023-06-20 08:11:45,293 INFO [train.py:996] (2/4) Epoch 4, batch 19750, loss[loss=0.4454, simple_loss=0.4996, pruned_loss=0.1956, over 21457.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3382, pruned_loss=0.1002, over 4285233.94 frames. ], batch size: 507, lr: 7.81e-03, grad_scale: 32.0 2023-06-20 08:12:04,019 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.193e+02 3.734e+02 5.091e+02 8.572e+02, threshold=7.467e+02, percent-clipped=4.0 2023-06-20 08:13:07,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-20 08:13:33,740 INFO [train.py:996] (2/4) Epoch 4, batch 19800, loss[loss=0.2533, simple_loss=0.3314, pruned_loss=0.08758, over 21674.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3372, pruned_loss=0.1008, over 4283369.49 frames. ], batch size: 389, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:13:53,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.92 vs. limit=22.5 2023-06-20 08:14:55,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=667944.0, ans=0.0 2023-06-20 08:15:08,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=667944.0, ans=10.0 2023-06-20 08:15:13,855 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.59 vs. limit=15.0 2023-06-20 08:15:16,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=668004.0, ans=0.125 2023-06-20 08:15:22,745 INFO [train.py:996] (2/4) Epoch 4, batch 19850, loss[loss=0.2216, simple_loss=0.3111, pruned_loss=0.06602, over 21626.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3279, pruned_loss=0.09413, over 4277122.47 frames. 
], batch size: 389, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:15:30,735 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.800e+02 2.674e+02 3.222e+02 4.278e+02 7.795e+02, threshold=6.444e+02, percent-clipped=1.0 2023-06-20 08:15:41,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=668064.0, ans=0.125 2023-06-20 08:15:45,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. limit=22.5 2023-06-20 08:16:01,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-20 08:16:06,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=668124.0, ans=0.07 2023-06-20 08:16:08,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=15.0 2023-06-20 08:16:11,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=668184.0, ans=0.125 2023-06-20 08:16:11,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=668184.0, ans=0.2 2023-06-20 08:16:59,875 INFO [train.py:996] (2/4) Epoch 4, batch 19900, loss[loss=0.2091, simple_loss=0.2687, pruned_loss=0.07479, over 15347.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3276, pruned_loss=0.09115, over 4262220.79 frames. ], batch size: 60, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:17:13,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=668304.0, ans=0.0 2023-06-20 08:17:18,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=668304.0, ans=0.125 2023-06-20 08:17:48,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=668424.0, ans=0.0 2023-06-20 08:18:47,738 INFO [train.py:996] (2/4) Epoch 4, batch 19950, loss[loss=0.2527, simple_loss=0.311, pruned_loss=0.09725, over 21777.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3206, pruned_loss=0.09107, over 4270165.61 frames. ], batch size: 351, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:18:50,895 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=15.0 2023-06-20 08:18:55,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=668604.0, ans=0.125 2023-06-20 08:18:56,265 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.748e+02 3.104e+02 3.905e+02 6.692e+02, threshold=6.208e+02, percent-clipped=1.0 2023-06-20 08:20:13,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=668844.0, ans=0.125 2023-06-20 08:20:19,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=668844.0, ans=0.125 2023-06-20 08:20:26,109 INFO [train.py:996] (2/4) Epoch 4, batch 20000, loss[loss=0.2453, simple_loss=0.3287, pruned_loss=0.08095, over 19818.00 frames. ], tot_loss[loss=0.253, simple_loss=0.323, pruned_loss=0.09145, over 4270367.13 frames. ], batch size: 702, lr: 7.80e-03, grad_scale: 32.0 2023-06-20 08:20:46,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=668964.0, ans=0.2 2023-06-20 08:22:05,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=669144.0, ans=0.125 2023-06-20 08:22:13,392 INFO [train.py:996] (2/4) Epoch 4, batch 20050, loss[loss=0.2669, simple_loss=0.33, pruned_loss=0.1019, over 21859.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.326, pruned_loss=0.09466, over 4281796.96 frames. ], batch size: 282, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 08:22:21,029 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 2.833e+02 3.313e+02 4.048e+02 6.603e+02, threshold=6.626e+02, percent-clipped=1.0 2023-06-20 08:22:30,002 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:22:39,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=669264.0, ans=0.125 2023-06-20 08:22:48,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=669324.0, ans=0.0 2023-06-20 08:22:54,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=669324.0, ans=0.125 2023-06-20 08:23:57,200 INFO [train.py:996] (2/4) Epoch 4, batch 20100, loss[loss=0.267, simple_loss=0.3463, pruned_loss=0.09385, over 21385.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3289, pruned_loss=0.09736, over 4292200.16 frames. 
], batch size: 194, lr: 7.79e-03, grad_scale: 32.0 2023-06-20 08:24:32,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=669564.0, ans=0.0 2023-06-20 08:25:00,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=669624.0, ans=0.125 2023-06-20 08:25:03,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=669684.0, ans=0.125 2023-06-20 08:25:25,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=669744.0, ans=0.125 2023-06-20 08:25:27,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=669744.0, ans=0.0 2023-06-20 08:25:40,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=669744.0, ans=0.0 2023-06-20 08:25:42,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-20 08:25:42,893 INFO [train.py:996] (2/4) Epoch 4, batch 20150, loss[loss=0.3097, simple_loss=0.3783, pruned_loss=0.1205, over 21589.00 frames. ], tot_loss[loss=0.2714, simple_loss=0.3398, pruned_loss=0.1015, over 4293449.96 frames. ], batch size: 389, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:25:53,574 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.395e+02 3.883e+02 5.021e+02 8.143e+02, threshold=7.766e+02, percent-clipped=4.0 2023-06-20 08:26:34,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=669924.0, ans=0.125 2023-06-20 08:26:37,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=669924.0, ans=0.1 2023-06-20 08:26:39,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=669924.0, ans=0.125 2023-06-20 08:27:19,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=670044.0, ans=0.125 2023-06-20 08:27:29,942 INFO [train.py:996] (2/4) Epoch 4, batch 20200, loss[loss=0.2564, simple_loss=0.3364, pruned_loss=0.08822, over 21781.00 frames. ], tot_loss[loss=0.2783, simple_loss=0.3465, pruned_loss=0.1051, over 4289018.46 frames. ], batch size: 282, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:27:45,590 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:28:17,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=670224.0, ans=0.0 2023-06-20 08:28:59,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=22.5 2023-06-20 08:29:18,117 INFO [train.py:996] (2/4) Epoch 4, batch 20250, loss[loss=0.252, simple_loss=0.3168, pruned_loss=0.09356, over 21298.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3466, pruned_loss=0.1032, over 4279797.68 frames. 
], batch size: 143, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:29:33,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.329e+02 3.002e+02 3.510e+02 4.411e+02 6.052e+02, threshold=7.021e+02, percent-clipped=0.0 2023-06-20 08:30:38,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=670644.0, ans=0.0 2023-06-20 08:30:56,374 INFO [train.py:996] (2/4) Epoch 4, batch 20300, loss[loss=0.3282, simple_loss=0.398, pruned_loss=0.1291, over 21473.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3429, pruned_loss=0.09898, over 4279588.62 frames. ], batch size: 471, lr: 7.79e-03, grad_scale: 16.0 2023-06-20 08:31:09,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=670704.0, ans=0.0 2023-06-20 08:32:22,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=670944.0, ans=0.0 2023-06-20 08:32:34,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=671004.0, ans=0.1 2023-06-20 08:32:35,120 INFO [train.py:996] (2/4) Epoch 4, batch 20350, loss[loss=0.29, simple_loss=0.3493, pruned_loss=0.1153, over 21815.00 frames. ], tot_loss[loss=0.2701, simple_loss=0.3423, pruned_loss=0.09894, over 4265941.26 frames. ], batch size: 351, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:32:45,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=671004.0, ans=0.0 2023-06-20 08:32:49,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.818e+02 3.132e+02 3.924e+02 7.054e+02, threshold=6.264e+02, percent-clipped=1.0 2023-06-20 08:34:03,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-20 08:34:04,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=671244.0, ans=0.0 2023-06-20 08:34:21,811 INFO [train.py:996] (2/4) Epoch 4, batch 20400, loss[loss=0.3031, simple_loss=0.3512, pruned_loss=0.1275, over 21229.00 frames. ], tot_loss[loss=0.2721, simple_loss=0.3429, pruned_loss=0.1006, over 4254914.52 frames. 
], batch size: 143, lr: 7.78e-03, grad_scale: 32.0 2023-06-20 08:35:22,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=671484.0, ans=0.0 2023-06-20 08:35:24,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=671484.0, ans=0.04949747468305833 2023-06-20 08:35:25,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=671484.0, ans=0.1 2023-06-20 08:35:27,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=671484.0, ans=0.04949747468305833 2023-06-20 08:35:55,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=671544.0, ans=0.09899494936611666 2023-06-20 08:35:55,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=671544.0, ans=0.125 2023-06-20 08:36:04,147 INFO [train.py:996] (2/4) Epoch 4, batch 20450, loss[loss=0.2919, simple_loss=0.3478, pruned_loss=0.118, over 21581.00 frames. ], tot_loss[loss=0.2763, simple_loss=0.3449, pruned_loss=0.1039, over 4245676.83 frames. ], batch size: 471, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:36:06,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-06-20 08:36:20,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.947e+02 3.456e+02 4.255e+02 7.158e+02, threshold=6.912e+02, percent-clipped=2.0 2023-06-20 08:36:40,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=671724.0, ans=0.1 2023-06-20 08:36:53,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=671724.0, ans=0.0 2023-06-20 08:37:44,310 INFO [train.py:996] (2/4) Epoch 4, batch 20500, loss[loss=0.2444, simple_loss=0.3013, pruned_loss=0.09373, over 21668.00 frames. ], tot_loss[loss=0.276, simple_loss=0.3418, pruned_loss=0.1051, over 4262205.84 frames. ], batch size: 231, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:38:54,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=672084.0, ans=0.1 2023-06-20 08:38:59,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=672084.0, ans=10.0 2023-06-20 08:39:02,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=672144.0, ans=0.125 2023-06-20 08:39:28,499 INFO [train.py:996] (2/4) Epoch 4, batch 20550, loss[loss=0.3234, simple_loss=0.4222, pruned_loss=0.1123, over 19783.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3349, pruned_loss=0.1033, over 4249615.63 frames. 
], batch size: 702, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:39:38,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=672204.0, ans=0.1 2023-06-20 08:39:45,986 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.765e+02 3.154e+02 3.666e+02 5.388e+02, threshold=6.309e+02, percent-clipped=0.0 2023-06-20 08:40:35,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=672384.0, ans=0.125 2023-06-20 08:41:12,481 INFO [train.py:996] (2/4) Epoch 4, batch 20600, loss[loss=0.2295, simple_loss=0.3103, pruned_loss=0.07436, over 19986.00 frames. ], tot_loss[loss=0.2705, simple_loss=0.3376, pruned_loss=0.1017, over 4238594.64 frames. ], batch size: 703, lr: 7.78e-03, grad_scale: 16.0 2023-06-20 08:42:26,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=672684.0, ans=0.125 2023-06-20 08:42:40,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.15 vs. limit=5.0 2023-06-20 08:42:55,709 INFO [train.py:996] (2/4) Epoch 4, batch 20650, loss[loss=0.2572, simple_loss=0.3096, pruned_loss=0.1024, over 21600.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3327, pruned_loss=0.1013, over 4239111.94 frames. ], batch size: 263, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:43:12,511 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.784e+02 3.361e+02 3.771e+02 8.301e+02, threshold=6.721e+02, percent-clipped=1.0 2023-06-20 08:43:24,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=672864.0, ans=0.0 2023-06-20 08:44:35,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=673044.0, ans=0.04949747468305833 2023-06-20 08:44:45,626 INFO [train.py:996] (2/4) Epoch 4, batch 20700, loss[loss=0.3464, simple_loss=0.4019, pruned_loss=0.1454, over 21502.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3252, pruned_loss=0.09735, over 4240363.60 frames. ], batch size: 508, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:44:48,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=673104.0, ans=0.125 2023-06-20 08:44:49,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=673104.0, ans=0.125 2023-06-20 08:44:56,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=673104.0, ans=0.09899494936611666 2023-06-20 08:45:19,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.85 vs. 
limit=15.0 2023-06-20 08:45:26,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=673224.0, ans=0.0 2023-06-20 08:45:32,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=673224.0, ans=0.0 2023-06-20 08:46:23,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=673344.0, ans=0.0 2023-06-20 08:46:31,832 INFO [train.py:996] (2/4) Epoch 4, batch 20750, loss[loss=0.2927, simple_loss=0.3688, pruned_loss=0.1083, over 21464.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3257, pruned_loss=0.09586, over 4244201.88 frames. ], batch size: 211, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:46:48,691 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 2.639e+02 3.184e+02 4.013e+02 6.063e+02, threshold=6.368e+02, percent-clipped=0.0 2023-06-20 08:46:53,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=673464.0, ans=0.125 2023-06-20 08:46:56,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=673464.0, ans=0.015 2023-06-20 08:47:59,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=673644.0, ans=0.1 2023-06-20 08:48:15,633 INFO [train.py:996] (2/4) Epoch 4, batch 20800, loss[loss=0.289, simple_loss=0.3369, pruned_loss=0.1206, over 21468.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3331, pruned_loss=0.09845, over 4245222.64 frames. ], batch size: 441, lr: 7.77e-03, grad_scale: 32.0 2023-06-20 08:48:16,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=673704.0, ans=0.125 2023-06-20 08:48:31,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=673704.0, ans=0.125 2023-06-20 08:49:57,088 INFO [train.py:996] (2/4) Epoch 4, batch 20850, loss[loss=0.2363, simple_loss=0.2962, pruned_loss=0.08822, over 21620.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.325, pruned_loss=0.09602, over 4243847.82 frames. 
], batch size: 230, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:50:04,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=674004.0, ans=0.125 2023-06-20 08:50:15,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.828e+02 3.362e+02 3.995e+02 7.673e+02, threshold=6.724e+02, percent-clipped=2.0 2023-06-20 08:50:15,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=674004.0, ans=0.1 2023-06-20 08:50:18,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=674064.0, ans=0.05 2023-06-20 08:51:08,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=674184.0, ans=0.125 2023-06-20 08:51:32,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=674244.0, ans=0.125 2023-06-20 08:51:39,466 INFO [train.py:996] (2/4) Epoch 4, batch 20900, loss[loss=0.2294, simple_loss=0.2793, pruned_loss=0.0898, over 16281.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3248, pruned_loss=0.09664, over 4245002.44 frames. ], batch size: 62, lr: 7.77e-03, grad_scale: 16.0 2023-06-20 08:51:43,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=674304.0, ans=0.125 2023-06-20 08:52:03,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=674364.0, ans=0.05 2023-06-20 08:52:35,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=674484.0, ans=0.125 2023-06-20 08:52:42,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=674484.0, ans=0.2 2023-06-20 08:53:21,032 INFO [train.py:996] (2/4) Epoch 4, batch 20950, loss[loss=0.1808, simple_loss=0.262, pruned_loss=0.0498, over 21574.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3214, pruned_loss=0.09248, over 4240528.97 frames. ], batch size: 212, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:53:21,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=674604.0, ans=0.1 2023-06-20 08:53:33,589 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.815e+02 3.248e+02 3.950e+02 6.519e+02, threshold=6.496e+02, percent-clipped=0.0 2023-06-20 08:53:55,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=674664.0, ans=0.0 2023-06-20 08:54:20,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=674784.0, ans=0.0 2023-06-20 08:54:57,010 INFO [train.py:996] (2/4) Epoch 4, batch 21000, loss[loss=0.2572, simple_loss=0.3222, pruned_loss=0.09608, over 21865.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3185, pruned_loss=0.09234, over 4248391.81 frames. ], batch size: 351, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:54:57,025 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 08:55:14,694 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2759, simple_loss=0.3744, pruned_loss=0.08874, over 1796401.00 frames. 
2023-06-20 08:55:14,695 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 08:55:36,174 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:56:51,324 INFO [train.py:996] (2/4) Epoch 4, batch 21050, loss[loss=0.2159, simple_loss=0.2719, pruned_loss=0.07994, over 21475.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3171, pruned_loss=0.09274, over 4247628.58 frames. ], batch size: 212, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:57:04,080 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.634e+02 3.115e+02 4.220e+02 7.961e+02, threshold=6.229e+02, percent-clipped=3.0 2023-06-20 08:57:23,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=675264.0, ans=0.0 2023-06-20 08:57:25,055 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-20 08:57:28,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=675324.0, ans=0.1 2023-06-20 08:57:31,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=675324.0, ans=0.125 2023-06-20 08:58:32,583 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 08:58:33,734 INFO [train.py:996] (2/4) Epoch 4, batch 21100, loss[loss=0.2822, simple_loss=0.3273, pruned_loss=0.1186, over 21429.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3139, pruned_loss=0.09223, over 4248761.02 frames. ], batch size: 441, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 08:59:05,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=675564.0, ans=0.1 2023-06-20 08:59:41,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=675684.0, ans=0.125 2023-06-20 09:00:16,578 INFO [train.py:996] (2/4) Epoch 4, batch 21150, loss[loss=0.2262, simple_loss=0.2795, pruned_loss=0.08644, over 21876.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3095, pruned_loss=0.09217, over 4252117.63 frames. ], batch size: 107, lr: 7.76e-03, grad_scale: 16.0 2023-06-20 09:00:29,291 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.738e+02 3.201e+02 4.018e+02 7.456e+02, threshold=6.402e+02, percent-clipped=2.0 2023-06-20 09:01:58,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=676104.0, ans=0.2 2023-06-20 09:01:59,528 INFO [train.py:996] (2/4) Epoch 4, batch 21200, loss[loss=0.2035, simple_loss=0.2728, pruned_loss=0.06705, over 21588.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3054, pruned_loss=0.09148, over 4248336.58 frames. ], batch size: 263, lr: 7.76e-03, grad_scale: 32.0 2023-06-20 09:03:23,422 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:03:44,329 INFO [train.py:996] (2/4) Epoch 4, batch 21250, loss[loss=0.2326, simple_loss=0.2908, pruned_loss=0.08724, over 21435.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3043, pruned_loss=0.09141, over 4247862.40 frames. 
], batch size: 212, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:04:02,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.827e+02 3.407e+02 4.227e+02 7.586e+02, threshold=6.813e+02, percent-clipped=3.0 2023-06-20 09:04:44,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=676524.0, ans=0.1 2023-06-20 09:05:00,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=676584.0, ans=0.125 2023-06-20 09:05:21,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=676644.0, ans=0.07 2023-06-20 09:05:27,295 INFO [train.py:996] (2/4) Epoch 4, batch 21300, loss[loss=0.2603, simple_loss=0.3285, pruned_loss=0.09602, over 21869.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3111, pruned_loss=0.09397, over 4257516.35 frames. ], batch size: 391, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:05:31,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=676704.0, ans=0.015 2023-06-20 09:06:50,937 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.26 vs. limit=15.0 2023-06-20 09:07:06,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=676944.0, ans=0.125 2023-06-20 09:07:11,225 INFO [train.py:996] (2/4) Epoch 4, batch 21350, loss[loss=0.2698, simple_loss=0.3492, pruned_loss=0.09523, over 21363.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3144, pruned_loss=0.09429, over 4263510.60 frames. ], batch size: 548, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:07:29,013 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.57 vs. limit=15.0 2023-06-20 09:07:29,389 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.939e+02 3.451e+02 4.084e+02 6.160e+02, threshold=6.901e+02, percent-clipped=0.0 2023-06-20 09:07:50,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.25 vs. limit=12.0 2023-06-20 09:08:02,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=677124.0, ans=0.015 2023-06-20 09:08:04,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=677124.0, ans=0.125 2023-06-20 09:08:21,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=677184.0, ans=0.2 2023-06-20 09:08:54,434 INFO [train.py:996] (2/4) Epoch 4, batch 21400, loss[loss=0.2528, simple_loss=0.3284, pruned_loss=0.08857, over 21745.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3192, pruned_loss=0.09515, over 4272245.45 frames. ], batch size: 247, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:09:02,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.49 vs. 
limit=12.0 2023-06-20 09:09:43,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=677424.0, ans=0.125 2023-06-20 09:09:54,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=677484.0, ans=0.125 2023-06-20 09:10:05,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=22.5 2023-06-20 09:10:07,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=677484.0, ans=0.125 2023-06-20 09:10:19,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=677544.0, ans=0.0 2023-06-20 09:10:31,826 INFO [train.py:996] (2/4) Epoch 4, batch 21450, loss[loss=0.3026, simple_loss=0.3562, pruned_loss=0.1244, over 21527.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3224, pruned_loss=0.09632, over 4277644.76 frames. ], batch size: 548, lr: 7.75e-03, grad_scale: 32.0 2023-06-20 09:10:49,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.043e+02 2.725e+02 3.459e+02 4.369e+02 7.075e+02, threshold=6.919e+02, percent-clipped=1.0 2023-06-20 09:11:15,016 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-20 09:11:17,960 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:11:53,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=677784.0, ans=0.2 2023-06-20 09:12:13,318 INFO [train.py:996] (2/4) Epoch 4, batch 21500, loss[loss=0.2502, simple_loss=0.293, pruned_loss=0.1036, over 21654.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3218, pruned_loss=0.09772, over 4279273.07 frames. ], batch size: 247, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:12:26,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=15.0 2023-06-20 09:12:32,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.78 vs. limit=10.0 2023-06-20 09:12:34,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=677964.0, ans=0.125 2023-06-20 09:12:40,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=677964.0, ans=0.125 2023-06-20 09:12:56,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=677964.0, ans=0.125 2023-06-20 09:13:23,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=678084.0, ans=0.0 2023-06-20 09:13:56,032 INFO [train.py:996] (2/4) Epoch 4, batch 21550, loss[loss=0.2164, simple_loss=0.265, pruned_loss=0.08392, over 20835.00 frames. ], tot_loss[loss=0.251, simple_loss=0.314, pruned_loss=0.09403, over 4280775.07 frames. 
], batch size: 608, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:14:11,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=678204.0, ans=0.0 2023-06-20 09:14:14,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.699e+02 3.223e+02 3.892e+02 7.035e+02, threshold=6.447e+02, percent-clipped=1.0 2023-06-20 09:14:16,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=678264.0, ans=0.0 2023-06-20 09:14:59,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.82 vs. limit=5.0 2023-06-20 09:15:23,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.24 vs. limit=22.5 2023-06-20 09:15:45,556 INFO [train.py:996] (2/4) Epoch 4, batch 21600, loss[loss=0.2316, simple_loss=0.3283, pruned_loss=0.06743, over 21258.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3137, pruned_loss=0.09289, over 4266245.38 frames. ], batch size: 549, lr: 7.74e-03, grad_scale: 32.0 2023-06-20 09:16:29,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=678624.0, ans=0.125 2023-06-20 09:16:37,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=678624.0, ans=0.125 2023-06-20 09:16:55,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=678684.0, ans=0.125 2023-06-20 09:17:27,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=678744.0, ans=0.125 2023-06-20 09:17:30,272 INFO [train.py:996] (2/4) Epoch 4, batch 21650, loss[loss=0.2306, simple_loss=0.3089, pruned_loss=0.07615, over 21869.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3158, pruned_loss=0.09031, over 4272341.68 frames. ], batch size: 118, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:17:35,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=678804.0, ans=0.125 2023-06-20 09:17:53,975 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.809e+02 3.238e+02 3.645e+02 6.702e+02, threshold=6.475e+02, percent-clipped=1.0 2023-06-20 09:17:55,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=678864.0, ans=0.125 2023-06-20 09:17:57,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=678864.0, ans=0.1 2023-06-20 09:18:05,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=678864.0, ans=0.0 2023-06-20 09:18:39,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.36 vs. limit=6.0 2023-06-20 09:19:01,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=679044.0, ans=0.1 2023-06-20 09:19:05,622 INFO [train.py:996] (2/4) Epoch 4, batch 21700, loss[loss=0.2301, simple_loss=0.2954, pruned_loss=0.08243, over 21332.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3158, pruned_loss=0.08787, over 4273340.41 frames. 
], batch size: 131, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:20:34,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=679344.0, ans=0.0 2023-06-20 09:20:47,280 INFO [train.py:996] (2/4) Epoch 4, batch 21750, loss[loss=0.2382, simple_loss=0.2878, pruned_loss=0.09434, over 21276.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3115, pruned_loss=0.08861, over 4266794.61 frames. ], batch size: 144, lr: 7.74e-03, grad_scale: 16.0 2023-06-20 09:21:12,014 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.936e+02 2.575e+02 3.289e+02 4.229e+02 7.703e+02, threshold=6.577e+02, percent-clipped=2.0 2023-06-20 09:21:28,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.59 vs. limit=15.0 2023-06-20 09:21:41,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=15.0 2023-06-20 09:21:42,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=679524.0, ans=0.125 2023-06-20 09:21:46,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.40 vs. limit=15.0 2023-06-20 09:22:02,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=679584.0, ans=0.0 2023-06-20 09:22:31,505 INFO [train.py:996] (2/4) Epoch 4, batch 21800, loss[loss=0.2694, simple_loss=0.3392, pruned_loss=0.09978, over 21853.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3097, pruned_loss=0.09067, over 4257933.71 frames. ], batch size: 373, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:23:12,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=679764.0, ans=0.125 2023-06-20 09:23:13,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=679764.0, ans=0.2 2023-06-20 09:23:45,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-20 09:24:04,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=679944.0, ans=0.125 2023-06-20 09:24:19,771 INFO [train.py:996] (2/4) Epoch 4, batch 21850, loss[loss=0.2236, simple_loss=0.2908, pruned_loss=0.07819, over 19929.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3161, pruned_loss=0.09148, over 4261201.22 frames. ], batch size: 702, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:24:39,344 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.758e+02 3.560e+02 4.592e+02 6.859e+02, threshold=7.120e+02, percent-clipped=3.0 2023-06-20 09:24:59,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.39 vs. 
limit=22.5 2023-06-20 09:25:01,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=680124.0, ans=0.0 2023-06-20 09:25:25,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=680184.0, ans=0.2 2023-06-20 09:26:00,384 INFO [train.py:996] (2/4) Epoch 4, batch 21900, loss[loss=0.248, simple_loss=0.3056, pruned_loss=0.09525, over 21695.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3166, pruned_loss=0.09252, over 4269757.35 frames. ], batch size: 298, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:26:56,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=680424.0, ans=0.1 2023-06-20 09:27:24,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=680544.0, ans=0.125 2023-06-20 09:27:42,103 INFO [train.py:996] (2/4) Epoch 4, batch 21950, loss[loss=0.2329, simple_loss=0.287, pruned_loss=0.08947, over 21271.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3109, pruned_loss=0.09105, over 4272400.71 frames. ], batch size: 144, lr: 7.73e-03, grad_scale: 16.0 2023-06-20 09:28:06,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.906e+02 2.677e+02 3.027e+02 3.757e+02 5.142e+02, threshold=6.054e+02, percent-clipped=0.0 2023-06-20 09:28:43,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=680724.0, ans=0.125 2023-06-20 09:28:43,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=680724.0, ans=0.2 2023-06-20 09:29:24,081 INFO [train.py:996] (2/4) Epoch 4, batch 22000, loss[loss=0.2395, simple_loss=0.2952, pruned_loss=0.09193, over 21275.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.304, pruned_loss=0.0866, over 4259871.16 frames. ], batch size: 159, lr: 7.73e-03, grad_scale: 32.0 2023-06-20 09:29:26,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=680904.0, ans=0.0 2023-06-20 09:29:29,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=680904.0, ans=0.0 2023-06-20 09:29:41,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=680904.0, ans=0.04949747468305833 2023-06-20 09:30:05,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-06-20 09:30:11,938 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.00 vs. limit=6.0 2023-06-20 09:30:38,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-20 09:31:13,127 INFO [train.py:996] (2/4) Epoch 4, batch 22050, loss[loss=0.2655, simple_loss=0.3466, pruned_loss=0.09218, over 21594.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3103, pruned_loss=0.0901, over 4260539.84 frames. 
], batch size: 263, lr: 7.73e-03, grad_scale: 32.0 2023-06-20 09:31:33,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.650e+02 3.249e+02 4.022e+02 7.710e+02, threshold=6.498e+02, percent-clipped=6.0 2023-06-20 09:31:38,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-20 09:31:43,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-20 09:32:09,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=681324.0, ans=0.125 2023-06-20 09:32:57,314 INFO [train.py:996] (2/4) Epoch 4, batch 22100, loss[loss=0.3037, simple_loss=0.3581, pruned_loss=0.1247, over 21788.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3245, pruned_loss=0.09648, over 4261181.63 frames. ], batch size: 351, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:33:30,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=681564.0, ans=0.0 2023-06-20 09:33:37,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-06-20 09:34:38,862 INFO [train.py:996] (2/4) Epoch 4, batch 22150, loss[loss=0.2649, simple_loss=0.3305, pruned_loss=0.09963, over 21859.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.327, pruned_loss=0.09791, over 4265609.69 frames. ], batch size: 332, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:34:57,909 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 3.000e+02 3.535e+02 4.245e+02 7.467e+02, threshold=7.071e+02, percent-clipped=4.0 2023-06-20 09:35:03,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=681864.0, ans=0.0 2023-06-20 09:35:28,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=681924.0, ans=0.125 2023-06-20 09:35:48,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-20 09:36:02,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=682044.0, ans=0.0 2023-06-20 09:36:21,226 INFO [train.py:996] (2/4) Epoch 4, batch 22200, loss[loss=0.3613, simple_loss=0.4219, pruned_loss=0.1503, over 21646.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3287, pruned_loss=0.09921, over 4278597.50 frames. ], batch size: 508, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:36:22,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=22.5 2023-06-20 09:36:36,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=682104.0, ans=0.0 2023-06-20 09:36:44,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. 
limit=15.0 2023-06-20 09:37:37,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=682284.0, ans=0.2 2023-06-20 09:38:08,614 INFO [train.py:996] (2/4) Epoch 4, batch 22250, loss[loss=0.3273, simple_loss=0.3988, pruned_loss=0.1279, over 21800.00 frames. ], tot_loss[loss=0.2694, simple_loss=0.3363, pruned_loss=0.1012, over 4282169.66 frames. ], batch size: 118, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:38:23,010 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.823e+02 3.649e+02 4.510e+02 8.047e+02, threshold=7.298e+02, percent-clipped=1.0 2023-06-20 09:38:32,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=682464.0, ans=0.125 2023-06-20 09:38:58,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=682524.0, ans=0.1 2023-06-20 09:39:49,764 INFO [train.py:996] (2/4) Epoch 4, batch 22300, loss[loss=0.2895, simple_loss=0.3533, pruned_loss=0.1128, over 21861.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3374, pruned_loss=0.1031, over 4281891.01 frames. ], batch size: 107, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:39:50,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=682704.0, ans=0.0 2023-06-20 09:40:19,536 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:40:32,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-20 09:40:37,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=682824.0, ans=0.125 2023-06-20 09:40:39,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=682824.0, ans=0.0 2023-06-20 09:40:47,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=682884.0, ans=0.2 2023-06-20 09:41:23,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=682944.0, ans=0.0 2023-06-20 09:41:30,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=683004.0, ans=0.125 2023-06-20 09:41:31,586 INFO [train.py:996] (2/4) Epoch 4, batch 22350, loss[loss=0.3201, simple_loss=0.3638, pruned_loss=0.1382, over 21773.00 frames. ], tot_loss[loss=0.2716, simple_loss=0.3354, pruned_loss=0.1039, over 4295199.65 frames. ], batch size: 441, lr: 7.72e-03, grad_scale: 32.0 2023-06-20 09:41:32,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=683004.0, ans=0.0 2023-06-20 09:41:34,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. 
limit=15.0 2023-06-20 09:41:42,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=683004.0, ans=0.125 2023-06-20 09:41:46,594 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.465e+02 3.048e+02 3.517e+02 4.689e+02 8.292e+02, threshold=7.034e+02, percent-clipped=3.0 2023-06-20 09:41:58,070 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 09:42:19,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=683124.0, ans=0.1 2023-06-20 09:42:45,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=683184.0, ans=0.125 2023-06-20 09:43:16,283 INFO [train.py:996] (2/4) Epoch 4, batch 22400, loss[loss=0.254, simple_loss=0.3121, pruned_loss=0.09792, over 21315.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.331, pruned_loss=0.0994, over 4292176.38 frames. ], batch size: 211, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:43:20,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=683304.0, ans=0.0 2023-06-20 09:43:23,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=683304.0, ans=0.125 2023-06-20 09:43:31,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=683304.0, ans=0.04949747468305833 2023-06-20 09:43:38,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=683364.0, ans=0.1 2023-06-20 09:43:45,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-20 09:43:53,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=683424.0, ans=0.0 2023-06-20 09:44:58,400 INFO [train.py:996] (2/4) Epoch 4, batch 22450, loss[loss=0.2451, simple_loss=0.2903, pruned_loss=0.09991, over 21335.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3251, pruned_loss=0.09839, over 4274930.13 frames. ], batch size: 473, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:45:18,328 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.975e+02 2.635e+02 3.000e+02 3.519e+02 5.856e+02, threshold=6.001e+02, percent-clipped=0.0 2023-06-20 09:46:48,165 INFO [train.py:996] (2/4) Epoch 4, batch 22500, loss[loss=0.2787, simple_loss=0.3721, pruned_loss=0.09263, over 21258.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.32, pruned_loss=0.09721, over 4276810.61 frames. ], batch size: 549, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:47:16,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=683964.0, ans=0.0 2023-06-20 09:48:30,873 INFO [train.py:996] (2/4) Epoch 4, batch 22550, loss[loss=0.2292, simple_loss=0.3014, pruned_loss=0.07853, over 21843.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3208, pruned_loss=0.09692, over 4277341.78 frames. 
], batch size: 282, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:48:31,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=684204.0, ans=0.1 2023-06-20 09:48:45,430 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 2.905e+02 3.340e+02 4.222e+02 9.344e+02, threshold=6.680e+02, percent-clipped=7.0 2023-06-20 09:48:50,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2023-06-20 09:48:51,958 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.57 vs. limit=15.0 2023-06-20 09:49:40,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=684384.0, ans=0.125 2023-06-20 09:49:43,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-20 09:50:05,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=684444.0, ans=0.0 2023-06-20 09:50:14,903 INFO [train.py:996] (2/4) Epoch 4, batch 22600, loss[loss=0.2339, simple_loss=0.3156, pruned_loss=0.07608, over 21076.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3234, pruned_loss=0.09666, over 4285139.95 frames. ], batch size: 607, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:50:19,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=684504.0, ans=0.2 2023-06-20 09:50:19,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=684504.0, ans=0.0 2023-06-20 09:50:53,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=684564.0, ans=0.2 2023-06-20 09:51:13,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=684624.0, ans=0.05 2023-06-20 09:51:36,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=684684.0, ans=0.0 2023-06-20 09:51:45,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-20 09:51:47,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=684744.0, ans=0.1 2023-06-20 09:51:57,260 INFO [train.py:996] (2/4) Epoch 4, batch 22650, loss[loss=0.2289, simple_loss=0.2852, pruned_loss=0.0863, over 21260.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3211, pruned_loss=0.09599, over 4271658.60 frames. 
], batch size: 548, lr: 7.71e-03, grad_scale: 32.0 2023-06-20 09:52:12,043 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.311e+02 3.814e+02 4.832e+02 8.626e+02, threshold=7.628e+02, percent-clipped=4.0 2023-06-20 09:52:26,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=684864.0, ans=0.0 2023-06-20 09:52:39,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=684924.0, ans=0.125 2023-06-20 09:52:47,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=684924.0, ans=0.125 2023-06-20 09:53:25,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=22.5 2023-06-20 09:53:39,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=685104.0, ans=0.125 2023-06-20 09:53:40,938 INFO [train.py:996] (2/4) Epoch 4, batch 22700, loss[loss=0.2473, simple_loss=0.2955, pruned_loss=0.09956, over 21879.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3163, pruned_loss=0.09666, over 4275308.71 frames. ], batch size: 373, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:54:04,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=685164.0, ans=10.0 2023-06-20 09:54:13,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=685164.0, ans=0.125 2023-06-20 09:54:51,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=685284.0, ans=0.2 2023-06-20 09:55:12,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=685344.0, ans=0.125 2023-06-20 09:55:16,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=685344.0, ans=0.0 2023-06-20 09:55:23,803 INFO [train.py:996] (2/4) Epoch 4, batch 22750, loss[loss=0.2907, simple_loss=0.3379, pruned_loss=0.1218, over 20761.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3162, pruned_loss=0.09794, over 4267447.74 frames. ], batch size: 607, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:55:24,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=685404.0, ans=0.125 2023-06-20 09:55:43,092 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.761e+02 3.069e+02 3.774e+02 7.547e+02, threshold=6.137e+02, percent-clipped=0.0 2023-06-20 09:55:44,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-20 09:55:50,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-20 09:56:12,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=685524.0, ans=0.0 2023-06-20 09:56:22,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.11 vs. 
limit=12.0 2023-06-20 09:56:43,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=685584.0, ans=0.0 2023-06-20 09:57:05,741 INFO [train.py:996] (2/4) Epoch 4, batch 22800, loss[loss=0.2626, simple_loss=0.3241, pruned_loss=0.1006, over 21862.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3217, pruned_loss=0.1009, over 4277484.53 frames. ], batch size: 333, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:57:50,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=685824.0, ans=0.125 2023-06-20 09:57:52,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=685824.0, ans=0.0 2023-06-20 09:58:00,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-20 09:58:33,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=685944.0, ans=0.125 2023-06-20 09:58:47,363 INFO [train.py:996] (2/4) Epoch 4, batch 22850, loss[loss=0.2359, simple_loss=0.2874, pruned_loss=0.09217, over 21697.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.318, pruned_loss=0.09975, over 4276639.71 frames. ], batch size: 283, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 09:58:49,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=686004.0, ans=0.125 2023-06-20 09:58:51,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-20 09:58:55,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=686004.0, ans=0.015 2023-06-20 09:59:07,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.174e+02 3.905e+02 4.697e+02 7.560e+02, threshold=7.810e+02, percent-clipped=8.0 2023-06-20 09:59:12,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=686064.0, ans=0.0 2023-06-20 09:59:15,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=686064.0, ans=0.0 2023-06-20 09:59:55,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=686184.0, ans=0.05 2023-06-20 10:00:20,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=686244.0, ans=0.125 2023-06-20 10:00:25,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=686244.0, ans=0.1 2023-06-20 10:00:31,665 INFO [train.py:996] (2/4) Epoch 4, batch 22900, loss[loss=0.2474, simple_loss=0.3307, pruned_loss=0.08202, over 21422.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3202, pruned_loss=0.09876, over 4266077.43 frames. ], batch size: 211, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 10:00:35,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=686304.0, ans=0.0 2023-06-20 10:02:22,379 INFO [train.py:996] (2/4) Epoch 4, batch 22950, loss[loss=0.2741, simple_loss=0.4027, pruned_loss=0.07269, over 21284.00 frames. 
], tot_loss[loss=0.2635, simple_loss=0.3332, pruned_loss=0.09692, over 4263089.56 frames. ], batch size: 548, lr: 7.70e-03, grad_scale: 32.0 2023-06-20 10:02:34,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-20 10:02:41,647 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.062e+02 3.428e+02 4.438e+02 8.217e+02, threshold=6.855e+02, percent-clipped=1.0 2023-06-20 10:03:05,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=686724.0, ans=0.2 2023-06-20 10:03:46,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=686844.0, ans=0.2 2023-06-20 10:03:58,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.41 vs. limit=15.0 2023-06-20 10:04:02,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.52 vs. limit=6.0 2023-06-20 10:04:02,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.91 vs. limit=15.0 2023-06-20 10:04:09,142 INFO [train.py:996] (2/4) Epoch 4, batch 23000, loss[loss=0.2295, simple_loss=0.2936, pruned_loss=0.08267, over 21832.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3321, pruned_loss=0.0942, over 4270236.09 frames. ], batch size: 282, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:04:41,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-20 10:04:43,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.73 vs. limit=15.0 2023-06-20 10:04:58,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=687024.0, ans=0.2 2023-06-20 10:05:03,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=687024.0, ans=0.0 2023-06-20 10:05:15,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=687084.0, ans=0.125 2023-06-20 10:05:49,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=687144.0, ans=0.1 2023-06-20 10:05:52,772 INFO [train.py:996] (2/4) Epoch 4, batch 23050, loss[loss=0.2898, simple_loss=0.3459, pruned_loss=0.1168, over 21336.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3338, pruned_loss=0.09675, over 4272368.19 frames. 
], batch size: 159, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:06:12,284 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.089e+02 2.843e+02 3.308e+02 4.395e+02 9.677e+02, threshold=6.616e+02, percent-clipped=9.0 2023-06-20 10:06:53,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=687384.0, ans=0.125 2023-06-20 10:07:25,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=687444.0, ans=0.125 2023-06-20 10:07:34,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=687504.0, ans=0.0 2023-06-20 10:07:35,959 INFO [train.py:996] (2/4) Epoch 4, batch 23100, loss[loss=0.23, simple_loss=0.2807, pruned_loss=0.08964, over 21324.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3288, pruned_loss=0.09739, over 4268381.56 frames. ], batch size: 194, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:07:46,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=687504.0, ans=0.125 2023-06-20 10:07:48,039 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:07:50,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=687504.0, ans=0.0 2023-06-20 10:08:34,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.33 vs. limit=6.0 2023-06-20 10:08:35,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=687684.0, ans=0.125 2023-06-20 10:08:55,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=687744.0, ans=0.0 2023-06-20 10:09:10,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=687744.0, ans=0.0 2023-06-20 10:09:10,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-20 10:09:17,238 INFO [train.py:996] (2/4) Epoch 4, batch 23150, loss[loss=0.3041, simple_loss=0.3465, pruned_loss=0.1308, over 21593.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3227, pruned_loss=0.09627, over 4271436.03 frames. ], batch size: 548, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:09:36,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.820e+02 3.256e+02 3.906e+02 5.764e+02, threshold=6.513e+02, percent-clipped=0.0 2023-06-20 10:10:05,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=687924.0, ans=0.125 2023-06-20 10:10:19,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=687984.0, ans=15.0 2023-06-20 10:10:29,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.42 vs. limit=22.5 2023-06-20 10:10:53,466 INFO [train.py:996] (2/4) Epoch 4, batch 23200, loss[loss=0.193, simple_loss=0.2813, pruned_loss=0.05238, over 19896.00 frames. 
], tot_loss[loss=0.2578, simple_loss=0.3218, pruned_loss=0.0969, over 4284426.93 frames. ], batch size: 703, lr: 7.69e-03, grad_scale: 32.0 2023-06-20 10:11:28,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=688164.0, ans=0.95 2023-06-20 10:11:54,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=688284.0, ans=0.0 2023-06-20 10:12:09,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=688344.0, ans=0.2 2023-06-20 10:12:12,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-20 10:12:26,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.59 vs. limit=15.0 2023-06-20 10:12:36,045 INFO [train.py:996] (2/4) Epoch 4, batch 23250, loss[loss=0.2526, simple_loss=0.316, pruned_loss=0.09462, over 21465.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3229, pruned_loss=0.09907, over 4287769.02 frames. ], batch size: 211, lr: 7.69e-03, grad_scale: 16.0 2023-06-20 10:12:57,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.069e+02 2.915e+02 3.512e+02 4.464e+02 9.491e+02, threshold=7.024e+02, percent-clipped=1.0 2023-06-20 10:13:41,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=688584.0, ans=0.0 2023-06-20 10:14:18,796 INFO [train.py:996] (2/4) Epoch 4, batch 23300, loss[loss=0.2706, simple_loss=0.3904, pruned_loss=0.07536, over 20815.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3322, pruned_loss=0.1018, over 4290497.43 frames. ], batch size: 607, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:14:30,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=688704.0, ans=0.1 2023-06-20 10:14:46,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=688764.0, ans=0.125 2023-06-20 10:16:01,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=688944.0, ans=0.04949747468305833 2023-06-20 10:16:05,792 INFO [train.py:996] (2/4) Epoch 4, batch 23350, loss[loss=0.2244, simple_loss=0.3018, pruned_loss=0.07351, over 21789.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3379, pruned_loss=0.1011, over 4282642.23 frames. ], batch size: 316, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:16:28,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.938e+02 2.739e+02 3.338e+02 4.222e+02 6.703e+02, threshold=6.676e+02, percent-clipped=0.0 2023-06-20 10:16:37,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=689064.0, ans=0.0 2023-06-20 10:16:38,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-20 10:16:53,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. 
limit=6.0 2023-06-20 10:16:53,720 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:17:42,555 INFO [train.py:996] (2/4) Epoch 4, batch 23400, loss[loss=0.2498, simple_loss=0.318, pruned_loss=0.09075, over 15447.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3304, pruned_loss=0.09656, over 4283199.05 frames. ], batch size: 61, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:18:51,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=689484.0, ans=0.125 2023-06-20 10:19:13,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=689544.0, ans=0.1 2023-06-20 10:19:30,594 INFO [train.py:996] (2/4) Epoch 4, batch 23450, loss[loss=0.3087, simple_loss=0.3615, pruned_loss=0.1279, over 21949.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3309, pruned_loss=0.09876, over 4286914.61 frames. ], batch size: 316, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:19:48,765 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.709e+02 2.746e+02 3.178e+02 4.019e+02 6.793e+02, threshold=6.356e+02, percent-clipped=1.0 2023-06-20 10:20:01,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=689664.0, ans=0.04949747468305833 2023-06-20 10:20:16,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=689724.0, ans=0.125 2023-06-20 10:20:22,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=689724.0, ans=0.0 2023-06-20 10:21:08,142 INFO [train.py:996] (2/4) Epoch 4, batch 23500, loss[loss=0.2362, simple_loss=0.2982, pruned_loss=0.08712, over 21630.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.3297, pruned_loss=0.1003, over 4291678.53 frames. ], batch size: 263, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:21:34,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=689964.0, ans=0.2 2023-06-20 10:22:05,433 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:22:35,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=690144.0, ans=0.2 2023-06-20 10:22:51,304 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:22:52,387 INFO [train.py:996] (2/4) Epoch 4, batch 23550, loss[loss=0.2574, simple_loss=0.3154, pruned_loss=0.09966, over 21374.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3242, pruned_loss=0.09968, over 4285234.29 frames. 
], batch size: 131, lr: 7.68e-03, grad_scale: 8.0 2023-06-20 10:22:58,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=690204.0, ans=0.0 2023-06-20 10:22:59,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=690204.0, ans=0.5 2023-06-20 10:23:10,767 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.904e+02 3.219e+02 3.862e+02 7.198e+02, threshold=6.438e+02, percent-clipped=1.0 2023-06-20 10:24:14,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=690444.0, ans=0.2 2023-06-20 10:24:26,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=690444.0, ans=0.125 2023-06-20 10:24:30,329 INFO [train.py:996] (2/4) Epoch 4, batch 23600, loss[loss=0.2846, simple_loss=0.3517, pruned_loss=0.1088, over 21706.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3254, pruned_loss=0.09866, over 4273809.39 frames. ], batch size: 351, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:24:51,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=690564.0, ans=0.125 2023-06-20 10:25:52,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=690744.0, ans=0.0 2023-06-20 10:26:10,341 INFO [train.py:996] (2/4) Epoch 4, batch 23650, loss[loss=0.2267, simple_loss=0.2993, pruned_loss=0.07708, over 21458.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3251, pruned_loss=0.09693, over 4265703.86 frames. ], batch size: 211, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:26:20,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=690804.0, ans=0.2 2023-06-20 10:26:38,378 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.031e+02 3.037e+02 3.480e+02 4.305e+02 8.157e+02, threshold=6.960e+02, percent-clipped=7.0 2023-06-20 10:27:05,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=690924.0, ans=0.125 2023-06-20 10:27:27,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=690984.0, ans=0.125 2023-06-20 10:27:34,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.79 vs. limit=10.0 2023-06-20 10:27:53,244 INFO [train.py:996] (2/4) Epoch 4, batch 23700, loss[loss=0.2254, simple_loss=0.2934, pruned_loss=0.07875, over 21307.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3278, pruned_loss=0.09682, over 4273683.06 frames. ], batch size: 159, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:29:10,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=691284.0, ans=0.125 2023-06-20 10:29:48,039 INFO [train.py:996] (2/4) Epoch 4, batch 23750, loss[loss=0.2585, simple_loss=0.3313, pruned_loss=0.09283, over 21701.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3324, pruned_loss=0.0982, over 4271691.36 frames. 
], batch size: 351, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:30:11,545 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.867e+02 3.217e+02 4.313e+02 7.122e+02, threshold=6.434e+02, percent-clipped=1.0 2023-06-20 10:30:13,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=691464.0, ans=0.125 2023-06-20 10:30:47,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=691584.0, ans=0.1 2023-06-20 10:31:18,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=691644.0, ans=0.025 2023-06-20 10:31:37,972 INFO [train.py:996] (2/4) Epoch 4, batch 23800, loss[loss=0.3079, simple_loss=0.3979, pruned_loss=0.109, over 21626.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3305, pruned_loss=0.09615, over 4270472.78 frames. ], batch size: 414, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:31:45,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=691704.0, ans=0.2 2023-06-20 10:32:06,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=691764.0, ans=0.95 2023-06-20 10:32:16,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=691824.0, ans=0.1 2023-06-20 10:32:28,795 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-20 10:32:50,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=691884.0, ans=0.1 2023-06-20 10:33:15,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=691944.0, ans=0.0 2023-06-20 10:33:23,376 INFO [train.py:996] (2/4) Epoch 4, batch 23850, loss[loss=0.265, simple_loss=0.3364, pruned_loss=0.09675, over 21673.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3396, pruned_loss=0.09908, over 4260827.14 frames. ], batch size: 351, lr: 7.67e-03, grad_scale: 16.0 2023-06-20 10:33:36,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=692004.0, ans=0.07 2023-06-20 10:33:48,519 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 3.013e+02 3.868e+02 5.325e+02 1.077e+03, threshold=7.737e+02, percent-clipped=14.0 2023-06-20 10:34:48,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=692184.0, ans=0.5 2023-06-20 10:35:13,533 INFO [train.py:996] (2/4) Epoch 4, batch 23900, loss[loss=0.2647, simple_loss=0.3333, pruned_loss=0.0981, over 21282.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3473, pruned_loss=0.102, over 4262851.74 frames. ], batch size: 159, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:36:02,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.39 vs. 
limit=15.0 2023-06-20 10:36:09,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=692484.0, ans=0.125 2023-06-20 10:36:13,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=692484.0, ans=0.0 2023-06-20 10:36:19,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=692484.0, ans=0.125 2023-06-20 10:36:52,306 INFO [train.py:996] (2/4) Epoch 4, batch 23950, loss[loss=0.2791, simple_loss=0.3365, pruned_loss=0.1108, over 21673.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3405, pruned_loss=0.1015, over 4261823.65 frames. ], batch size: 298, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:36:54,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=692604.0, ans=0.125 2023-06-20 10:36:59,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=692604.0, ans=0.125 2023-06-20 10:37:02,843 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:37:06,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=692604.0, ans=0.1 2023-06-20 10:37:10,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.016e+02 2.929e+02 3.410e+02 4.340e+02 7.845e+02, threshold=6.819e+02, percent-clipped=1.0 2023-06-20 10:37:34,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=692724.0, ans=10.0 2023-06-20 10:37:41,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=692724.0, ans=0.125 2023-06-20 10:37:53,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=692784.0, ans=0.125 2023-06-20 10:38:22,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=692844.0, ans=0.125 2023-06-20 10:38:25,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=692844.0, ans=0.0 2023-06-20 10:38:32,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=692844.0, ans=0.2 2023-06-20 10:38:36,306 INFO [train.py:996] (2/4) Epoch 4, batch 24000, loss[loss=0.2918, simple_loss=0.353, pruned_loss=0.1153, over 21417.00 frames. ], tot_loss[loss=0.2765, simple_loss=0.3426, pruned_loss=0.1052, over 4266304.50 frames. ], batch size: 549, lr: 7.66e-03, grad_scale: 32.0 2023-06-20 10:38:36,306 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 10:38:46,478 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.4328, 1.5688, 3.5021, 2.2884], device='cuda:2') 2023-06-20 10:38:48,341 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.5198, 4.5567, 4.1813, 4.2394], device='cuda:2') 2023-06-20 10:38:53,766 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2722, simple_loss=0.3716, pruned_loss=0.08645, over 1796401.00 frames. 
2023-06-20 10:38:53,766 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 10:38:54,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=692904.0, ans=0.0 2023-06-20 10:40:37,788 INFO [train.py:996] (2/4) Epoch 4, batch 24050, loss[loss=0.2465, simple_loss=0.3415, pruned_loss=0.07573, over 20845.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3426, pruned_loss=0.1044, over 4271511.02 frames. ], batch size: 607, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:41:03,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.885e+02 3.459e+02 4.129e+02 6.625e+02, threshold=6.917e+02, percent-clipped=0.0 2023-06-20 10:41:11,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=693264.0, ans=0.2 2023-06-20 10:41:33,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=693324.0, ans=0.0 2023-06-20 10:41:41,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=693384.0, ans=0.1 2023-06-20 10:41:58,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=693384.0, ans=0.125 2023-06-20 10:42:01,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=693384.0, ans=0.2 2023-06-20 10:42:21,509 INFO [train.py:996] (2/4) Epoch 4, batch 24100, loss[loss=0.3103, simple_loss=0.3714, pruned_loss=0.1246, over 21770.00 frames. ], tot_loss[loss=0.2717, simple_loss=0.3406, pruned_loss=0.1014, over 4281150.05 frames. ], batch size: 441, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:42:25,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=693504.0, ans=0.125 2023-06-20 10:42:35,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=693504.0, ans=0.125 2023-06-20 10:42:48,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=693564.0, ans=0.125 2023-06-20 10:43:52,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=693744.0, ans=0.0 2023-06-20 10:44:03,703 INFO [train.py:996] (2/4) Epoch 4, batch 24150, loss[loss=0.2623, simple_loss=0.3188, pruned_loss=0.1029, over 21454.00 frames. ], tot_loss[loss=0.2747, simple_loss=0.3413, pruned_loss=0.104, over 4288461.40 frames. 
], batch size: 194, lr: 7.66e-03, grad_scale: 16.0 2023-06-20 10:44:29,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=693864.0, ans=0.125 2023-06-20 10:44:38,769 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 3.099e+02 3.647e+02 4.955e+02 8.844e+02, threshold=7.295e+02, percent-clipped=4.0 2023-06-20 10:44:47,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=693864.0, ans=0.125 2023-06-20 10:44:47,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=693864.0, ans=0.1 2023-06-20 10:44:55,765 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:45:06,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=693924.0, ans=0.125 2023-06-20 10:45:52,591 INFO [train.py:996] (2/4) Epoch 4, batch 24200, loss[loss=0.2586, simple_loss=0.3353, pruned_loss=0.09095, over 21594.00 frames. ], tot_loss[loss=0.2771, simple_loss=0.3442, pruned_loss=0.105, over 4292405.50 frames. ], batch size: 230, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:46:08,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=694104.0, ans=0.0 2023-06-20 10:46:40,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-20 10:47:32,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=694344.0, ans=0.0 2023-06-20 10:47:46,470 INFO [train.py:996] (2/4) Epoch 4, batch 24250, loss[loss=0.2118, simple_loss=0.2895, pruned_loss=0.06701, over 21224.00 frames. ], tot_loss[loss=0.2669, simple_loss=0.3397, pruned_loss=0.09704, over 4287451.36 frames. ], batch size: 143, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:48:11,643 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.908e+02 2.800e+02 3.363e+02 4.220e+02 7.304e+02, threshold=6.726e+02, percent-clipped=1.0 2023-06-20 10:48:15,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.43 vs. limit=10.0 2023-06-20 10:48:38,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.27 vs. limit=15.0 2023-06-20 10:48:54,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=694584.0, ans=0.2 2023-06-20 10:49:03,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=694644.0, ans=0.125 2023-06-20 10:49:28,809 INFO [train.py:996] (2/4) Epoch 4, batch 24300, loss[loss=0.2188, simple_loss=0.2809, pruned_loss=0.07837, over 21881.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3324, pruned_loss=0.09105, over 4281160.12 frames. 
], batch size: 107, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:49:31,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=694704.0, ans=0.0 2023-06-20 10:49:39,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-20 10:50:05,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=694824.0, ans=0.0 2023-06-20 10:50:33,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=694884.0, ans=0.125 2023-06-20 10:51:12,281 INFO [train.py:996] (2/4) Epoch 4, batch 24350, loss[loss=0.2581, simple_loss=0.3066, pruned_loss=0.1048, over 20185.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3288, pruned_loss=0.09129, over 4286741.45 frames. ], batch size: 702, lr: 7.65e-03, grad_scale: 16.0 2023-06-20 10:51:29,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=695004.0, ans=0.1 2023-06-20 10:51:37,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.818e+02 2.964e+02 3.769e+02 4.911e+02 1.046e+03, threshold=7.538e+02, percent-clipped=11.0 2023-06-20 10:52:02,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=695124.0, ans=0.2 2023-06-20 10:52:21,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=695184.0, ans=0.2 2023-06-20 10:52:56,617 INFO [train.py:996] (2/4) Epoch 4, batch 24400, loss[loss=0.2896, simple_loss=0.3563, pruned_loss=0.1115, over 21721.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3326, pruned_loss=0.09558, over 4285940.07 frames. ], batch size: 333, lr: 7.65e-03, grad_scale: 32.0 2023-06-20 10:53:00,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=695304.0, ans=0.125 2023-06-20 10:54:21,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=695544.0, ans=0.0 2023-06-20 10:54:39,030 INFO [train.py:996] (2/4) Epoch 4, batch 24450, loss[loss=0.2586, simple_loss=0.3452, pruned_loss=0.08596, over 21753.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3373, pruned_loss=0.09779, over 4279545.48 frames. ], batch size: 332, lr: 7.65e-03, grad_scale: 32.0 2023-06-20 10:54:46,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=695604.0, ans=0.2 2023-06-20 10:54:59,448 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.828e+02 3.266e+02 3.769e+02 5.234e+02, threshold=6.531e+02, percent-clipped=0.0 2023-06-20 10:55:02,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.29 vs. limit=12.0 2023-06-20 10:56:10,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. 
limit=15.0 2023-06-20 10:56:15,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=695844.0, ans=0.0 2023-06-20 10:56:15,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=695844.0, ans=0.2 2023-06-20 10:56:21,857 INFO [train.py:996] (2/4) Epoch 4, batch 24500, loss[loss=0.2878, simple_loss=0.3436, pruned_loss=0.116, over 21850.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3366, pruned_loss=0.09798, over 4275008.96 frames. ], batch size: 107, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:56:43,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=695964.0, ans=0.1 2023-06-20 10:56:53,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=695964.0, ans=0.0 2023-06-20 10:57:40,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=696084.0, ans=0.0 2023-06-20 10:57:46,676 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 10:58:02,728 INFO [train.py:996] (2/4) Epoch 4, batch 24550, loss[loss=0.2659, simple_loss=0.3087, pruned_loss=0.1115, over 20198.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3387, pruned_loss=0.1007, over 4273759.57 frames. ], batch size: 703, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 10:58:21,867 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.855e+02 3.314e+02 4.015e+02 6.051e+02, threshold=6.629e+02, percent-clipped=0.0 2023-06-20 10:59:27,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=696444.0, ans=0.2 2023-06-20 10:59:40,432 INFO [train.py:996] (2/4) Epoch 4, batch 24600, loss[loss=0.2565, simple_loss=0.321, pruned_loss=0.09599, over 21814.00 frames. ], tot_loss[loss=0.2689, simple_loss=0.3344, pruned_loss=0.1017, over 4266341.12 frames. ], batch size: 372, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 11:00:01,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=696564.0, ans=0.125 2023-06-20 11:00:29,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=696624.0, ans=0.1 2023-06-20 11:00:30,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=696624.0, ans=0.2 2023-06-20 11:00:49,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=696684.0, ans=0.125 2023-06-20 11:01:18,231 INFO [train.py:996] (2/4) Epoch 4, batch 24650, loss[loss=0.2102, simple_loss=0.2711, pruned_loss=0.07466, over 21668.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3288, pruned_loss=0.1002, over 4269567.98 frames. 
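For readers scanning the loss fields: each batch entry reports the combined training loss together with its two pruned-RNN-T components, the simple (linear-joiner) loss and the pruned (full-joiner) loss, apparently normalized per frame, while tot_loss is the same quantity aggregated over the recent window of batches whose total frame count is shown. A rough sketch of how two such components might be combined in an icefall-style pruned-transducer recipe follows; the scale names and the warm-up ramp are assumptions for illustration, not values read from this log.

    # Rough, assumed sketch of combining the two components shown in the loss[...] fields.
    def combine_transducer_losses(
        simple_loss,          # loss of the simple (linear) joiner
        pruned_loss,          # loss of the full joiner on the pruned lattice
        batch_idx_train,      # global batch counter
        warm_step,            # length of the warm-up ramp (assumed)
        simple_loss_scale,    # long-run weight of the simple loss (assumed)
    ):
        if batch_idx_train >= warm_step:
            s, p = simple_loss_scale, 1.0
        else:
            # Early on, lean on the simple loss and ramp the pruned loss in.
            frac = batch_idx_train / warm_step
            s = 1.0 - frac * (1.0 - simple_loss_scale)
            p = 0.1 + 0.9 * frac
        return s * simple_loss + p * pruned_loss
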
], batch size: 282, lr: 7.64e-03, grad_scale: 32.0 2023-06-20 11:01:20,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=696804.0, ans=0.125 2023-06-20 11:01:32,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=696804.0, ans=0.0 2023-06-20 11:01:37,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=696864.0, ans=0.0 2023-06-20 11:01:39,824 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 3.151e+02 3.751e+02 4.902e+02 9.106e+02, threshold=7.501e+02, percent-clipped=6.0 2023-06-20 11:01:44,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-20 11:01:56,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=696924.0, ans=0.0 2023-06-20 11:02:11,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-20 11:03:01,118 INFO [train.py:996] (2/4) Epoch 4, batch 24700, loss[loss=0.2462, simple_loss=0.3066, pruned_loss=0.09288, over 21137.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3272, pruned_loss=0.09891, over 4257896.72 frames. ], batch size: 176, lr: 7.64e-03, grad_scale: 16.0 2023-06-20 11:03:21,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=697164.0, ans=0.125 2023-06-20 11:03:49,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=697224.0, ans=0.125 2023-06-20 11:03:58,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=697224.0, ans=0.0 2023-06-20 11:04:38,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.84 vs. limit=22.5 2023-06-20 11:04:38,955 INFO [train.py:996] (2/4) Epoch 4, batch 24750, loss[loss=0.2436, simple_loss=0.2984, pruned_loss=0.09435, over 21911.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.32, pruned_loss=0.09636, over 4269772.92 frames. ], batch size: 373, lr: 7.64e-03, grad_scale: 16.0 2023-06-20 11:04:46,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-20 11:05:00,213 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 2.625e+02 3.082e+02 3.571e+02 6.291e+02, threshold=6.165e+02, percent-clipped=0.0 2023-06-20 11:05:02,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=697464.0, ans=0.2 2023-06-20 11:05:48,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.19 vs. 
limit=15.0 2023-06-20 11:05:56,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=697584.0, ans=0.2 2023-06-20 11:05:59,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=697584.0, ans=0.125 2023-06-20 11:06:22,080 INFO [train.py:996] (2/4) Epoch 4, batch 24800, loss[loss=0.2211, simple_loss=0.281, pruned_loss=0.08066, over 21559.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.314, pruned_loss=0.09556, over 4278916.02 frames. ], batch size: 132, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:06:23,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=697704.0, ans=0.0 2023-06-20 11:06:46,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=697764.0, ans=0.0 2023-06-20 11:06:47,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=697764.0, ans=0.125 2023-06-20 11:06:47,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=697764.0, ans=0.07 2023-06-20 11:06:54,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=15.0 2023-06-20 11:07:16,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=22.5 2023-06-20 11:07:33,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=697884.0, ans=0.125 2023-06-20 11:07:33,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=697884.0, ans=0.125 2023-06-20 11:07:36,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=697884.0, ans=0.125 2023-06-20 11:07:39,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=697884.0, ans=0.125 2023-06-20 11:07:41,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=697884.0, ans=0.0 2023-06-20 11:08:05,516 INFO [train.py:996] (2/4) Epoch 4, batch 24850, loss[loss=0.3324, simple_loss=0.382, pruned_loss=0.1414, over 21591.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3149, pruned_loss=0.09713, over 4280160.99 frames. 
], batch size: 471, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:08:20,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=698064.0, ans=0.125 2023-06-20 11:08:26,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=698064.0, ans=0.0 2023-06-20 11:08:27,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.980e+02 3.585e+02 4.162e+02 8.983e+02, threshold=7.171e+02, percent-clipped=3.0 2023-06-20 11:08:37,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=698064.0, ans=0.125 2023-06-20 11:08:52,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=698124.0, ans=0.0 2023-06-20 11:08:52,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=698124.0, ans=0.125 2023-06-20 11:09:06,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=698124.0, ans=0.0 2023-06-20 11:09:10,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=698124.0, ans=0.125 2023-06-20 11:09:15,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=698184.0, ans=0.0 2023-06-20 11:09:49,569 INFO [train.py:996] (2/4) Epoch 4, batch 24900, loss[loss=0.3094, simple_loss=0.3693, pruned_loss=0.1247, over 21701.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3154, pruned_loss=0.09645, over 4284452.44 frames. ], batch size: 351, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:10:35,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=698424.0, ans=0.125 2023-06-20 11:10:54,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=698484.0, ans=0.0 2023-06-20 11:11:12,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=698544.0, ans=0.1 2023-06-20 11:11:29,335 INFO [train.py:996] (2/4) Epoch 4, batch 24950, loss[loss=0.385, simple_loss=0.4208, pruned_loss=0.1746, over 21455.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.325, pruned_loss=0.1018, over 4280366.46 frames. 
], batch size: 471, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:11:34,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=698604.0, ans=0.1 2023-06-20 11:12:12,246 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 3.311e+02 3.992e+02 4.985e+02 7.150e+02, threshold=7.983e+02, percent-clipped=0.0 2023-06-20 11:12:18,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=698664.0, ans=0.1 2023-06-20 11:12:43,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=698784.0, ans=0.125 2023-06-20 11:13:06,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=698844.0, ans=0.1 2023-06-20 11:13:20,111 INFO [train.py:996] (2/4) Epoch 4, batch 25000, loss[loss=0.2388, simple_loss=0.3058, pruned_loss=0.08589, over 21628.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.331, pruned_loss=0.1027, over 4275392.40 frames. ], batch size: 298, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:13:27,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=698904.0, ans=0.0 2023-06-20 11:13:53,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=698964.0, ans=0.125 2023-06-20 11:13:55,619 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2023-06-20 11:13:59,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=698964.0, ans=0.2 2023-06-20 11:14:37,822 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:15:02,944 INFO [train.py:996] (2/4) Epoch 4, batch 25050, loss[loss=0.2545, simple_loss=0.3107, pruned_loss=0.0992, over 21794.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3238, pruned_loss=0.1002, over 4259671.45 frames. ], batch size: 352, lr: 7.63e-03, grad_scale: 32.0 2023-06-20 11:15:40,007 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.775e+02 3.166e+02 3.769e+02 6.146e+02, threshold=6.333e+02, percent-clipped=0.0 2023-06-20 11:16:35,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=699444.0, ans=0.0 2023-06-20 11:16:47,515 INFO [train.py:996] (2/4) Epoch 4, batch 25100, loss[loss=0.2243, simple_loss=0.2791, pruned_loss=0.08473, over 20754.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3183, pruned_loss=0.09925, over 4251802.14 frames. ], batch size: 608, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:16:48,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.45 vs. 
limit=15.0 2023-06-20 11:17:25,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=699564.0, ans=0.0 2023-06-20 11:17:48,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=699624.0, ans=0.125 2023-06-20 11:18:29,742 INFO [train.py:996] (2/4) Epoch 4, batch 25150, loss[loss=0.2345, simple_loss=0.312, pruned_loss=0.07855, over 21454.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3221, pruned_loss=0.09676, over 4253739.02 frames. ], batch size: 131, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:18:33,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=699804.0, ans=0.125 2023-06-20 11:19:00,278 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.032e+02 2.678e+02 3.105e+02 3.619e+02 6.270e+02, threshold=6.210e+02, percent-clipped=0.0 2023-06-20 11:19:15,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=699924.0, ans=0.125 2023-06-20 11:19:26,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=699924.0, ans=0.2 2023-06-20 11:19:34,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=699984.0, ans=0.1 2023-06-20 11:19:34,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=699984.0, ans=0.0 2023-06-20 11:19:52,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=700044.0, ans=0.1 2023-06-20 11:20:06,423 INFO [train.py:996] (2/4) Epoch 4, batch 25200, loss[loss=0.2789, simple_loss=0.3547, pruned_loss=0.1015, over 21692.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3216, pruned_loss=0.09447, over 4250802.23 frames. ], batch size: 389, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:20:06,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=700104.0, ans=0.2 2023-06-20 11:21:23,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.67 vs. limit=6.0 2023-06-20 11:21:33,135 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.65 vs. limit=15.0 2023-06-20 11:21:43,493 INFO [train.py:996] (2/4) Epoch 4, batch 25250, loss[loss=0.2625, simple_loss=0.318, pruned_loss=0.1035, over 21768.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3198, pruned_loss=0.09265, over 4263502.12 frames. 
], batch size: 102, lr: 7.62e-03, grad_scale: 32.0 2023-06-20 11:21:56,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=700404.0, ans=0.125 2023-06-20 11:22:14,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=700464.0, ans=0.125 2023-06-20 11:22:18,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=700464.0, ans=0.0 2023-06-20 11:22:20,666 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.958e+02 2.744e+02 3.112e+02 3.810e+02 6.947e+02, threshold=6.224e+02, percent-clipped=3.0 2023-06-20 11:23:04,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=700584.0, ans=10.0 2023-06-20 11:23:18,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=700644.0, ans=0.0 2023-06-20 11:23:32,857 INFO [train.py:996] (2/4) Epoch 4, batch 25300, loss[loss=0.2392, simple_loss=0.3152, pruned_loss=0.08155, over 20796.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3172, pruned_loss=0.09172, over 4259425.73 frames. ], batch size: 608, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:23:51,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=700704.0, ans=0.2 2023-06-20 11:23:51,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=700704.0, ans=0.0 2023-06-20 11:24:11,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=700764.0, ans=0.125 2023-06-20 11:24:14,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=700764.0, ans=0.125 2023-06-20 11:24:23,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=700824.0, ans=0.125 2023-06-20 11:24:34,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=700824.0, ans=0.1 2023-06-20 11:24:44,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=700884.0, ans=0.125 2023-06-20 11:24:57,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-20 11:25:19,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=700944.0, ans=0.2 2023-06-20 11:25:27,751 INFO [train.py:996] (2/4) Epoch 4, batch 25350, loss[loss=0.2099, simple_loss=0.2846, pruned_loss=0.06759, over 21522.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3207, pruned_loss=0.09218, over 4257625.56 frames. 
], batch size: 230, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:25:55,991 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.721e+02 3.100e+02 3.889e+02 7.002e+02, threshold=6.200e+02, percent-clipped=1.0 2023-06-20 11:25:59,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=701064.0, ans=0.0 2023-06-20 11:26:15,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=701124.0, ans=0.1 2023-06-20 11:27:05,587 INFO [train.py:996] (2/4) Epoch 4, batch 25400, loss[loss=0.2115, simple_loss=0.2775, pruned_loss=0.07272, over 21551.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3172, pruned_loss=0.0912, over 4261961.59 frames. ], batch size: 263, lr: 7.62e-03, grad_scale: 16.0 2023-06-20 11:27:43,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=701424.0, ans=0.2 2023-06-20 11:27:50,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=701424.0, ans=0.0 2023-06-20 11:27:51,899 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:27:52,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=701424.0, ans=0.0 2023-06-20 11:28:42,932 INFO [train.py:996] (2/4) Epoch 4, batch 25450, loss[loss=0.2286, simple_loss=0.3214, pruned_loss=0.06787, over 21827.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3175, pruned_loss=0.093, over 4271800.93 frames. ], batch size: 282, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:29:04,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=701604.0, ans=0.0 2023-06-20 11:29:17,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.926e+02 3.539e+02 4.386e+02 7.693e+02, threshold=7.077e+02, percent-clipped=6.0 2023-06-20 11:30:33,815 INFO [train.py:996] (2/4) Epoch 4, batch 25500, loss[loss=0.2308, simple_loss=0.3172, pruned_loss=0.07224, over 21691.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3165, pruned_loss=0.0892, over 4253461.53 frames. ], batch size: 263, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:30:46,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=701904.0, ans=0.0 2023-06-20 11:31:16,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=702024.0, ans=0.0 2023-06-20 11:31:34,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=702084.0, ans=0.1 2023-06-20 11:31:42,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=702084.0, ans=0.0 2023-06-20 11:32:12,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.99 vs. limit=15.0 2023-06-20 11:32:16,682 INFO [train.py:996] (2/4) Epoch 4, batch 25550, loss[loss=0.2247, simple_loss=0.3148, pruned_loss=0.06732, over 21419.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3219, pruned_loss=0.08829, over 4252863.69 frames. 
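On the ScheduledFloat entries: scaling.py logs the current value (ans) of hyper-parameters such as dropout probabilities, skip rates, and balancer probabilities that are scheduled as a function of batch_count rather than held fixed. A minimal piecewise-linear version of such a schedule is sketched below; the breakpoints are invented for illustration and the real schedules live in scaling.py.

    # Minimal sketch of a batch-count-based schedule like the ones behind the
    # "ScheduledFloat: name=..., batch_count=..., ans=..." lines.
    def scheduled_float(batch_count, points):
        # points: sorted (batch_count, value) pairs, linearly interpolated,
        # clamped outside the range, e.g. [(0.0, 0.3), (20000.0, 0.1)].
        if batch_count <= points[0][0]:
            return points[0][1]
        if batch_count >= points[-1][0]:
            return points[-1][1]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            if batch_count <= x1:
                t = (batch_count - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)

    # Example with made-up breakpoints: well past the last breakpoint, the value is clamped.
    print(scheduled_float(700104.0, [(0.0, 0.3), (20000.0, 0.125)]))  # -> 0.125
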
], batch size: 211, lr: 7.61e-03, grad_scale: 16.0 2023-06-20 11:32:45,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.942e+02 2.927e+02 3.531e+02 4.668e+02 7.861e+02, threshold=7.061e+02, percent-clipped=1.0 2023-06-20 11:33:05,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=702324.0, ans=0.2 2023-06-20 11:33:05,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=702324.0, ans=0.1 2023-06-20 11:33:22,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-20 11:33:23,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=702384.0, ans=0.125 2023-06-20 11:33:50,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=702444.0, ans=0.07 2023-06-20 11:33:55,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=702444.0, ans=0.5 2023-06-20 11:34:05,303 INFO [train.py:996] (2/4) Epoch 4, batch 25600, loss[loss=0.2948, simple_loss=0.3595, pruned_loss=0.1151, over 21737.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3258, pruned_loss=0.08923, over 4255090.38 frames. ], batch size: 351, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:34:29,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=702564.0, ans=0.125 2023-06-20 11:34:32,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=702564.0, ans=0.07 2023-06-20 11:35:02,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=702684.0, ans=0.125 2023-06-20 11:35:47,383 INFO [train.py:996] (2/4) Epoch 4, batch 25650, loss[loss=0.2321, simple_loss=0.2889, pruned_loss=0.08763, over 21450.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3272, pruned_loss=0.09303, over 4259808.19 frames. ], batch size: 211, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:36:10,588 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.198e+02 2.974e+02 3.726e+02 4.769e+02 9.123e+02, threshold=7.452e+02, percent-clipped=4.0 2023-06-20 11:36:50,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=702984.0, ans=15.0 2023-06-20 11:37:31,127 INFO [train.py:996] (2/4) Epoch 4, batch 25700, loss[loss=0.2798, simple_loss=0.3326, pruned_loss=0.1135, over 21728.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3258, pruned_loss=0.0951, over 4258657.90 frames. 
], batch size: 441, lr: 7.61e-03, grad_scale: 32.0 2023-06-20 11:37:34,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=703104.0, ans=0.0 2023-06-20 11:37:47,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=703164.0, ans=0.125 2023-06-20 11:38:03,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=703224.0, ans=0.125 2023-06-20 11:38:06,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=703224.0, ans=0.0 2023-06-20 11:38:22,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.11 vs. limit=22.5 2023-06-20 11:39:00,398 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 11:39:12,218 INFO [train.py:996] (2/4) Epoch 4, batch 25750, loss[loss=0.328, simple_loss=0.4208, pruned_loss=0.1176, over 21323.00 frames. ], tot_loss[loss=0.2639, simple_loss=0.3313, pruned_loss=0.09825, over 4267638.80 frames. ], batch size: 548, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:39:16,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=703404.0, ans=0.125 2023-06-20 11:39:35,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 3.258e+02 3.853e+02 4.721e+02 7.384e+02, threshold=7.705e+02, percent-clipped=0.0 2023-06-20 11:40:57,313 INFO [train.py:996] (2/4) Epoch 4, batch 25800, loss[loss=0.2803, simple_loss=0.347, pruned_loss=0.1068, over 21705.00 frames. ], tot_loss[loss=0.2754, simple_loss=0.3439, pruned_loss=0.1035, over 4262551.87 frames. ], batch size: 332, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:42:28,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=703944.0, ans=0.0 2023-06-20 11:42:39,614 INFO [train.py:996] (2/4) Epoch 4, batch 25850, loss[loss=0.1869, simple_loss=0.2489, pruned_loss=0.06242, over 16796.00 frames. ], tot_loss[loss=0.2757, simple_loss=0.3455, pruned_loss=0.1029, over 4259666.69 frames. ], batch size: 62, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:42:40,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=704004.0, ans=0.0 2023-06-20 11:43:23,437 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 3.030e+02 3.694e+02 4.405e+02 6.989e+02, threshold=7.387e+02, percent-clipped=0.0 2023-06-20 11:43:56,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=704184.0, ans=0.125 2023-06-20 11:44:28,597 INFO [train.py:996] (2/4) Epoch 4, batch 25900, loss[loss=0.2806, simple_loss=0.3472, pruned_loss=0.107, over 21798.00 frames. ], tot_loss[loss=0.2762, simple_loss=0.3462, pruned_loss=0.1031, over 4267169.92 frames. 
], batch size: 112, lr: 7.60e-03, grad_scale: 16.0 2023-06-20 11:44:42,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=704304.0, ans=0.0 2023-06-20 11:45:29,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=704424.0, ans=0.125 2023-06-20 11:45:40,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=704484.0, ans=0.0 2023-06-20 11:46:17,919 INFO [train.py:996] (2/4) Epoch 4, batch 25950, loss[loss=0.3069, simple_loss=0.3717, pruned_loss=0.121, over 21282.00 frames. ], tot_loss[loss=0.2832, simple_loss=0.3533, pruned_loss=0.1065, over 4278166.22 frames. ], batch size: 143, lr: 7.60e-03, grad_scale: 16.0 2023-06-20 11:46:18,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=704604.0, ans=0.125 2023-06-20 11:46:52,828 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.258e+02 4.006e+02 4.613e+02 7.769e+02, threshold=8.011e+02, percent-clipped=1.0 2023-06-20 11:47:32,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=704784.0, ans=0.2 2023-06-20 11:48:11,741 INFO [train.py:996] (2/4) Epoch 4, batch 26000, loss[loss=0.2827, simple_loss=0.3545, pruned_loss=0.1055, over 21774.00 frames. ], tot_loss[loss=0.283, simple_loss=0.3545, pruned_loss=0.1058, over 4278529.27 frames. ], batch size: 247, lr: 7.60e-03, grad_scale: 32.0 2023-06-20 11:48:27,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-20 11:48:39,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=704964.0, ans=0.2 2023-06-20 11:49:15,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-20 11:49:22,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.93 vs. limit=22.5 2023-06-20 11:49:39,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=705144.0, ans=0.125 2023-06-20 11:49:53,042 INFO [train.py:996] (2/4) Epoch 4, batch 26050, loss[loss=0.2585, simple_loss=0.3204, pruned_loss=0.09827, over 21861.00 frames. ], tot_loss[loss=0.2825, simple_loss=0.353, pruned_loss=0.106, over 4284232.34 frames. ], batch size: 371, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:50:10,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.14 vs. limit=15.0 2023-06-20 11:50:17,437 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.846e+02 3.300e+02 3.976e+02 7.984e+02, threshold=6.600e+02, percent-clipped=0.0 2023-06-20 11:50:42,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=705384.0, ans=0.07 2023-06-20 11:51:35,561 INFO [train.py:996] (2/4) Epoch 4, batch 26100, loss[loss=0.2467, simple_loss=0.3076, pruned_loss=0.09294, over 21843.00 frames. ], tot_loss[loss=0.2789, simple_loss=0.3472, pruned_loss=0.1053, over 4277416.90 frames. 
], batch size: 441, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:51:47,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=705504.0, ans=0.125 2023-06-20 11:53:19,690 INFO [train.py:996] (2/4) Epoch 4, batch 26150, loss[loss=0.311, simple_loss=0.3656, pruned_loss=0.1282, over 21346.00 frames. ], tot_loss[loss=0.2772, simple_loss=0.3432, pruned_loss=0.1056, over 4281182.60 frames. ], batch size: 159, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:53:42,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=705864.0, ans=0.125 2023-06-20 11:53:45,113 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.363e+02 3.006e+02 3.444e+02 4.248e+02 6.303e+02, threshold=6.888e+02, percent-clipped=0.0 2023-06-20 11:55:05,096 INFO [train.py:996] (2/4) Epoch 4, batch 26200, loss[loss=0.3823, simple_loss=0.4516, pruned_loss=0.1566, over 21509.00 frames. ], tot_loss[loss=0.2756, simple_loss=0.3443, pruned_loss=0.1035, over 4278952.55 frames. ], batch size: 508, lr: 7.59e-03, grad_scale: 32.0 2023-06-20 11:55:22,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=706164.0, ans=0.125 2023-06-20 11:55:29,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=22.5 2023-06-20 11:56:08,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=706284.0, ans=0.125 2023-06-20 11:56:10,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=12.0 2023-06-20 11:56:46,825 INFO [train.py:996] (2/4) Epoch 4, batch 26250, loss[loss=0.2769, simple_loss=0.3453, pruned_loss=0.1042, over 21747.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.346, pruned_loss=0.1008, over 4280604.99 frames. ], batch size: 389, lr: 7.59e-03, grad_scale: 16.0 2023-06-20 11:57:12,720 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 2.833e+02 3.243e+02 4.065e+02 7.438e+02, threshold=6.486e+02, percent-clipped=1.0 2023-06-20 11:57:13,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=706464.0, ans=0.0 2023-06-20 11:57:23,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=706524.0, ans=0.125 2023-06-20 11:58:28,971 INFO [train.py:996] (2/4) Epoch 4, batch 26300, loss[loss=0.2672, simple_loss=0.3267, pruned_loss=0.1039, over 21871.00 frames. ], tot_loss[loss=0.2732, simple_loss=0.3432, pruned_loss=0.1016, over 4286861.26 frames. ], batch size: 124, lr: 7.59e-03, grad_scale: 16.0 2023-06-20 11:58:37,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=706704.0, ans=0.07 2023-06-20 11:59:04,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=706764.0, ans=0.125 2023-06-20 11:59:51,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=706884.0, ans=0.125 2023-06-20 12:00:14,488 INFO [train.py:996] (2/4) Epoch 4, batch 26350, loss[loss=0.2676, simple_loss=0.3387, pruned_loss=0.0982, over 21658.00 frames. 
], tot_loss[loss=0.2739, simple_loss=0.3421, pruned_loss=0.1029, over 4291866.65 frames. ], batch size: 112, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:00:50,660 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.098e+02 3.455e+02 4.050e+02 6.767e+02, threshold=6.909e+02, percent-clipped=5.0 2023-06-20 12:00:52,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=707064.0, ans=0.125 2023-06-20 12:01:06,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=707124.0, ans=0.09899494936611666 2023-06-20 12:01:24,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=707184.0, ans=0.0 2023-06-20 12:01:25,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=707184.0, ans=0.125 2023-06-20 12:01:57,031 INFO [train.py:996] (2/4) Epoch 4, batch 26400, loss[loss=0.2792, simple_loss=0.3238, pruned_loss=0.1173, over 21865.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3366, pruned_loss=0.103, over 4290730.26 frames. ], batch size: 102, lr: 7.58e-03, grad_scale: 32.0 2023-06-20 12:02:07,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=707304.0, ans=0.125 2023-06-20 12:02:22,701 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:03:12,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=707484.0, ans=0.0 2023-06-20 12:03:49,523 INFO [train.py:996] (2/4) Epoch 4, batch 26450, loss[loss=0.2987, simple_loss=0.4004, pruned_loss=0.09851, over 21654.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3375, pruned_loss=0.103, over 4284716.55 frames. ], batch size: 414, lr: 7.58e-03, grad_scale: 32.0 2023-06-20 12:04:21,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.976e+02 3.665e+02 4.778e+02 9.045e+02, threshold=7.330e+02, percent-clipped=3.0 2023-06-20 12:04:44,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=707724.0, ans=0.0 2023-06-20 12:05:09,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=707784.0, ans=0.0 2023-06-20 12:05:28,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=707904.0, ans=0.0 2023-06-20 12:05:35,280 INFO [train.py:996] (2/4) Epoch 4, batch 26500, loss[loss=0.268, simple_loss=0.3488, pruned_loss=0.09356, over 21633.00 frames. ], tot_loss[loss=0.2713, simple_loss=0.3398, pruned_loss=0.1014, over 4270954.91 frames. 
], batch size: 389, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:06:09,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=707964.0, ans=0.0 2023-06-20 12:06:24,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=708024.0, ans=0.1 2023-06-20 12:07:09,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=708144.0, ans=0.2 2023-06-20 12:07:24,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=708144.0, ans=0.2 2023-06-20 12:07:32,531 INFO [train.py:996] (2/4) Epoch 4, batch 26550, loss[loss=0.2154, simple_loss=0.328, pruned_loss=0.0514, over 20800.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.3343, pruned_loss=0.09747, over 4257471.73 frames. ], batch size: 609, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:07:33,555 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-20 12:07:51,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=708264.0, ans=0.04949747468305833 2023-06-20 12:08:00,821 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 2.832e+02 3.306e+02 3.943e+02 6.835e+02, threshold=6.613e+02, percent-clipped=0.0 2023-06-20 12:08:01,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=708264.0, ans=0.125 2023-06-20 12:08:20,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=708324.0, ans=0.125 2023-06-20 12:08:37,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=708384.0, ans=10.0 2023-06-20 12:08:47,123 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:09:07,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708444.0, ans=0.1 2023-06-20 12:09:16,683 INFO [train.py:996] (2/4) Epoch 4, batch 26600, loss[loss=0.2405, simple_loss=0.3072, pruned_loss=0.08688, over 21182.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.333, pruned_loss=0.09406, over 4260788.03 frames. ], batch size: 159, lr: 7.58e-03, grad_scale: 16.0 2023-06-20 12:09:20,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=708504.0, ans=0.1 2023-06-20 12:09:25,833 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:09:42,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=708564.0, ans=0.125 2023-06-20 12:09:42,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=708564.0, ans=0.2 2023-06-20 12:10:19,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=708684.0, ans=0.1 2023-06-20 12:10:54,719 INFO [train.py:996] (2/4) Epoch 4, batch 26650, loss[loss=0.2144, simple_loss=0.2902, pruned_loss=0.06932, over 21769.00 frames. 
], tot_loss[loss=0.2569, simple_loss=0.3268, pruned_loss=0.09345, over 4255728.69 frames. ], batch size: 351, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:10:59,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=5.38 vs. limit=15.0 2023-06-20 12:11:18,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=708864.0, ans=0.95 2023-06-20 12:11:27,250 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.949e+02 2.914e+02 3.392e+02 4.054e+02 7.182e+02, threshold=6.783e+02, percent-clipped=2.0 2023-06-20 12:12:32,359 INFO [train.py:996] (2/4) Epoch 4, batch 26700, loss[loss=0.2444, simple_loss=0.301, pruned_loss=0.09389, over 21263.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3189, pruned_loss=0.0897, over 4263702.47 frames. ], batch size: 176, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:12:41,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=709104.0, ans=0.125 2023-06-20 12:12:46,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=709104.0, ans=0.125 2023-06-20 12:13:32,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=709284.0, ans=0.0 2023-06-20 12:13:47,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=709284.0, ans=0.125 2023-06-20 12:13:48,008 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-20 12:14:11,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.25 vs. limit=15.0 2023-06-20 12:14:11,948 INFO [train.py:996] (2/4) Epoch 4, batch 26750, loss[loss=0.247, simple_loss=0.3309, pruned_loss=0.08156, over 21719.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3198, pruned_loss=0.08973, over 4267506.01 frames. ], batch size: 351, lr: 7.57e-03, grad_scale: 16.0 2023-06-20 12:14:12,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=709404.0, ans=0.0 2023-06-20 12:14:35,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=709464.0, ans=0.0 2023-06-20 12:14:50,953 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.737e+02 2.700e+02 3.315e+02 4.013e+02 5.519e+02, threshold=6.631e+02, percent-clipped=0.0 2023-06-20 12:14:56,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.25 vs. limit=15.0 2023-06-20 12:14:56,863 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:15:19,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. 
limit=15.0 2023-06-20 12:15:48,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=709644.0, ans=0.2 2023-06-20 12:15:56,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-06-20 12:15:56,468 INFO [train.py:996] (2/4) Epoch 4, batch 26800, loss[loss=0.359, simple_loss=0.4007, pruned_loss=0.1587, over 21300.00 frames. ], tot_loss[loss=0.26, simple_loss=0.329, pruned_loss=0.09549, over 4273866.71 frames. ], batch size: 507, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:16:58,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=709824.0, ans=0.0 2023-06-20 12:17:21,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=709944.0, ans=0.5 2023-06-20 12:17:32,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=709944.0, ans=0.0 2023-06-20 12:17:36,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-20 12:17:43,552 INFO [train.py:996] (2/4) Epoch 4, batch 26850, loss[loss=0.2466, simple_loss=0.2985, pruned_loss=0.09738, over 21652.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3295, pruned_loss=0.09757, over 4278349.37 frames. ], batch size: 282, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:17:43,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=710004.0, ans=0.0 2023-06-20 12:18:09,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-20 12:18:21,740 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 2.953e+02 3.297e+02 3.985e+02 6.841e+02, threshold=6.593e+02, percent-clipped=1.0 2023-06-20 12:19:08,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=710244.0, ans=0.0 2023-06-20 12:19:13,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=710244.0, ans=0.125 2023-06-20 12:19:14,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=710244.0, ans=0.0 2023-06-20 12:19:14,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=710244.0, ans=0.07 2023-06-20 12:19:20,907 INFO [train.py:996] (2/4) Epoch 4, batch 26900, loss[loss=0.2164, simple_loss=0.2839, pruned_loss=0.07445, over 15762.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3215, pruned_loss=0.09633, over 4263102.56 frames. ], batch size: 63, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:19:30,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.77 vs. limit=8.0 2023-06-20 12:20:07,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. 
limit=15.0 2023-06-20 12:20:20,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=710424.0, ans=0.09899494936611666 2023-06-20 12:21:01,684 INFO [train.py:996] (2/4) Epoch 4, batch 26950, loss[loss=0.2375, simple_loss=0.2973, pruned_loss=0.08883, over 21678.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3198, pruned_loss=0.09628, over 4268298.14 frames. ], batch size: 299, lr: 7.57e-03, grad_scale: 32.0 2023-06-20 12:21:31,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=710664.0, ans=0.0 2023-06-20 12:21:34,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=710664.0, ans=0.1 2023-06-20 12:21:38,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.998e+02 3.332e+02 4.075e+02 6.086e+02, threshold=6.663e+02, percent-clipped=0.0 2023-06-20 12:21:41,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=710664.0, ans=0.125 2023-06-20 12:22:10,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=12.0 2023-06-20 12:22:14,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-06-20 12:22:24,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=710844.0, ans=0.05 2023-06-20 12:22:25,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=710844.0, ans=0.125 2023-06-20 12:22:45,207 INFO [train.py:996] (2/4) Epoch 4, batch 27000, loss[loss=0.2214, simple_loss=0.3061, pruned_loss=0.06834, over 21608.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3188, pruned_loss=0.09231, over 4270124.85 frames. ], batch size: 263, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:22:45,208 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 12:23:07,065 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2473, simple_loss=0.3466, pruned_loss=0.07399, over 1796401.00 frames. 2023-06-20 12:23:07,066 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 12:23:07,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=710904.0, ans=0.125 2023-06-20 12:24:26,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=711144.0, ans=0.2 2023-06-20 12:24:50,445 INFO [train.py:996] (2/4) Epoch 4, batch 27050, loss[loss=0.3583, simple_loss=0.3968, pruned_loss=0.1599, over 21651.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3213, pruned_loss=0.08858, over 4269571.32 frames. 
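The "Computing validation loss" block above is the periodic dev-set evaluation: it reports the per-frame validation loss over the full dev set and the peak GPU memory allocated so far. A small illustrative sketch of such a frame-weighted validation pass is below; compute_loss is an assumed helper returning a summed loss and a frame count, not a function taken from train.py.

    import torch

    def validate(model, dev_dl, compute_loss):
        # Frame-weighted average loss over the dev set, with the model in eval mode.
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in dev_dl:
                loss_sum, num_frames = compute_loss(model, batch)
                tot_loss += float(loss_sum)
                tot_frames += num_frames
        model.train()
        return tot_loss / max(tot_frames, 1)  # per-frame loss, as reported in the log
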
], batch size: 507, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:24:53,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=711204.0, ans=15.0 2023-06-20 12:25:05,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=711204.0, ans=0.125 2023-06-20 12:25:10,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=711264.0, ans=0.05 2023-06-20 12:25:14,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=711264.0, ans=0.1 2023-06-20 12:25:20,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=711264.0, ans=0.2 2023-06-20 12:25:23,367 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.528e+02 2.843e+02 3.432e+02 6.081e+02, threshold=5.686e+02, percent-clipped=0.0 2023-06-20 12:26:27,071 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:26:28,424 INFO [train.py:996] (2/4) Epoch 4, batch 27100, loss[loss=0.2587, simple_loss=0.3387, pruned_loss=0.08937, over 21344.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3247, pruned_loss=0.09044, over 4273047.74 frames. ], batch size: 159, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:27:13,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=711624.0, ans=0.2 2023-06-20 12:27:27,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=22.5 2023-06-20 12:27:40,521 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:27:43,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=711684.0, ans=0.1 2023-06-20 12:28:19,213 INFO [train.py:996] (2/4) Epoch 4, batch 27150, loss[loss=0.3285, simple_loss=0.416, pruned_loss=0.1205, over 21316.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.336, pruned_loss=0.09354, over 4277411.59 frames. ], batch size: 548, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:28:47,391 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.195e+02 3.755e+02 4.562e+02 7.359e+02, threshold=7.509e+02, percent-clipped=7.0 2023-06-20 12:29:33,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-20 12:29:34,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=712044.0, ans=0.125 2023-06-20 12:29:37,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=712044.0, ans=0.125 2023-06-20 12:29:57,451 INFO [train.py:996] (2/4) Epoch 4, batch 27200, loss[loss=0.3273, simple_loss=0.422, pruned_loss=0.1163, over 21276.00 frames. ], tot_loss[loss=0.2684, simple_loss=0.3437, pruned_loss=0.09652, over 4271820.57 frames. 
], batch size: 548, lr: 7.56e-03, grad_scale: 32.0 2023-06-20 12:30:33,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=712164.0, ans=0.5 2023-06-20 12:31:42,702 INFO [train.py:996] (2/4) Epoch 4, batch 27250, loss[loss=0.3221, simple_loss=0.3714, pruned_loss=0.1363, over 21255.00 frames. ], tot_loss[loss=0.2748, simple_loss=0.347, pruned_loss=0.1013, over 4268057.85 frames. ], batch size: 143, lr: 7.56e-03, grad_scale: 16.0 2023-06-20 12:32:18,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 3.455e+02 4.039e+02 4.883e+02 8.665e+02, threshold=8.078e+02, percent-clipped=1.0 2023-06-20 12:32:29,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=712524.0, ans=0.125 2023-06-20 12:33:11,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=712644.0, ans=0.0 2023-06-20 12:33:33,293 INFO [train.py:996] (2/4) Epoch 4, batch 27300, loss[loss=0.2382, simple_loss=0.2996, pruned_loss=0.08844, over 20089.00 frames. ], tot_loss[loss=0.2774, simple_loss=0.349, pruned_loss=0.1029, over 4271615.25 frames. ], batch size: 703, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:33:33,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=712704.0, ans=0.2 2023-06-20 12:34:18,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=712764.0, ans=0.125 2023-06-20 12:34:21,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=712824.0, ans=0.1 2023-06-20 12:34:40,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-20 12:35:16,751 INFO [train.py:996] (2/4) Epoch 4, batch 27350, loss[loss=0.3316, simple_loss=0.3998, pruned_loss=0.1317, over 21665.00 frames. ], tot_loss[loss=0.2798, simple_loss=0.352, pruned_loss=0.1038, over 4277567.44 frames. ], batch size: 414, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:35:32,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=22.5 2023-06-20 12:35:49,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=713064.0, ans=0.125 2023-06-20 12:36:00,979 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.787e+02 3.114e+02 3.820e+02 5.936e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-20 12:36:03,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=713124.0, ans=0.1 2023-06-20 12:36:49,796 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-20 12:36:57,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=713304.0, ans=0.2 2023-06-20 12:36:58,523 INFO [train.py:996] (2/4) Epoch 4, batch 27400, loss[loss=0.2259, simple_loss=0.2883, pruned_loss=0.08181, over 21837.00 frames. 
], tot_loss[loss=0.2768, simple_loss=0.3472, pruned_loss=0.1032, over 4276451.54 frames. ], batch size: 98, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:38:30,601 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:38:41,470 INFO [train.py:996] (2/4) Epoch 4, batch 27450, loss[loss=0.2723, simple_loss=0.356, pruned_loss=0.09426, over 21869.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.341, pruned_loss=0.1018, over 4281223.12 frames. ], batch size: 317, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:39:19,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=713664.0, ans=0.0 2023-06-20 12:39:26,435 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.637e+02 2.964e+02 3.334e+02 5.036e+02, threshold=5.928e+02, percent-clipped=0.0 2023-06-20 12:40:23,688 INFO [train.py:996] (2/4) Epoch 4, batch 27500, loss[loss=0.2735, simple_loss=0.3416, pruned_loss=0.1027, over 21765.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.34, pruned_loss=0.1023, over 4287422.43 frames. ], batch size: 332, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:40:33,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=713904.0, ans=0.1 2023-06-20 12:40:50,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=713964.0, ans=0.05 2023-06-20 12:41:21,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=714024.0, ans=0.1 2023-06-20 12:42:02,879 INFO [train.py:996] (2/4) Epoch 4, batch 27550, loss[loss=0.2889, simple_loss=0.3652, pruned_loss=0.1063, over 19995.00 frames. ], tot_loss[loss=0.2668, simple_loss=0.3348, pruned_loss=0.09937, over 4287789.83 frames. ], batch size: 702, lr: 7.55e-03, grad_scale: 16.0 2023-06-20 12:42:09,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-20 12:42:09,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.07 vs. limit=22.5 2023-06-20 12:42:49,321 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 3.202e+02 3.879e+02 5.081e+02 9.458e+02, threshold=7.759e+02, percent-clipped=14.0 2023-06-20 12:42:51,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=714324.0, ans=0.125 2023-06-20 12:43:19,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=714384.0, ans=0.09899494936611666 2023-06-20 12:43:34,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=714444.0, ans=0.125 2023-06-20 12:43:50,137 INFO [train.py:996] (2/4) Epoch 4, batch 27600, loss[loss=0.2407, simple_loss=0.2998, pruned_loss=0.09085, over 21687.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3269, pruned_loss=0.09764, over 4291468.91 frames. 
], batch size: 282, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:44:22,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=714564.0, ans=0.125 2023-06-20 12:44:22,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=714564.0, ans=0.0 2023-06-20 12:44:40,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=714624.0, ans=0.0 2023-06-20 12:44:40,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=714624.0, ans=6.0 2023-06-20 12:45:09,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=714744.0, ans=0.125 2023-06-20 12:45:17,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=714744.0, ans=0.125 2023-06-20 12:45:24,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-20 12:45:26,209 INFO [train.py:996] (2/4) Epoch 4, batch 27650, loss[loss=0.2437, simple_loss=0.3171, pruned_loss=0.08513, over 21283.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3225, pruned_loss=0.0969, over 4282505.83 frames. ], batch size: 159, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:45:29,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=714804.0, ans=0.125 2023-06-20 12:45:45,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=714864.0, ans=0.05 2023-06-20 12:45:54,610 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=2.590e-03 2023-06-20 12:46:05,564 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.142e+02 2.761e+02 3.105e+02 3.536e+02 5.675e+02, threshold=6.210e+02, percent-clipped=0.0 2023-06-20 12:46:17,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0 2023-06-20 12:46:44,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=714984.0, ans=0.2 2023-06-20 12:47:03,681 INFO [train.py:996] (2/4) Epoch 4, batch 27700, loss[loss=0.2244, simple_loss=0.3068, pruned_loss=0.07097, over 21614.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3217, pruned_loss=0.09415, over 4279010.76 frames. ], batch size: 263, lr: 7.54e-03, grad_scale: 32.0 2023-06-20 12:47:32,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=715104.0, ans=0.0 2023-06-20 12:47:52,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=715224.0, ans=0.125 2023-06-20 12:48:04,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=715224.0, ans=0.125 2023-06-20 12:48:51,172 INFO [train.py:996] (2/4) Epoch 4, batch 27750, loss[loss=0.2286, simple_loss=0.2929, pruned_loss=0.08214, over 21370.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3243, pruned_loss=0.09386, over 4282049.09 frames. 
], batch size: 131, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:48:51,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=715404.0, ans=0.1 2023-06-20 12:49:33,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.146e+02 3.026e+02 3.500e+02 4.221e+02 6.656e+02, threshold=7.000e+02, percent-clipped=2.0 2023-06-20 12:50:10,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-20 12:50:12,293 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 12:50:13,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=715644.0, ans=0.2 2023-06-20 12:50:29,395 INFO [train.py:996] (2/4) Epoch 4, batch 27800, loss[loss=0.28, simple_loss=0.3294, pruned_loss=0.1153, over 21365.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.322, pruned_loss=0.09364, over 4280351.70 frames. ], batch size: 159, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:51:49,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.90 vs. limit=22.5 2023-06-20 12:51:55,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=715944.0, ans=0.125 2023-06-20 12:52:04,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=715944.0, ans=0.125 2023-06-20 12:52:17,418 INFO [train.py:996] (2/4) Epoch 4, batch 27850, loss[loss=0.2801, simple_loss=0.3592, pruned_loss=0.1005, over 21026.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3223, pruned_loss=0.09524, over 4288473.89 frames. 
], batch size: 607, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:52:35,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=716004.0, ans=0.1 2023-06-20 12:52:37,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=716004.0, ans=0.09899494936611666 2023-06-20 12:52:38,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=716004.0, ans=0.125 2023-06-20 12:52:43,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=716064.0, ans=0.125 2023-06-20 12:52:55,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=716064.0, ans=0.0 2023-06-20 12:53:00,727 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.961e+02 3.530e+02 4.217e+02 1.068e+03, threshold=7.060e+02, percent-clipped=1.0 2023-06-20 12:53:25,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=716184.0, ans=0.1 2023-06-20 12:53:26,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=716184.0, ans=0.125 2023-06-20 12:54:11,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=716304.0, ans=0.125 2023-06-20 12:54:13,222 INFO [train.py:996] (2/4) Epoch 4, batch 27900, loss[loss=0.2265, simple_loss=0.3063, pruned_loss=0.07331, over 21370.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3313, pruned_loss=0.09545, over 4281497.07 frames. ], batch size: 194, lr: 7.54e-03, grad_scale: 16.0 2023-06-20 12:54:55,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=716424.0, ans=0.0 2023-06-20 12:55:51,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-20 12:55:52,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=15.0 2023-06-20 12:55:56,706 INFO [train.py:996] (2/4) Epoch 4, batch 27950, loss[loss=0.2005, simple_loss=0.2893, pruned_loss=0.05592, over 21585.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3309, pruned_loss=0.09207, over 4277846.48 frames. ], batch size: 230, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 12:56:29,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=716664.0, ans=0.125 2023-06-20 12:56:33,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.859e+02 3.547e+02 4.362e+02 7.820e+02, threshold=7.095e+02, percent-clipped=2.0 2023-06-20 12:57:25,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=716844.0, ans=0.04949747468305833 2023-06-20 12:57:26,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=716844.0, ans=0.0 2023-06-20 12:57:34,428 INFO [train.py:996] (2/4) Epoch 4, batch 28000, loss[loss=0.2188, simple_loss=0.305, pruned_loss=0.06634, over 21855.00 frames. 
], tot_loss[loss=0.2542, simple_loss=0.3282, pruned_loss=0.09011, over 4282327.36 frames. ], batch size: 351, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 12:57:38,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=716904.0, ans=0.0 2023-06-20 12:58:24,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.74 vs. limit=22.5 2023-06-20 12:59:17,607 INFO [train.py:996] (2/4) Epoch 4, batch 28050, loss[loss=0.3635, simple_loss=0.4161, pruned_loss=0.1555, over 21604.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.327, pruned_loss=0.09277, over 4274422.51 frames. ], batch size: 508, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 12:59:35,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=717204.0, ans=0.05 2023-06-20 12:59:54,023 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.994e+02 2.872e+02 3.334e+02 4.118e+02 8.421e+02, threshold=6.667e+02, percent-clipped=2.0 2023-06-20 13:00:21,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=22.5 2023-06-20 13:00:48,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=717444.0, ans=0.125 2023-06-20 13:00:48,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=717444.0, ans=0.125 2023-06-20 13:00:49,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=717444.0, ans=0.125 2023-06-20 13:00:51,422 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:01:00,678 INFO [train.py:996] (2/4) Epoch 4, batch 28100, loss[loss=0.2668, simple_loss=0.319, pruned_loss=0.1072, over 21759.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.326, pruned_loss=0.09289, over 4272542.93 frames. ], batch size: 351, lr: 7.53e-03, grad_scale: 32.0 2023-06-20 13:01:27,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=717564.0, ans=0.07 2023-06-20 13:01:38,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=717624.0, ans=0.125 2023-06-20 13:02:12,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=717684.0, ans=0.125 2023-06-20 13:02:21,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=717684.0, ans=0.125 2023-06-20 13:02:41,118 INFO [train.py:996] (2/4) Epoch 4, batch 28150, loss[loss=0.275, simple_loss=0.3184, pruned_loss=0.1158, over 21948.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3205, pruned_loss=0.09275, over 4273476.98 frames. 
], batch size: 113, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 13:02:56,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=717804.0, ans=0.0 2023-06-20 13:03:08,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=717864.0, ans=0.04949747468305833 2023-06-20 13:03:23,531 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.192e+02 4.033e+02 4.957e+02 1.192e+03, threshold=8.065e+02, percent-clipped=8.0 2023-06-20 13:03:24,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.81 vs. limit=15.0 2023-06-20 13:03:58,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-20 13:03:59,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=717984.0, ans=0.125 2023-06-20 13:04:01,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=717984.0, ans=0.0 2023-06-20 13:04:16,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=718044.0, ans=0.125 2023-06-20 13:04:18,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=718044.0, ans=0.2 2023-06-20 13:04:18,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=718044.0, ans=0.1 2023-06-20 13:04:27,757 INFO [train.py:996] (2/4) Epoch 4, batch 28200, loss[loss=0.2913, simple_loss=0.3567, pruned_loss=0.113, over 21765.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3205, pruned_loss=0.09439, over 4274896.23 frames. ], batch size: 124, lr: 7.53e-03, grad_scale: 16.0 2023-06-20 13:05:08,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=718164.0, ans=0.2 2023-06-20 13:06:06,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.02 vs. limit=10.0 2023-06-20 13:06:10,223 INFO [train.py:996] (2/4) Epoch 4, batch 28250, loss[loss=0.2758, simple_loss=0.3212, pruned_loss=0.1152, over 21650.00 frames. ], tot_loss[loss=0.259, simple_loss=0.323, pruned_loss=0.09751, over 4271073.77 frames. 
], batch size: 298, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:06:53,560 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 3.131e+02 3.683e+02 4.386e+02 7.452e+02, threshold=7.367e+02, percent-clipped=0.0 2023-06-20 13:07:07,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=718524.0, ans=0.125 2023-06-20 13:07:28,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=718584.0, ans=0.1 2023-06-20 13:07:31,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=718644.0, ans=10.0 2023-06-20 13:07:53,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.50 vs. limit=15.0 2023-06-20 13:07:54,538 INFO [train.py:996] (2/4) Epoch 4, batch 28300, loss[loss=0.1814, simple_loss=0.2585, pruned_loss=0.05214, over 21237.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3192, pruned_loss=0.09466, over 4269187.75 frames. ], batch size: 159, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:08:00,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=718704.0, ans=0.125 2023-06-20 13:08:38,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=718764.0, ans=0.125 2023-06-20 13:08:53,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=718824.0, ans=0.0 2023-06-20 13:09:43,137 INFO [train.py:996] (2/4) Epoch 4, batch 28350, loss[loss=0.2538, simple_loss=0.3162, pruned_loss=0.09573, over 21326.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.317, pruned_loss=0.08855, over 4263470.93 frames. ], batch size: 471, lr: 7.52e-03, grad_scale: 16.0 2023-06-20 13:09:53,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=719004.0, ans=0.2 2023-06-20 13:09:53,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=719004.0, ans=0.125 2023-06-20 13:10:19,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=719064.0, ans=0.125 2023-06-20 13:10:26,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.796e+02 2.689e+02 3.125e+02 3.923e+02 6.563e+02, threshold=6.250e+02, percent-clipped=0.0 2023-06-20 13:10:27,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=719124.0, ans=0.0 2023-06-20 13:11:30,291 INFO [train.py:996] (2/4) Epoch 4, batch 28400, loss[loss=0.3046, simple_loss=0.3554, pruned_loss=0.1269, over 21183.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3127, pruned_loss=0.08807, over 4271183.14 frames. ], batch size: 143, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:11:40,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0 2023-06-20 13:11:58,791 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.96 vs. 
limit=15.0 2023-06-20 13:11:59,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=719364.0, ans=0.125 2023-06-20 13:12:40,747 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.068e-02 2023-06-20 13:12:44,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=22.5 2023-06-20 13:13:07,871 INFO [train.py:996] (2/4) Epoch 4, batch 28450, loss[loss=0.2916, simple_loss=0.3536, pruned_loss=0.1148, over 21799.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3195, pruned_loss=0.09356, over 4278541.99 frames. ], batch size: 112, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:13:34,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=719664.0, ans=0.0 2023-06-20 13:13:36,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=719664.0, ans=0.0 2023-06-20 13:13:50,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.987e+02 3.410e+02 4.090e+02 6.526e+02, threshold=6.821e+02, percent-clipped=2.0 2023-06-20 13:13:51,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=719724.0, ans=0.1 2023-06-20 13:14:30,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-20 13:14:50,132 INFO [train.py:996] (2/4) Epoch 4, batch 28500, loss[loss=0.3188, simple_loss=0.3692, pruned_loss=0.1342, over 21225.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3227, pruned_loss=0.09717, over 4286508.80 frames. ], batch size: 143, lr: 7.52e-03, grad_scale: 32.0 2023-06-20 13:15:12,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=15.0 2023-06-20 13:15:48,007 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-20 13:16:01,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=720084.0, ans=0.125 2023-06-20 13:16:41,765 INFO [train.py:996] (2/4) Epoch 4, batch 28550, loss[loss=0.2863, simple_loss=0.3834, pruned_loss=0.0946, over 21870.00 frames. ], tot_loss[loss=0.2646, simple_loss=0.331, pruned_loss=0.09909, over 4285663.35 frames. ], batch size: 372, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:16:57,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=720264.0, ans=0.125 2023-06-20 13:17:20,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 3.197e+02 3.692e+02 4.478e+02 6.914e+02, threshold=7.384e+02, percent-clipped=1.0 2023-06-20 13:17:42,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=720384.0, ans=0.0 2023-06-20 13:17:58,207 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.48 vs. 
limit=15.0 2023-06-20 13:18:25,251 INFO [train.py:996] (2/4) Epoch 4, batch 28600, loss[loss=0.2789, simple_loss=0.3517, pruned_loss=0.103, over 21664.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3375, pruned_loss=0.1012, over 4278662.51 frames. ], batch size: 351, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:20:01,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=720744.0, ans=0.125 2023-06-20 13:20:07,390 INFO [train.py:996] (2/4) Epoch 4, batch 28650, loss[loss=0.2226, simple_loss=0.2764, pruned_loss=0.08435, over 21438.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3305, pruned_loss=0.09957, over 4275916.61 frames. ], batch size: 195, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:20:09,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=720804.0, ans=0.2 2023-06-20 13:20:46,274 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.224e+02 2.893e+02 3.240e+02 3.665e+02 6.143e+02, threshold=6.480e+02, percent-clipped=0.0 2023-06-20 13:21:03,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=720924.0, ans=0.0 2023-06-20 13:21:06,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=720924.0, ans=0.0 2023-06-20 13:21:10,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0 2023-06-20 13:21:31,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=720984.0, ans=0.125 2023-06-20 13:21:34,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=721044.0, ans=0.0 2023-06-20 13:21:51,605 INFO [train.py:996] (2/4) Epoch 4, batch 28700, loss[loss=0.3061, simple_loss=0.367, pruned_loss=0.1226, over 21923.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3309, pruned_loss=0.1017, over 4278127.31 frames. ], batch size: 372, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:21:53,093 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.17 vs. limit=22.5 2023-06-20 13:21:55,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=721104.0, ans=0.0 2023-06-20 13:22:24,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=721164.0, ans=0.125 2023-06-20 13:22:45,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=721224.0, ans=0.0 2023-06-20 13:22:56,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=721284.0, ans=0.125 2023-06-20 13:22:58,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-20 13:23:10,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. 
limit=12.0 2023-06-20 13:23:11,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=721284.0, ans=0.0 2023-06-20 13:23:26,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=721344.0, ans=0.1 2023-06-20 13:23:27,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=721344.0, ans=0.0 2023-06-20 13:23:32,191 INFO [train.py:996] (2/4) Epoch 4, batch 28750, loss[loss=0.2615, simple_loss=0.3167, pruned_loss=0.1031, over 21353.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3301, pruned_loss=0.1021, over 4266753.84 frames. ], batch size: 176, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:23:32,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=721404.0, ans=0.1 2023-06-20 13:23:36,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=15.0 2023-06-20 13:24:15,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.193e+02 3.841e+02 4.687e+02 9.363e+02, threshold=7.682e+02, percent-clipped=10.0 2023-06-20 13:24:46,448 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:24:53,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=721644.0, ans=0.125 2023-06-20 13:24:57,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=721644.0, ans=0.1 2023-06-20 13:25:12,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=721644.0, ans=0.0 2023-06-20 13:25:15,128 INFO [train.py:996] (2/4) Epoch 4, batch 28800, loss[loss=0.2679, simple_loss=0.3371, pruned_loss=0.09935, over 21924.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3338, pruned_loss=0.1023, over 4275385.66 frames. ], batch size: 316, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:25:21,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=721704.0, ans=0.125 2023-06-20 13:26:33,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=721884.0, ans=0.1 2023-06-20 13:26:43,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=721944.0, ans=0.04949747468305833 2023-06-20 13:26:58,369 INFO [train.py:996] (2/4) Epoch 4, batch 28850, loss[loss=0.271, simple_loss=0.3338, pruned_loss=0.104, over 21466.00 frames. ], tot_loss[loss=0.2723, simple_loss=0.3355, pruned_loss=0.1046, over 4284720.71 frames. 
], batch size: 131, lr: 7.51e-03, grad_scale: 32.0 2023-06-20 13:27:17,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=722004.0, ans=0.125 2023-06-20 13:27:36,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 3.137e+02 3.550e+02 4.295e+02 6.856e+02, threshold=7.100e+02, percent-clipped=0.0 2023-06-20 13:27:39,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=722124.0, ans=0.0 2023-06-20 13:27:52,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=722124.0, ans=0.0 2023-06-20 13:28:30,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=722244.0, ans=0.125 2023-06-20 13:28:36,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=722304.0, ans=0.0 2023-06-20 13:28:42,258 INFO [train.py:996] (2/4) Epoch 4, batch 28900, loss[loss=0.2808, simple_loss=0.3298, pruned_loss=0.1159, over 21451.00 frames. ], tot_loss[loss=0.275, simple_loss=0.3378, pruned_loss=0.1061, over 4280924.40 frames. ], batch size: 194, lr: 7.50e-03, grad_scale: 32.0 2023-06-20 13:28:58,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=15.0 2023-06-20 13:29:18,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=722364.0, ans=0.125 2023-06-20 13:29:42,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=722484.0, ans=0.0 2023-06-20 13:29:42,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=722484.0, ans=0.0 2023-06-20 13:30:29,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=722604.0, ans=0.1 2023-06-20 13:30:30,840 INFO [train.py:996] (2/4) Epoch 4, batch 28950, loss[loss=0.2518, simple_loss=0.3693, pruned_loss=0.06717, over 21234.00 frames. ], tot_loss[loss=0.2741, simple_loss=0.338, pruned_loss=0.1051, over 4276018.22 frames. ], batch size: 548, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:31:11,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.090e+02 3.092e+02 3.646e+02 4.374e+02 7.156e+02, threshold=7.293e+02, percent-clipped=1.0 2023-06-20 13:31:29,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=722784.0, ans=0.2 2023-06-20 13:31:50,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.64 vs. limit=15.0 2023-06-20 13:32:13,758 INFO [train.py:996] (2/4) Epoch 4, batch 29000, loss[loss=0.2563, simple_loss=0.3374, pruned_loss=0.08759, over 20781.00 frames. ], tot_loss[loss=0.2734, simple_loss=0.3402, pruned_loss=0.1033, over 4269835.24 frames. 
], batch size: 607, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:32:14,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=722904.0, ans=0.07 2023-06-20 13:32:14,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=722904.0, ans=0.05 2023-06-20 13:32:26,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-20 13:32:27,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-20 13:32:29,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=722964.0, ans=0.0 2023-06-20 13:32:51,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-20 13:32:59,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=723024.0, ans=0.125 2023-06-20 13:33:22,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.00 vs. limit=6.0 2023-06-20 13:33:50,990 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-06-20 13:33:56,572 INFO [train.py:996] (2/4) Epoch 4, batch 29050, loss[loss=0.2749, simple_loss=0.3317, pruned_loss=0.1091, over 21767.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.3387, pruned_loss=0.1039, over 4272066.99 frames. ], batch size: 112, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:34:40,195 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.155e+02 2.777e+02 3.114e+02 3.614e+02 7.723e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-20 13:34:49,907 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-20 13:35:21,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=723444.0, ans=0.02 2023-06-20 13:35:37,379 INFO [train.py:996] (2/4) Epoch 4, batch 29100, loss[loss=0.2046, simple_loss=0.2657, pruned_loss=0.0718, over 21560.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3305, pruned_loss=0.1005, over 4274891.74 frames. ], batch size: 196, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:36:09,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=723564.0, ans=0.2 2023-06-20 13:36:17,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.91 vs. limit=10.0 2023-06-20 13:36:34,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=723624.0, ans=0.0 2023-06-20 13:36:46,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.38 vs. 
limit=6.0 2023-06-20 13:37:00,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=723684.0, ans=0.125 2023-06-20 13:37:01,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=723684.0, ans=10.0 2023-06-20 13:37:19,156 INFO [train.py:996] (2/4) Epoch 4, batch 29150, loss[loss=0.2443, simple_loss=0.3057, pruned_loss=0.09146, over 21209.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3281, pruned_loss=0.09859, over 4275689.10 frames. ], batch size: 176, lr: 7.50e-03, grad_scale: 16.0 2023-06-20 13:37:27,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=723804.0, ans=0.125 2023-06-20 13:37:46,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=723864.0, ans=0.125 2023-06-20 13:38:03,276 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.983e+02 3.477e+02 4.696e+02 8.127e+02, threshold=6.954e+02, percent-clipped=11.0 2023-06-20 13:39:00,569 INFO [train.py:996] (2/4) Epoch 4, batch 29200, loss[loss=0.2045, simple_loss=0.2597, pruned_loss=0.07468, over 20706.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3247, pruned_loss=0.09773, over 4267827.49 frames. ], batch size: 608, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:39:19,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=724104.0, ans=0.125 2023-06-20 13:39:36,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=724164.0, ans=0.0 2023-06-20 13:39:36,717 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:39:46,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=724224.0, ans=0.1 2023-06-20 13:40:48,589 INFO [train.py:996] (2/4) Epoch 4, batch 29250, loss[loss=0.2437, simple_loss=0.3287, pruned_loss=0.07935, over 21622.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3224, pruned_loss=0.09436, over 4264796.55 frames. ], batch size: 263, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:41:18,325 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:41:21,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=724464.0, ans=0.125 2023-06-20 13:41:32,106 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.828e+02 3.224e+02 4.215e+02 8.591e+02, threshold=6.449e+02, percent-clipped=2.0 2023-06-20 13:42:04,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=724584.0, ans=0.125 2023-06-20 13:42:29,527 INFO [train.py:996] (2/4) Epoch 4, batch 29300, loss[loss=0.2352, simple_loss=0.2866, pruned_loss=0.09196, over 21293.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3238, pruned_loss=0.09328, over 4273631.42 frames. 
], batch size: 176, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:42:55,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=724764.0, ans=0.0 2023-06-20 13:43:02,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=724764.0, ans=0.0 2023-06-20 13:43:04,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=724764.0, ans=0.0 2023-06-20 13:43:21,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=724824.0, ans=0.125 2023-06-20 13:43:56,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=724944.0, ans=0.95 2023-06-20 13:44:04,756 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:44:16,739 INFO [train.py:996] (2/4) Epoch 4, batch 29350, loss[loss=0.2844, simple_loss=0.3677, pruned_loss=0.1006, over 21681.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3211, pruned_loss=0.09289, over 4274400.78 frames. ], batch size: 391, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:44:35,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=725064.0, ans=0.1 2023-06-20 13:44:38,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=725064.0, ans=0.125 2023-06-20 13:45:02,534 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 2.825e+02 3.145e+02 3.740e+02 7.269e+02, threshold=6.289e+02, percent-clipped=1.0 2023-06-20 13:45:38,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=725244.0, ans=0.125 2023-06-20 13:45:58,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.82 vs. limit=6.0 2023-06-20 13:45:59,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.16 vs. limit=12.0 2023-06-20 13:46:00,226 INFO [train.py:996] (2/4) Epoch 4, batch 29400, loss[loss=0.2217, simple_loss=0.305, pruned_loss=0.06919, over 21727.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3186, pruned_loss=0.09056, over 4265456.72 frames. ], batch size: 391, lr: 7.49e-03, grad_scale: 32.0 2023-06-20 13:46:04,276 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:46:39,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.01 vs. limit=22.5 2023-06-20 13:46:53,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=725424.0, ans=0.05 2023-06-20 13:46:56,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=725424.0, ans=0.95 2023-06-20 13:47:02,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. 
limit=22.5 2023-06-20 13:47:29,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=725544.0, ans=0.1 2023-06-20 13:47:42,458 INFO [train.py:996] (2/4) Epoch 4, batch 29450, loss[loss=0.3384, simple_loss=0.4172, pruned_loss=0.1298, over 21805.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3189, pruned_loss=0.09059, over 4264144.16 frames. ], batch size: 124, lr: 7.49e-03, grad_scale: 16.0 2023-06-20 13:48:05,285 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:48:28,459 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.191e+02 3.113e+02 3.518e+02 4.347e+02 7.926e+02, threshold=7.036e+02, percent-clipped=6.0 2023-06-20 13:48:55,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=725784.0, ans=0.125 2023-06-20 13:49:14,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.71 vs. limit=15.0 2023-06-20 13:49:22,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=725904.0, ans=0.0 2023-06-20 13:49:24,097 INFO [train.py:996] (2/4) Epoch 4, batch 29500, loss[loss=0.3147, simple_loss=0.3563, pruned_loss=0.1365, over 21762.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3233, pruned_loss=0.09449, over 4272335.94 frames. ], batch size: 508, lr: 7.49e-03, grad_scale: 16.0 2023-06-20 13:49:48,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=725964.0, ans=0.125 2023-06-20 13:50:01,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=725964.0, ans=0.05 2023-06-20 13:50:15,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=726024.0, ans=0.035 2023-06-20 13:50:21,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-20 13:50:42,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.97 vs. limit=22.5 2023-06-20 13:51:01,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=726144.0, ans=0.015 2023-06-20 13:51:04,358 INFO [train.py:996] (2/4) Epoch 4, batch 29550, loss[loss=0.2474, simple_loss=0.3036, pruned_loss=0.09564, over 21552.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3219, pruned_loss=0.09557, over 4282198.67 frames. 
], batch size: 548, lr: 7.48e-03, grad_scale: 16.0 2023-06-20 13:51:50,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.392e+02 2.813e+02 3.221e+02 3.823e+02 6.296e+02, threshold=6.442e+02, percent-clipped=0.0 2023-06-20 13:52:28,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=726444.0, ans=0.125 2023-06-20 13:52:50,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=726504.0, ans=0.04949747468305833 2023-06-20 13:52:51,281 INFO [train.py:996] (2/4) Epoch 4, batch 29600, loss[loss=0.2545, simple_loss=0.3363, pruned_loss=0.08636, over 21442.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3293, pruned_loss=0.09824, over 4285070.56 frames. ], batch size: 211, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:53:22,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=726564.0, ans=0.125 2023-06-20 13:53:31,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=726624.0, ans=0.125 2023-06-20 13:54:00,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=726684.0, ans=0.0 2023-06-20 13:54:11,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=726744.0, ans=0.125 2023-06-20 13:54:33,061 INFO [train.py:996] (2/4) Epoch 4, batch 29650, loss[loss=0.2108, simple_loss=0.2834, pruned_loss=0.06912, over 21879.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3275, pruned_loss=0.09548, over 4275637.45 frames. ], batch size: 316, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:54:36,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=726804.0, ans=0.125 2023-06-20 13:54:39,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=726804.0, ans=0.1 2023-06-20 13:55:13,477 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 3.008e+02 3.849e+02 5.213e+02 1.335e+03, threshold=7.697e+02, percent-clipped=14.0 2023-06-20 13:55:14,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=726924.0, ans=0.125 2023-06-20 13:55:27,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=726924.0, ans=0.2 2023-06-20 13:55:27,660 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:55:52,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=727044.0, ans=0.125 2023-06-20 13:56:00,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=727044.0, ans=0.125 2023-06-20 13:56:14,503 INFO [train.py:996] (2/4) Epoch 4, batch 29700, loss[loss=0.2645, simple_loss=0.3432, pruned_loss=0.09289, over 21129.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3286, pruned_loss=0.09467, over 4275847.46 frames. 
], batch size: 143, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:56:41,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=727164.0, ans=0.05 2023-06-20 13:56:42,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=727164.0, ans=0.125 2023-06-20 13:57:19,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=727284.0, ans=0.1 2023-06-20 13:57:19,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=727284.0, ans=0.0 2023-06-20 13:57:30,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=727284.0, ans=0.125 2023-06-20 13:57:35,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=727344.0, ans=0.125 2023-06-20 13:57:56,356 INFO [train.py:996] (2/4) Epoch 4, batch 29750, loss[loss=0.2292, simple_loss=0.3016, pruned_loss=0.0784, over 21458.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3351, pruned_loss=0.09525, over 4276317.26 frames. ], batch size: 131, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 13:58:13,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=727404.0, ans=0.125 2023-06-20 13:58:24,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=727464.0, ans=0.2 2023-06-20 13:58:36,375 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 2.793e+02 3.232e+02 3.849e+02 7.208e+02, threshold=6.464e+02, percent-clipped=0.0 2023-06-20 13:58:59,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-20 13:59:15,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=727644.0, ans=0.1 2023-06-20 13:59:21,387 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 13:59:36,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-20 13:59:36,833 INFO [train.py:996] (2/4) Epoch 4, batch 29800, loss[loss=0.2609, simple_loss=0.3266, pruned_loss=0.0976, over 21856.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.336, pruned_loss=0.09545, over 4275765.57 frames. ], batch size: 298, lr: 7.48e-03, grad_scale: 32.0 2023-06-20 14:00:08,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=727764.0, ans=0.125 2023-06-20 14:00:52,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=727884.0, ans=0.125 2023-06-20 14:01:02,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=727944.0, ans=0.125 2023-06-20 14:01:23,941 INFO [train.py:996] (2/4) Epoch 4, batch 29850, loss[loss=0.2214, simple_loss=0.2894, pruned_loss=0.07673, over 21251.00 frames. 
], tot_loss[loss=0.2585, simple_loss=0.3312, pruned_loss=0.09291, over 4280044.03 frames. ], batch size: 608, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:01:34,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=728004.0, ans=0.07 2023-06-20 14:01:42,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=728064.0, ans=0.0 2023-06-20 14:02:05,838 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.011e+02 2.765e+02 3.288e+02 3.804e+02 6.956e+02, threshold=6.577e+02, percent-clipped=2.0 2023-06-20 14:02:17,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=728124.0, ans=0.125 2023-06-20 14:02:35,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=728184.0, ans=0.125 2023-06-20 14:03:06,621 INFO [train.py:996] (2/4) Epoch 4, batch 29900, loss[loss=0.2708, simple_loss=0.3348, pruned_loss=0.1034, over 21374.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3281, pruned_loss=0.09396, over 4275482.83 frames. ], batch size: 143, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:03:40,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=728364.0, ans=0.125 2023-06-20 14:03:44,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-20 14:04:30,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=728544.0, ans=0.125 2023-06-20 14:04:50,352 INFO [train.py:996] (2/4) Epoch 4, batch 29950, loss[loss=0.2765, simple_loss=0.3496, pruned_loss=0.1017, over 21833.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3333, pruned_loss=0.09921, over 4270508.68 frames. ], batch size: 124, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:04:55,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=728604.0, ans=0.125 2023-06-20 14:05:36,643 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.386e+02 3.024e+02 3.410e+02 4.010e+02 6.604e+02, threshold=6.821e+02, percent-clipped=1.0 2023-06-20 14:06:33,135 INFO [train.py:996] (2/4) Epoch 4, batch 30000, loss[loss=0.2295, simple_loss=0.3217, pruned_loss=0.06871, over 21715.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3372, pruned_loss=0.1004, over 4271865.41 frames. ], batch size: 247, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:06:33,136 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 14:06:55,182 INFO [train.py:1028] (2/4) Epoch 4, validation: loss=0.2513, simple_loss=0.3514, pruned_loss=0.07557, over 1796401.00 frames. 2023-06-20 14:06:55,183 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 14:07:09,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=728904.0, ans=0.0 2023-06-20 14:07:17,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.30 vs. 
limit=22.5 2023-06-20 14:07:24,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=728964.0, ans=22.5 2023-06-20 14:08:09,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=729084.0, ans=0.04949747468305833 2023-06-20 14:08:25,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=729144.0, ans=0.0 2023-06-20 14:08:44,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=729144.0, ans=0.0 2023-06-20 14:08:48,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.56 vs. limit=6.0 2023-06-20 14:08:48,668 INFO [train.py:996] (2/4) Epoch 4, batch 30050, loss[loss=0.2313, simple_loss=0.3368, pruned_loss=0.06288, over 21270.00 frames. ], tot_loss[loss=0.2676, simple_loss=0.34, pruned_loss=0.09762, over 4265066.23 frames. ], batch size: 548, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:08:57,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=729204.0, ans=0.2 2023-06-20 14:09:10,969 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.84 vs. limit=15.0 2023-06-20 14:09:34,092 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.923e+02 2.668e+02 3.143e+02 3.876e+02 8.051e+02, threshold=6.286e+02, percent-clipped=2.0 2023-06-20 14:09:59,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=729384.0, ans=0.125 2023-06-20 14:10:23,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=729444.0, ans=0.07 2023-06-20 14:10:30,491 INFO [train.py:996] (2/4) Epoch 4, batch 30100, loss[loss=0.2899, simple_loss=0.4055, pruned_loss=0.08718, over 21229.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3388, pruned_loss=0.0969, over 4260342.15 frames. ], batch size: 549, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:11:19,411 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-20 14:11:32,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=729624.0, ans=0.2 2023-06-20 14:11:36,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=729684.0, ans=0.125 2023-06-20 14:11:54,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=729744.0, ans=0.125 2023-06-20 14:12:07,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-06-20 14:12:13,038 INFO [train.py:996] (2/4) Epoch 4, batch 30150, loss[loss=0.3494, simple_loss=0.3862, pruned_loss=0.1563, over 21423.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3349, pruned_loss=0.09837, over 4269624.83 frames. 
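The very frequent [scaling.py:182] ScheduledFloat lines log the current value (ans=...) of hyperparameters that are scheduled as a function of batch_count: dropout probabilities, skip rates, balancer probabilities and whitening limits that are annealed as training progresses. A minimal piecewise-linear schedule of this kind is sketched below, purely as an illustration; the breakpoints are invented and do not reproduce the recipe's actual schedules.

from bisect import bisect_right

def scheduled_float(batch_count: float,
                    schedule=((0.0, 0.3), (20000.0, 0.125))) -> float:
    """Linearly interpolate between (batch_count, value) breakpoints,
    clamping to the end values outside the covered range."""
    xs = [x for x, _ in schedule]
    ys = [y for _, y in schedule]
    if batch_count <= xs[0]:
        return ys[0]
    if batch_count >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, batch_count)
    frac = (batch_count - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + frac * (ys[i] - ys[i - 1])

print(scheduled_float(729384.0))  # 0.125, i.e. the fully annealed value seen above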
], batch size: 510, lr: 7.47e-03, grad_scale: 32.0 2023-06-20 14:12:39,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=729804.0, ans=0.125 2023-06-20 14:12:44,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=729864.0, ans=0.0 2023-06-20 14:13:07,772 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.395e+02 3.286e+02 3.695e+02 4.286e+02 6.850e+02, threshold=7.389e+02, percent-clipped=1.0 2023-06-20 14:13:45,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=730044.0, ans=0.04949747468305833 2023-06-20 14:14:05,642 INFO [train.py:996] (2/4) Epoch 4, batch 30200, loss[loss=0.3428, simple_loss=0.4012, pruned_loss=0.1422, over 21458.00 frames. ], tot_loss[loss=0.268, simple_loss=0.3394, pruned_loss=0.09832, over 4271630.95 frames. ], batch size: 471, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:14:31,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.40 vs. limit=6.0 2023-06-20 14:14:46,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-20 14:15:55,840 INFO [train.py:996] (2/4) Epoch 4, batch 30250, loss[loss=0.2667, simple_loss=0.3198, pruned_loss=0.1068, over 20129.00 frames. ], tot_loss[loss=0.2733, simple_loss=0.345, pruned_loss=0.1008, over 4273114.70 frames. ], batch size: 707, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:16:09,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=730404.0, ans=0.2 2023-06-20 14:16:27,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.02 vs. limit=12.0 2023-06-20 14:16:32,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-20 14:16:35,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.109e+02 2.958e+02 3.586e+02 4.420e+02 6.930e+02, threshold=7.173e+02, percent-clipped=0.0 2023-06-20 14:16:43,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=730524.0, ans=0.125 2023-06-20 14:16:44,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=730524.0, ans=0.1 2023-06-20 14:16:49,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=730584.0, ans=0.2 2023-06-20 14:17:08,732 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:17:25,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-20 14:17:26,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=730644.0, ans=0.0 2023-06-20 14:17:37,308 INFO [train.py:996] (2/4) Epoch 4, batch 30300, loss[loss=0.2386, simple_loss=0.2985, pruned_loss=0.08933, over 21737.00 frames. 
], tot_loss[loss=0.2726, simple_loss=0.3429, pruned_loss=0.1011, over 4269840.52 frames. ], batch size: 351, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:18:13,751 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-20 14:18:58,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=730944.0, ans=0.125 2023-06-20 14:19:16,760 INFO [train.py:996] (2/4) Epoch 4, batch 30350, loss[loss=0.4, simple_loss=0.4629, pruned_loss=0.1686, over 21524.00 frames. ], tot_loss[loss=0.2749, simple_loss=0.3445, pruned_loss=0.1027, over 4270333.54 frames. ], batch size: 509, lr: 7.46e-03, grad_scale: 16.0 2023-06-20 14:19:30,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.13 vs. limit=15.0 2023-06-20 14:19:44,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=731064.0, ans=0.2 2023-06-20 14:19:53,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=731124.0, ans=0.025 2023-06-20 14:19:55,925 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.242e+02 3.707e+02 4.467e+02 6.879e+02, threshold=7.414e+02, percent-clipped=0.0 2023-06-20 14:20:09,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=731184.0, ans=0.95 2023-06-20 14:20:43,894 INFO [train.py:996] (2/4) Epoch 4, batch 30400, loss[loss=0.2965, simple_loss=0.375, pruned_loss=0.109, over 21208.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3394, pruned_loss=0.1002, over 4259992.69 frames. ], batch size: 549, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:21:06,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=731364.0, ans=0.125 2023-06-20 14:21:21,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=731424.0, ans=0.2 2023-06-20 14:21:26,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=731424.0, ans=0.2 2023-06-20 14:21:44,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=731544.0, ans=0.0 2023-06-20 14:22:05,361 INFO [train.py:996] (2/4) Epoch 4, batch 30450, loss[loss=0.3118, simple_loss=0.4341, pruned_loss=0.09478, over 19862.00 frames. ], tot_loss[loss=0.2704, simple_loss=0.3411, pruned_loss=0.09979, over 4202588.61 frames. 
], batch size: 702, lr: 7.46e-03, grad_scale: 32.0 2023-06-20 14:22:43,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.654e+02 3.992e+02 5.785e+02 8.199e+02 3.035e+03, threshold=1.157e+03, percent-clipped=30.0 2023-06-20 14:22:49,429 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:23:02,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=731784.0, ans=0.125 2023-06-20 14:25:02,488 INFO [train.py:996] (2/4) Epoch 5, batch 0, loss[loss=0.3088, simple_loss=0.3398, pruned_loss=0.139, over 21377.00 frames. ], tot_loss[loss=0.3088, simple_loss=0.3398, pruned_loss=0.139, over 21377.00 frames. ], batch size: 509, lr: 6.61e-03, grad_scale: 32.0 2023-06-20 14:25:02,489 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 14:25:18,282 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2519, simple_loss=0.3587, pruned_loss=0.07257, over 1796401.00 frames. 2023-06-20 14:25:18,282 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 14:25:50,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=731934.0, ans=0.0 2023-06-20 14:26:35,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-20 14:26:51,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=732114.0, ans=0.125 2023-06-20 14:26:54,774 INFO [train.py:996] (2/4) Epoch 5, batch 50, loss[loss=0.2284, simple_loss=0.3163, pruned_loss=0.07026, over 21433.00 frames. ], tot_loss[loss=0.2751, simple_loss=0.3459, pruned_loss=0.1021, over 961720.41 frames. ], batch size: 194, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:27:11,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=732174.0, ans=0.1 2023-06-20 14:27:34,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=732234.0, ans=0.0 2023-06-20 14:27:41,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=12.0 2023-06-20 14:27:49,694 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:27:53,562 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.548e+02 3.294e+02 4.063e+02 6.432e+02 1.595e+03, threshold=8.127e+02, percent-clipped=6.0 2023-06-20 14:27:55,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=732354.0, ans=0.125 2023-06-20 14:28:19,074 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-20 14:28:21,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=732414.0, ans=0.125 2023-06-20 14:28:32,497 INFO [train.py:996] (2/4) Epoch 5, batch 100, loss[loss=0.3202, simple_loss=0.3954, pruned_loss=0.1225, over 21763.00 frames. ], tot_loss[loss=0.2814, simple_loss=0.3562, pruned_loss=0.1033, over 1693975.46 frames. 
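At epoch 4, batch 30000 and again at the start of epoch 5, the trainer pauses to compute a validation loss over the dev set and then reports the peak CUDA memory observed so far (24834MB here). A hedged sketch of that periodic check, assuming a plain PyTorch DataLoader whose batches are (features, targets) pairs and a model that returns a scalar loss; the names compute_validation_loss, maybe_validate and valid_interval are illustrative.

import torch

@torch.no_grad()
def compute_validation_loss(model, valid_dl, device) -> float:
    model.eval()
    losses = []
    for feats, targets in valid_dl:                        # assumed batch layout
        losses.append(model(feats.to(device), targets.to(device)).item())
    model.train()
    return sum(losses) / max(len(losses), 1)

def maybe_validate(model, valid_dl, device, batch_idx: int,
                   valid_interval: int = 1000) -> None:    # interval is an assumption
    if batch_idx > 0 and batch_idx % valid_interval == 0:
        val = compute_validation_loss(model, valid_dl, device)
        peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
        print(f"validation: loss={val:.4f}; maximum memory allocated so far {peak_mb}MB")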
], batch size: 441, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:28:47,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=732474.0, ans=0.0 2023-06-20 14:28:51,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-20 14:29:06,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=732534.0, ans=0.125 2023-06-20 14:29:33,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=732654.0, ans=0.2 2023-06-20 14:29:54,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.09 vs. limit=10.0 2023-06-20 14:29:56,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=732714.0, ans=0.125 2023-06-20 14:30:04,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.71 vs. limit=15.0 2023-06-20 14:30:08,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=732774.0, ans=0.2 2023-06-20 14:30:09,622 INFO [train.py:996] (2/4) Epoch 5, batch 150, loss[loss=0.3013, simple_loss=0.3935, pruned_loss=0.1045, over 21646.00 frames. ], tot_loss[loss=0.2804, simple_loss=0.358, pruned_loss=0.1014, over 2267665.31 frames. ], batch size: 389, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:30:43,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=732834.0, ans=0.0 2023-06-20 14:30:51,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-20 14:31:07,045 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.823e+02 3.207e+02 3.913e+02 7.422e+02, threshold=6.414e+02, percent-clipped=0.0 2023-06-20 14:31:42,137 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-20 14:31:50,641 INFO [train.py:996] (2/4) Epoch 5, batch 200, loss[loss=0.267, simple_loss=0.3153, pruned_loss=0.1094, over 21752.00 frames. ], tot_loss[loss=0.2767, simple_loss=0.3532, pruned_loss=0.1001, over 2714133.60 frames. ], batch size: 112, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:32:41,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=733194.0, ans=0.0 2023-06-20 14:33:26,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=733314.0, ans=0.2 2023-06-20 14:33:29,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=733314.0, ans=0.05 2023-06-20 14:33:29,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=733314.0, ans=0.0 2023-06-20 14:33:32,182 INFO [train.py:996] (2/4) Epoch 5, batch 250, loss[loss=0.3321, simple_loss=0.3837, pruned_loss=0.1402, over 21379.00 frames. 
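Within epoch 4 the learning rate decays slowly with the batch count (7.48e-03 down to 7.46e-03 over roughly a thousand batches), then drops to 6.61e-03 as soon as epoch 5 starts, so the schedule depends on both the global batch index and the epoch index. One schedule with exactly this shape is the Eden-style rule sketched below; the functional form and the two time constants are stated as an assumption about the scheduler, not as the recipe's exact settings.

# Eden-style learning-rate rule: decays smoothly in both the global batch index
# and the epoch index. lr_batches and lr_epochs are placeholder time constants.
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float, lr_epochs: float) -> float:
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor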
], tot_loss[loss=0.2735, simple_loss=0.3483, pruned_loss=0.09937, over 3059905.29 frames. ], batch size: 507, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:33:40,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=733374.0, ans=0.0 2023-06-20 14:33:45,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-20 14:33:52,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.09 vs. limit=15.0 2023-06-20 14:34:30,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.953e+02 3.443e+02 4.116e+02 7.444e+02, threshold=6.886e+02, percent-clipped=2.0 2023-06-20 14:34:32,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-20 14:34:33,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=733554.0, ans=0.2 2023-06-20 14:35:14,599 INFO [train.py:996] (2/4) Epoch 5, batch 300, loss[loss=0.2644, simple_loss=0.3241, pruned_loss=0.1024, over 21443.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3411, pruned_loss=0.09823, over 3321908.79 frames. ], batch size: 211, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:35:23,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=733674.0, ans=0.125 2023-06-20 14:36:02,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=733794.0, ans=0.125 2023-06-20 14:36:04,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=733794.0, ans=0.125 2023-06-20 14:36:26,527 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-20 14:36:51,862 INFO [train.py:996] (2/4) Epoch 5, batch 350, loss[loss=0.2828, simple_loss=0.3355, pruned_loss=0.115, over 21434.00 frames. ], tot_loss[loss=0.2648, simple_loss=0.3347, pruned_loss=0.09742, over 3541607.63 frames. ], batch size: 194, lr: 6.60e-03, grad_scale: 16.0 2023-06-20 14:37:51,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.253e+02 2.828e+02 3.218e+02 3.899e+02 6.662e+02, threshold=6.437e+02, percent-clipped=0.0 2023-06-20 14:38:13,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=734154.0, ans=0.125 2023-06-20 14:38:15,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=734154.0, ans=0.125 2023-06-20 14:38:28,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=734214.0, ans=0.2 2023-06-20 14:38:33,745 INFO [train.py:996] (2/4) Epoch 5, batch 400, loss[loss=0.2076, simple_loss=0.2659, pruned_loss=0.07468, over 21571.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3311, pruned_loss=0.09626, over 3706716.85 frames. 
], batch size: 231, lr: 6.59e-03, grad_scale: 32.0 2023-06-20 14:38:56,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=734334.0, ans=0.0 2023-06-20 14:39:00,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=734334.0, ans=0.2 2023-06-20 14:40:15,236 INFO [train.py:996] (2/4) Epoch 5, batch 450, loss[loss=0.2299, simple_loss=0.3034, pruned_loss=0.07821, over 21574.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3293, pruned_loss=0.0948, over 3838421.70 frames. ], batch size: 263, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:41:20,212 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.914e+02 3.934e+02 5.626e+02 1.302e+03, threshold=7.868e+02, percent-clipped=18.0 2023-06-20 14:41:38,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=734814.0, ans=0.0 2023-06-20 14:41:55,828 INFO [train.py:996] (2/4) Epoch 5, batch 500, loss[loss=0.2589, simple_loss=0.3057, pruned_loss=0.1061, over 21386.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3263, pruned_loss=0.09326, over 3944278.16 frames. ], batch size: 212, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:42:51,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=734994.0, ans=0.125 2023-06-20 14:43:07,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-20 14:43:37,128 INFO [train.py:996] (2/4) Epoch 5, batch 550, loss[loss=0.2865, simple_loss=0.3411, pruned_loss=0.1159, over 21794.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3295, pruned_loss=0.09265, over 4018980.85 frames. ], batch size: 112, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:43:58,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=735234.0, ans=0.0 2023-06-20 14:44:26,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=15.0 2023-06-20 14:44:35,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-20 14:44:36,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.996e+02 3.563e+02 4.226e+02 6.619e+02, threshold=7.127e+02, percent-clipped=0.0 2023-06-20 14:44:45,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=735354.0, ans=0.1 2023-06-20 14:45:16,881 INFO [train.py:996] (2/4) Epoch 5, batch 600, loss[loss=0.2261, simple_loss=0.2887, pruned_loss=0.08181, over 21186.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3327, pruned_loss=0.09296, over 4078944.13 frames. 
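The grad_scale field reflects dynamic mixed-precision loss scaling: it sits at a power of two, is doubled from time to time (16.0 to 32.0 near batch 29600 of epoch 4), and is halved again when a batch produces non-finite gradients (32.0 back to 16.0 around epoch 5, batch 450). This matches the standard behaviour of PyTorch's GradScaler; a minimal loop using it is sketched below, with a toy model and random data standing in for the real training step.

import torch

model = torch.nn.Linear(80, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()         # dynamic scale, like the logged grad_scale

for step in range(100):
    feats = torch.randn(8, 80, device="cuda")
    targets = torch.randint(0, 10, (8,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.cross_entropy(model(feats), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)                    # skipped if gradients are non-finite
    scaler.update()                           # halves the scale on overflow, grows it otherwise
    if step % 50 == 0:
        print(f"step {step}: grad_scale={scaler.get_scale()}")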
], batch size: 143, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:45:45,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=735534.0, ans=0.125 2023-06-20 14:45:52,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=735534.0, ans=15.0 2023-06-20 14:46:02,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=735594.0, ans=0.04949747468305833 2023-06-20 14:46:51,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=735714.0, ans=0.125 2023-06-20 14:46:59,131 INFO [train.py:996] (2/4) Epoch 5, batch 650, loss[loss=0.2665, simple_loss=0.3505, pruned_loss=0.09123, over 21843.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3334, pruned_loss=0.09364, over 4123746.86 frames. ], batch size: 298, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:46:59,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=735774.0, ans=0.2 2023-06-20 14:47:04,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=735774.0, ans=0.125 2023-06-20 14:47:24,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=735834.0, ans=10.0 2023-06-20 14:47:33,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=735834.0, ans=0.0 2023-06-20 14:47:35,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=735894.0, ans=0.0 2023-06-20 14:47:43,282 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 14:47:53,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=735894.0, ans=10.0 2023-06-20 14:47:58,879 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.943e+02 3.470e+02 4.276e+02 7.197e+02, threshold=6.941e+02, percent-clipped=1.0 2023-06-20 14:48:11,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=735954.0, ans=0.125 2023-06-20 14:48:33,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=736014.0, ans=0.2 2023-06-20 14:48:40,863 INFO [train.py:996] (2/4) Epoch 5, batch 700, loss[loss=0.2891, simple_loss=0.3527, pruned_loss=0.1128, over 21399.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3362, pruned_loss=0.09593, over 4156958.55 frames. ], batch size: 194, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:48:43,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.83 vs. 
limit=22.5 2023-06-20 14:49:00,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=736134.0, ans=0.125 2023-06-20 14:49:59,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=736254.0, ans=0.0 2023-06-20 14:50:21,192 INFO [train.py:996] (2/4) Epoch 5, batch 750, loss[loss=0.2709, simple_loss=0.3552, pruned_loss=0.09328, over 21436.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3342, pruned_loss=0.09637, over 4195100.32 frames. ], batch size: 211, lr: 6.59e-03, grad_scale: 16.0 2023-06-20 14:50:25,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-20 14:50:47,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=736434.0, ans=0.125 2023-06-20 14:51:18,447 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-20 14:51:22,103 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.354e+02 2.979e+02 3.412e+02 4.334e+02 7.194e+02, threshold=6.824e+02, percent-clipped=1.0 2023-06-20 14:51:52,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-20 14:52:04,150 INFO [train.py:996] (2/4) Epoch 5, batch 800, loss[loss=0.268, simple_loss=0.3254, pruned_loss=0.1052, over 21769.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3303, pruned_loss=0.09598, over 4214550.45 frames. ], batch size: 414, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:52:08,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=736674.0, ans=0.125 2023-06-20 14:52:28,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-20 14:53:08,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=736854.0, ans=0.5 2023-06-20 14:53:27,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=736854.0, ans=0.125 2023-06-20 14:53:46,648 INFO [train.py:996] (2/4) Epoch 5, batch 850, loss[loss=0.2656, simple_loss=0.3249, pruned_loss=0.1031, over 21873.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3284, pruned_loss=0.09527, over 4231096.73 frames. ], batch size: 414, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:54:21,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=737034.0, ans=0.0 2023-06-20 14:54:57,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.960e+02 3.459e+02 4.425e+02 7.988e+02, threshold=6.917e+02, percent-clipped=3.0 2023-06-20 14:55:20,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=737214.0, ans=0.0 2023-06-20 14:55:32,833 INFO [train.py:996] (2/4) Epoch 5, batch 900, loss[loss=0.2415, simple_loss=0.3048, pruned_loss=0.08908, over 21652.00 frames. 
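The [scaling.py:962] Whitening lines compare a per-module statistic of the activations against a scheduled limit (for example metric=11.67 vs. limit=15.0 above); the metric measures how far the feature covariance is from being proportional to the identity, and only when it exceeds the limit does the module push the features back toward a whiter covariance. One simple statistic with that behaviour is the ratio of the mean squared eigenvalue of the covariance to the square of its mean eigenvalue, which is exactly 1.0 for a perfectly white covariance; the sketch below computes that ratio as an illustration of the idea, not as the exact metric implemented in scaling.py.

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """x: (num_frames, num_channels); channels are split into num_groups groups.
    Returns mean(eig^2) / mean(eig)^2 of the per-group covariance, averaged over
    groups: 1.0 means perfectly white, larger means a more skewed spectrum."""
    num_frames, num_channels = x.shape
    x = x.reshape(num_frames, num_groups, num_channels // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    metrics = []
    for g in range(num_groups):
        cov = x[:, g, :].T @ x[:, g, :] / num_frames
        eigs = torch.linalg.eigvalsh(cov)
        metrics.append((eigs.pow(2).mean() / eigs.mean().pow(2)).item())
    return sum(metrics) / len(metrics)

print(whitening_metric(torch.randn(1000, 256)))  # slightly above 1.0 for sampled white noise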
], tot_loss[loss=0.259, simple_loss=0.3268, pruned_loss=0.09559, over 4246744.24 frames. ], batch size: 230, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:56:05,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=737334.0, ans=0.0 2023-06-20 14:57:13,145 INFO [train.py:996] (2/4) Epoch 5, batch 950, loss[loss=0.1894, simple_loss=0.2794, pruned_loss=0.04973, over 21639.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3246, pruned_loss=0.09504, over 4254478.36 frames. ], batch size: 230, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 14:57:55,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.98 vs. limit=15.0 2023-06-20 14:58:18,909 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.716e+02 3.192e+02 3.705e+02 5.586e+02, threshold=6.385e+02, percent-clipped=0.0 2023-06-20 14:58:54,045 INFO [train.py:996] (2/4) Epoch 5, batch 1000, loss[loss=0.2097, simple_loss=0.2736, pruned_loss=0.07287, over 21449.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3241, pruned_loss=0.0934, over 4264054.58 frames. ], batch size: 212, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 15:00:12,333 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-20 15:00:31,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=738114.0, ans=0.125 2023-06-20 15:00:32,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=738114.0, ans=0.125 2023-06-20 15:00:37,269 INFO [train.py:996] (2/4) Epoch 5, batch 1050, loss[loss=0.2298, simple_loss=0.3066, pruned_loss=0.0765, over 21363.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3238, pruned_loss=0.09311, over 4273720.38 frames. ], batch size: 176, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 15:00:40,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-20 15:01:45,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.880e+02 3.329e+02 4.012e+02 6.640e+02, threshold=6.657e+02, percent-clipped=1.0 2023-06-20 15:01:59,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=738414.0, ans=0.035 2023-06-20 15:02:22,416 INFO [train.py:996] (2/4) Epoch 5, batch 1100, loss[loss=0.27, simple_loss=0.3289, pruned_loss=0.1056, over 21508.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.324, pruned_loss=0.09254, over 4276657.58 frames. ], batch size: 211, lr: 6.58e-03, grad_scale: 32.0 2023-06-20 15:03:01,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=738534.0, ans=0.07 2023-06-20 15:03:17,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=738594.0, ans=0.125 2023-06-20 15:03:19,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.00 vs. 
limit=15.0 2023-06-20 15:03:24,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=738594.0, ans=0.0 2023-06-20 15:03:46,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=738654.0, ans=0.125 2023-06-20 15:04:10,624 INFO [train.py:996] (2/4) Epoch 5, batch 1150, loss[loss=0.2375, simple_loss=0.3233, pruned_loss=0.07589, over 21318.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3265, pruned_loss=0.09351, over 4279478.22 frames. ], batch size: 548, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:04:35,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=738774.0, ans=0.1 2023-06-20 15:05:02,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=738894.0, ans=0.125 2023-06-20 15:05:18,295 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.094e+02 2.853e+02 3.547e+02 4.502e+02 9.164e+02, threshold=7.095e+02, percent-clipped=7.0 2023-06-20 15:05:20,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=738954.0, ans=0.125 2023-06-20 15:05:59,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=739074.0, ans=0.05 2023-06-20 15:06:05,921 INFO [train.py:996] (2/4) Epoch 5, batch 1200, loss[loss=0.2841, simple_loss=0.3432, pruned_loss=0.1125, over 21619.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3287, pruned_loss=0.09479, over 4278090.51 frames. ], batch size: 471, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:06:15,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=739074.0, ans=0.0 2023-06-20 15:06:34,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=739134.0, ans=0.125 2023-06-20 15:07:26,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.01 vs. limit=10.0 2023-06-20 15:07:48,982 INFO [train.py:996] (2/4) Epoch 5, batch 1250, loss[loss=0.2601, simple_loss=0.3335, pruned_loss=0.09333, over 21353.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3299, pruned_loss=0.09412, over 4280369.79 frames. ], batch size: 548, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:07:51,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-20 15:08:37,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=739494.0, ans=0.2 2023-06-20 15:08:45,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.897e+02 3.475e+02 4.049e+02 7.365e+02, threshold=6.950e+02, percent-clipped=1.0 2023-06-20 15:09:33,296 INFO [train.py:996] (2/4) Epoch 5, batch 1300, loss[loss=0.2281, simple_loss=0.3055, pruned_loss=0.07532, over 21443.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3317, pruned_loss=0.09444, over 4289731.94 frames. 
], batch size: 195, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:09:33,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=739674.0, ans=0.1 2023-06-20 15:09:56,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=739734.0, ans=0.125 2023-06-20 15:10:00,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=739734.0, ans=0.125 2023-06-20 15:10:06,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=739794.0, ans=0.0 2023-06-20 15:10:44,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=739854.0, ans=0.125 2023-06-20 15:10:48,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=739914.0, ans=0.125 2023-06-20 15:11:02,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=739914.0, ans=0.125 2023-06-20 15:11:16,847 INFO [train.py:996] (2/4) Epoch 5, batch 1350, loss[loss=0.2579, simple_loss=0.3355, pruned_loss=0.09015, over 21811.00 frames. ], tot_loss[loss=0.2623, simple_loss=0.3333, pruned_loss=0.09562, over 4283354.09 frames. ], batch size: 351, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:11:17,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=739974.0, ans=0.125 2023-06-20 15:11:22,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=739974.0, ans=0.0 2023-06-20 15:11:46,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=740034.0, ans=0.04949747468305833 2023-06-20 15:12:06,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=740094.0, ans=0.2 2023-06-20 15:12:12,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.977e+02 3.601e+02 4.425e+02 6.870e+02, threshold=7.202e+02, percent-clipped=0.0 2023-06-20 15:13:00,173 INFO [train.py:996] (2/4) Epoch 5, batch 1400, loss[loss=0.2271, simple_loss=0.304, pruned_loss=0.07507, over 21842.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.331, pruned_loss=0.09576, over 4290343.02 frames. ], batch size: 372, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:14:00,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=740454.0, ans=0.125 2023-06-20 15:14:06,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. limit=6.0 2023-06-20 15:14:34,450 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:14:43,309 INFO [train.py:996] (2/4) Epoch 5, batch 1450, loss[loss=0.2618, simple_loss=0.331, pruned_loss=0.09635, over 21724.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3314, pruned_loss=0.09722, over 4292772.49 frames. 
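tot_loss[...] is not an exact sum over the epoch: the accompanying frame counts are fractional (for example over 4290343.02 frames above), which is what decayed running totals produce when both the loss and the frame count are multiplied by a forgetting factor before each new batch is added. A small sketch of such an accumulator follows; the decay constant is an arbitrary illustrative choice.

# Decayed running totals for loss and frame count, yielding fractional
# "over N frames" values like those in the log. The 0.999 decay is illustrative.
class RunningLoss:
    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.loss_sum = 0.0
        self.frame_sum = 0.0

    def update(self, per_frame_loss: float, num_frames: int) -> None:
        self.loss_sum = self.decay * self.loss_sum + per_frame_loss * num_frames
        self.frame_sum = self.decay * self.frame_sum + num_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frame_sum, 1.0)

tot = RunningLoss()
for _ in range(200):
    tot.update(0.26, 21000)
print(f"tot_loss={tot.value:.4f} over {tot.frame_sum:.2f} frames")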
], batch size: 332, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:14:48,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=740574.0, ans=0.1 2023-06-20 15:15:06,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=740634.0, ans=0.2 2023-06-20 15:15:17,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=740694.0, ans=0.0 2023-06-20 15:15:17,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-20 15:15:22,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=740694.0, ans=0.125 2023-06-20 15:15:37,772 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 2.878e+02 3.343e+02 3.968e+02 7.161e+02, threshold=6.685e+02, percent-clipped=0.0 2023-06-20 15:15:40,499 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-20 15:16:21,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=740814.0, ans=0.125 2023-06-20 15:16:24,296 INFO [train.py:996] (2/4) Epoch 5, batch 1500, loss[loss=0.3135, simple_loss=0.4003, pruned_loss=0.1134, over 21515.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3326, pruned_loss=0.09822, over 4298442.00 frames. ], batch size: 471, lr: 6.57e-03, grad_scale: 32.0 2023-06-20 15:17:11,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=740994.0, ans=0.2 2023-06-20 15:17:18,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=741054.0, ans=0.125 2023-06-20 15:18:08,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=741174.0, ans=0.0 2023-06-20 15:18:10,007 INFO [train.py:996] (2/4) Epoch 5, batch 1550, loss[loss=0.2909, simple_loss=0.3483, pruned_loss=0.1168, over 21229.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3301, pruned_loss=0.09703, over 4300402.78 frames. ], batch size: 143, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:18:21,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.83 vs. limit=15.0 2023-06-20 15:19:09,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 2.929e+02 3.428e+02 4.021e+02 6.196e+02, threshold=6.855e+02, percent-clipped=0.0 2023-06-20 15:19:51,921 INFO [train.py:996] (2/4) Epoch 5, batch 1600, loss[loss=0.267, simple_loss=0.3701, pruned_loss=0.082, over 20890.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3294, pruned_loss=0.09522, over 4298465.84 frames. ], batch size: 607, lr: 6.56e-03, grad_scale: 32.0 2023-06-20 15:20:01,769 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.18 vs. 
limit=15.0 2023-06-20 15:20:23,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=741534.0, ans=0.125 2023-06-20 15:21:26,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=741714.0, ans=0.5 2023-06-20 15:21:36,268 INFO [train.py:996] (2/4) Epoch 5, batch 1650, loss[loss=0.219, simple_loss=0.3184, pruned_loss=0.05982, over 21770.00 frames. ], tot_loss[loss=0.259, simple_loss=0.329, pruned_loss=0.09449, over 4291715.36 frames. ], batch size: 351, lr: 6.56e-03, grad_scale: 32.0 2023-06-20 15:21:48,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=741774.0, ans=0.125 2023-06-20 15:22:00,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=741834.0, ans=0.125 2023-06-20 15:22:31,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=741894.0, ans=0.07 2023-06-20 15:22:51,494 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.331e+02 3.044e+02 3.519e+02 4.349e+02 7.461e+02, threshold=7.039e+02, percent-clipped=1.0 2023-06-20 15:22:58,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=741954.0, ans=10.0 2023-06-20 15:23:19,940 INFO [train.py:996] (2/4) Epoch 5, batch 1700, loss[loss=0.2941, simple_loss=0.3512, pruned_loss=0.1185, over 21840.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3313, pruned_loss=0.09543, over 4291676.51 frames. ], batch size: 298, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:24:06,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=742194.0, ans=0.1 2023-06-20 15:24:21,075 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.029e-02 2023-06-20 15:24:35,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=742254.0, ans=0.0 2023-06-20 15:24:52,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=742314.0, ans=0.125 2023-06-20 15:24:56,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=742314.0, ans=0.0 2023-06-20 15:25:00,590 INFO [train.py:996] (2/4) Epoch 5, batch 1750, loss[loss=0.2213, simple_loss=0.3032, pruned_loss=0.06966, over 21785.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3301, pruned_loss=0.09376, over 4286759.94 frames. ], batch size: 316, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:25:58,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-20 15:26:03,510 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.06 vs. 
limit=12.0 2023-06-20 15:26:13,032 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.877e+02 3.732e+02 4.371e+02 8.077e+02, threshold=7.464e+02, percent-clipped=3.0 2023-06-20 15:26:22,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=742554.0, ans=0.0 2023-06-20 15:26:45,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=742674.0, ans=0.125 2023-06-20 15:26:46,875 INFO [train.py:996] (2/4) Epoch 5, batch 1800, loss[loss=0.2619, simple_loss=0.3216, pruned_loss=0.1011, over 21181.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3264, pruned_loss=0.09095, over 4278924.26 frames. ], batch size: 607, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:26:58,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=742674.0, ans=0.2 2023-06-20 15:28:03,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=742854.0, ans=0.0 2023-06-20 15:28:20,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.89 vs. limit=22.5 2023-06-20 15:28:24,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=742914.0, ans=0.125 2023-06-20 15:28:29,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=742974.0, ans=0.1 2023-06-20 15:28:30,595 INFO [train.py:996] (2/4) Epoch 5, batch 1850, loss[loss=0.2072, simple_loss=0.296, pruned_loss=0.05919, over 21742.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3263, pruned_loss=0.08798, over 4274241.83 frames. ], batch size: 298, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:28:46,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=742974.0, ans=0.2 2023-06-20 15:29:02,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=743034.0, ans=0.95 2023-06-20 15:29:36,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=743154.0, ans=0.0 2023-06-20 15:29:40,610 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.730e+02 3.304e+02 4.070e+02 7.005e+02, threshold=6.608e+02, percent-clipped=0.0 2023-06-20 15:30:03,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=743214.0, ans=0.0 2023-06-20 15:30:18,513 INFO [train.py:996] (2/4) Epoch 5, batch 1900, loss[loss=0.2481, simple_loss=0.3026, pruned_loss=0.09681, over 21861.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3255, pruned_loss=0.0884, over 4278121.88 frames. 
], batch size: 118, lr: 6.56e-03, grad_scale: 16.0 2023-06-20 15:30:46,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=743334.0, ans=15.0 2023-06-20 15:30:50,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=743334.0, ans=0.125 2023-06-20 15:31:37,985 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:32:01,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=743574.0, ans=0.125 2023-06-20 15:32:02,682 INFO [train.py:996] (2/4) Epoch 5, batch 1950, loss[loss=0.2099, simple_loss=0.2872, pruned_loss=0.06626, over 21559.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3217, pruned_loss=0.08807, over 4276812.30 frames. ], batch size: 263, lr: 6.55e-03, grad_scale: 16.0 2023-06-20 15:32:33,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=743634.0, ans=0.2 2023-06-20 15:33:10,455 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 3.018e+02 3.404e+02 4.237e+02 8.100e+02, threshold=6.807e+02, percent-clipped=3.0 2023-06-20 15:33:40,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=743814.0, ans=0.1 2023-06-20 15:33:49,099 INFO [train.py:996] (2/4) Epoch 5, batch 2000, loss[loss=0.2406, simple_loss=0.2976, pruned_loss=0.09181, over 20793.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3184, pruned_loss=0.08765, over 4276512.71 frames. ], batch size: 607, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:34:09,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=743874.0, ans=0.125 2023-06-20 15:34:17,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=743934.0, ans=0.04949747468305833 2023-06-20 15:34:26,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=743934.0, ans=0.0 2023-06-20 15:34:28,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=743934.0, ans=0.125 2023-06-20 15:34:28,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=743934.0, ans=0.0 2023-06-20 15:34:32,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=743994.0, ans=0.125 2023-06-20 15:35:07,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.48 vs. limit=10.0 2023-06-20 15:35:21,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=744114.0, ans=0.05 2023-06-20 15:35:33,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=744174.0, ans=0.125 2023-06-20 15:35:34,240 INFO [train.py:996] (2/4) Epoch 5, batch 2050, loss[loss=0.2471, simple_loss=0.3253, pruned_loss=0.08441, over 21875.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3205, pruned_loss=0.08864, over 4281929.12 frames. 
], batch size: 107, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:36:11,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.68 vs. limit=10.0 2023-06-20 15:36:18,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=744294.0, ans=0.0 2023-06-20 15:36:36,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=744354.0, ans=0.95 2023-06-20 15:36:39,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.193e+02 2.809e+02 3.320e+02 4.171e+02 6.443e+02, threshold=6.640e+02, percent-clipped=0.0 2023-06-20 15:36:41,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=744354.0, ans=0.2 2023-06-20 15:36:53,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.33 vs. limit=15.0 2023-06-20 15:37:12,251 INFO [train.py:996] (2/4) Epoch 5, batch 2100, loss[loss=0.3007, simple_loss=0.3734, pruned_loss=0.114, over 21892.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3226, pruned_loss=0.08999, over 4286471.76 frames. ], batch size: 316, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:37:54,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=744534.0, ans=0.125 2023-06-20 15:38:14,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=744594.0, ans=0.125 2023-06-20 15:38:46,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=744714.0, ans=0.125 2023-06-20 15:39:06,031 INFO [train.py:996] (2/4) Epoch 5, batch 2150, loss[loss=0.2304, simple_loss=0.2934, pruned_loss=0.08372, over 21665.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.326, pruned_loss=0.09231, over 4282845.74 frames. ], batch size: 333, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:39:16,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=744774.0, ans=0.0 2023-06-20 15:39:25,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=744774.0, ans=0.1 2023-06-20 15:39:38,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=744834.0, ans=0.125 2023-06-20 15:40:02,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn1.whiten.whitening_limit, batch_count=744894.0, ans=22.5 2023-06-20 15:40:11,005 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 3.145e+02 3.806e+02 5.141e+02 9.299e+02, threshold=7.611e+02, percent-clipped=10.0 2023-06-20 15:40:41,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=745014.0, ans=0.125 2023-06-20 15:40:44,303 INFO [train.py:996] (2/4) Epoch 5, batch 2200, loss[loss=0.224, simple_loss=0.2957, pruned_loss=0.07615, over 21214.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3268, pruned_loss=0.0926, over 4276946.06 frames. 
], batch size: 176, lr: 6.55e-03, grad_scale: 32.0 2023-06-20 15:41:28,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=745194.0, ans=0.0 2023-06-20 15:42:06,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=745254.0, ans=0.2 2023-06-20 15:42:32,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=745314.0, ans=0.05 2023-06-20 15:42:40,446 INFO [train.py:996] (2/4) Epoch 5, batch 2250, loss[loss=0.2583, simple_loss=0.3208, pruned_loss=0.09787, over 21783.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3239, pruned_loss=0.09029, over 4277322.67 frames. ], batch size: 371, lr: 6.55e-03, grad_scale: 16.0 2023-06-20 15:43:12,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=745494.0, ans=0.0 2023-06-20 15:43:34,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-20 15:43:46,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.841e+02 2.695e+02 3.113e+02 3.722e+02 7.366e+02, threshold=6.226e+02, percent-clipped=0.0 2023-06-20 15:43:52,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-20 15:43:53,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=745554.0, ans=0.125 2023-06-20 15:44:23,186 INFO [train.py:996] (2/4) Epoch 5, batch 2300, loss[loss=0.2562, simple_loss=0.355, pruned_loss=0.07873, over 20738.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3179, pruned_loss=0.08902, over 4267865.94 frames. ], batch size: 608, lr: 6.54e-03, grad_scale: 16.0 2023-06-20 15:44:34,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=745674.0, ans=0.125 2023-06-20 15:44:40,792 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:45:17,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=745854.0, ans=0.125 2023-06-20 15:46:06,137 INFO [train.py:996] (2/4) Epoch 5, batch 2350, loss[loss=0.239, simple_loss=0.2977, pruned_loss=0.09017, over 21175.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3155, pruned_loss=0.08919, over 4264070.21 frames. 
], batch size: 176, lr: 6.54e-03, grad_scale: 16.0 2023-06-20 15:46:50,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=746094.0, ans=0.05 2023-06-20 15:47:12,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.470e+02 3.194e+02 3.678e+02 4.653e+02 7.153e+02, threshold=7.356e+02, percent-clipped=3.0 2023-06-20 15:47:13,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=746154.0, ans=0.125 2023-06-20 15:47:42,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=746214.0, ans=0.125 2023-06-20 15:47:46,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=746214.0, ans=0.2 2023-06-20 15:47:50,587 INFO [train.py:996] (2/4) Epoch 5, batch 2400, loss[loss=0.3028, simple_loss=0.3567, pruned_loss=0.1245, over 21468.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3186, pruned_loss=0.09182, over 4266233.01 frames. ], batch size: 211, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:47:52,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=746274.0, ans=0.125 2023-06-20 15:48:05,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=746274.0, ans=0.125 2023-06-20 15:48:27,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=746334.0, ans=0.125 2023-06-20 15:49:16,571 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=22.5 2023-06-20 15:49:22,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=746514.0, ans=0.5 2023-06-20 15:49:36,781 INFO [train.py:996] (2/4) Epoch 5, batch 2450, loss[loss=0.2356, simple_loss=0.3135, pruned_loss=0.07882, over 21815.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3272, pruned_loss=0.09562, over 4261237.35 frames. ], batch size: 118, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:50:34,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=746754.0, ans=0.0 2023-06-20 15:50:47,934 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 3.072e+02 3.519e+02 4.096e+02 7.474e+02, threshold=7.039e+02, percent-clipped=1.0 2023-06-20 15:51:12,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-20 15:51:18,598 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 15:51:19,746 INFO [train.py:996] (2/4) Epoch 5, batch 2500, loss[loss=0.246, simple_loss=0.3041, pruned_loss=0.09388, over 21237.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.326, pruned_loss=0.09569, over 4265196.99 frames. ], batch size: 159, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:51:20,787 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.56 vs. 
limit=15.0 2023-06-20 15:52:18,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=747054.0, ans=0.1 2023-06-20 15:52:40,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=747114.0, ans=0.125 2023-06-20 15:52:59,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=747114.0, ans=0.0 2023-06-20 15:53:02,347 INFO [train.py:996] (2/4) Epoch 5, batch 2550, loss[loss=0.2394, simple_loss=0.2948, pruned_loss=0.092, over 21527.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3238, pruned_loss=0.09437, over 4263813.98 frames. ], batch size: 391, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:53:09,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=747174.0, ans=0.0 2023-06-20 15:53:20,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-20 15:54:03,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=747354.0, ans=0.1 2023-06-20 15:54:06,981 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 2.818e+02 3.202e+02 3.816e+02 6.010e+02, threshold=6.403e+02, percent-clipped=0.0 2023-06-20 15:54:44,538 INFO [train.py:996] (2/4) Epoch 5, batch 2600, loss[loss=0.3013, simple_loss=0.3576, pruned_loss=0.1225, over 21321.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3241, pruned_loss=0.09447, over 4268893.84 frames. ], batch size: 143, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:54:50,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=747474.0, ans=0.2 2023-06-20 15:55:19,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=747534.0, ans=0.0 2023-06-20 15:55:29,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=747594.0, ans=0.0 2023-06-20 15:55:34,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=747594.0, ans=0.1 2023-06-20 15:55:49,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=747654.0, ans=0.125 2023-06-20 15:55:58,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=747654.0, ans=0.125 2023-06-20 15:56:27,531 INFO [train.py:996] (2/4) Epoch 5, batch 2650, loss[loss=0.2296, simple_loss=0.3191, pruned_loss=0.07, over 21852.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3274, pruned_loss=0.09713, over 4277199.67 frames. ], batch size: 282, lr: 6.54e-03, grad_scale: 32.0 2023-06-20 15:56:36,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.19 vs. 
limit=15.0 2023-06-20 15:57:02,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=747834.0, ans=0.2 2023-06-20 15:57:04,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=747894.0, ans=0.125 2023-06-20 15:57:39,252 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.025e+02 3.669e+02 4.481e+02 6.938e+02, threshold=7.338e+02, percent-clipped=2.0 2023-06-20 15:57:59,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748014.0, ans=0.1 2023-06-20 15:58:10,859 INFO [train.py:996] (2/4) Epoch 5, batch 2700, loss[loss=0.2073, simple_loss=0.2696, pruned_loss=0.07253, over 21461.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3257, pruned_loss=0.09619, over 4278582.88 frames. ], batch size: 195, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 15:58:21,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=748074.0, ans=0.2 2023-06-20 15:59:02,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=748194.0, ans=0.5 2023-06-20 15:59:11,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=748254.0, ans=0.0 2023-06-20 15:59:16,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=748254.0, ans=0.0 2023-06-20 15:59:52,551 INFO [train.py:996] (2/4) Epoch 5, batch 2750, loss[loss=0.3181, simple_loss=0.3624, pruned_loss=0.1369, over 21726.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3247, pruned_loss=0.09629, over 4279309.20 frames. ], batch size: 473, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 16:00:41,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=748494.0, ans=0.2 2023-06-20 16:00:49,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=748494.0, ans=0.1 2023-06-20 16:01:03,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=748554.0, ans=0.2 2023-06-20 16:01:06,309 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 3.120e+02 3.768e+02 4.829e+02 8.745e+02, threshold=7.536e+02, percent-clipped=5.0 2023-06-20 16:01:13,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=748554.0, ans=0.125 2023-06-20 16:01:35,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=748674.0, ans=0.125 2023-06-20 16:01:36,391 INFO [train.py:996] (2/4) Epoch 5, batch 2800, loss[loss=0.2223, simple_loss=0.3445, pruned_loss=0.05011, over 19725.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3298, pruned_loss=0.09719, over 4281284.81 frames. ], batch size: 702, lr: 6.53e-03, grad_scale: 32.0 2023-06-20 16:01:50,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=748674.0, ans=0.125 2023-06-20 16:03:27,355 INFO [train.py:996] (2/4) Epoch 5, batch 2850, loss[loss=0.252, simple_loss=0.3646, pruned_loss=0.06974, over 19856.00 frames. 
], tot_loss[loss=0.2631, simple_loss=0.3319, pruned_loss=0.09716, over 4280474.80 frames. ], batch size: 704, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:03:27,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=748974.0, ans=0.5 2023-06-20 16:04:02,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=749034.0, ans=0.2 2023-06-20 16:04:35,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=749154.0, ans=0.0 2023-06-20 16:04:42,075 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.414e+02 3.197e+02 3.952e+02 5.010e+02 9.652e+02, threshold=7.904e+02, percent-clipped=7.0 2023-06-20 16:04:44,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=749154.0, ans=0.125 2023-06-20 16:05:10,317 INFO [train.py:996] (2/4) Epoch 5, batch 2900, loss[loss=0.2198, simple_loss=0.2658, pruned_loss=0.08684, over 20727.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3286, pruned_loss=0.09643, over 4287677.41 frames. ], batch size: 608, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:05:16,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-20 16:05:17,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=749274.0, ans=0.1 2023-06-20 16:05:38,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=749334.0, ans=0.125 2023-06-20 16:06:01,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-20 16:06:38,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=749514.0, ans=0.125 2023-06-20 16:06:52,777 INFO [train.py:996] (2/4) Epoch 5, batch 2950, loss[loss=0.2485, simple_loss=0.3209, pruned_loss=0.08807, over 21710.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3289, pruned_loss=0.09664, over 4295157.63 frames. ], batch size: 112, lr: 6.53e-03, grad_scale: 16.0 2023-06-20 16:07:31,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=749694.0, ans=0.1 2023-06-20 16:08:05,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=749754.0, ans=10.0 2023-06-20 16:08:10,265 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.876e+02 3.268e+02 4.025e+02 7.097e+02, threshold=6.536e+02, percent-clipped=0.0 2023-06-20 16:08:21,037 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-20 16:08:36,224 INFO [train.py:996] (2/4) Epoch 5, batch 3000, loss[loss=0.2878, simple_loss=0.3532, pruned_loss=0.1112, over 21338.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3337, pruned_loss=0.09857, over 4293902.64 frames. 
], batch size: 548, lr: 6.53e-03, grad_scale: 8.0 2023-06-20 16:08:36,224 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 16:08:55,138 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2579, simple_loss=0.3533, pruned_loss=0.08129, over 1796401.00 frames. 2023-06-20 16:08:55,139 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 16:08:57,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=749874.0, ans=0.125 2023-06-20 16:09:12,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.40 vs. limit=22.5 2023-06-20 16:09:31,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=749934.0, ans=0.125 2023-06-20 16:09:35,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-20 16:09:49,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=749994.0, ans=0.5 2023-06-20 16:09:50,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=749994.0, ans=0.125 2023-06-20 16:10:18,026 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=22.5 2023-06-20 16:10:39,786 INFO [train.py:996] (2/4) Epoch 5, batch 3050, loss[loss=0.2342, simple_loss=0.3317, pruned_loss=0.06831, over 19863.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3333, pruned_loss=0.09662, over 4293049.62 frames. ], batch size: 703, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:11:50,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=750354.0, ans=0.125 2023-06-20 16:11:52,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=750354.0, ans=0.125 2023-06-20 16:11:58,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.749e+02 3.160e+02 3.983e+02 6.617e+02, threshold=6.320e+02, percent-clipped=1.0 2023-06-20 16:12:25,612 INFO [train.py:996] (2/4) Epoch 5, batch 3100, loss[loss=0.236, simple_loss=0.3286, pruned_loss=0.07173, over 21714.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3325, pruned_loss=0.09502, over 4283543.18 frames. ], batch size: 351, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:12:57,309 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:12:58,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=750534.0, ans=0.125 2023-06-20 16:14:15,797 INFO [train.py:996] (2/4) Epoch 5, batch 3150, loss[loss=0.3438, simple_loss=0.393, pruned_loss=0.1473, over 21272.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3333, pruned_loss=0.09509, over 4277957.15 frames. ], batch size: 143, lr: 6.52e-03, grad_scale: 8.0 2023-06-20 16:14:40,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.52 vs. 
limit=22.5 2023-06-20 16:15:28,149 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.671e+02 3.239e+02 3.868e+02 6.706e+02, threshold=6.479e+02, percent-clipped=2.0 2023-06-20 16:15:42,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=751014.0, ans=0.1 2023-06-20 16:15:55,784 INFO [train.py:996] (2/4) Epoch 5, batch 3200, loss[loss=0.261, simple_loss=0.3368, pruned_loss=0.09259, over 21811.00 frames. ], tot_loss[loss=0.2634, simple_loss=0.3358, pruned_loss=0.09553, over 4276075.25 frames. ], batch size: 282, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:16:29,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=751134.0, ans=0.1 2023-06-20 16:17:23,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=751314.0, ans=0.125 2023-06-20 16:17:38,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=751374.0, ans=0.04949747468305833 2023-06-20 16:17:38,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=751374.0, ans=0.1 2023-06-20 16:17:40,159 INFO [train.py:996] (2/4) Epoch 5, batch 3250, loss[loss=0.2496, simple_loss=0.3142, pruned_loss=0.09252, over 21812.00 frames. ], tot_loss[loss=0.266, simple_loss=0.3372, pruned_loss=0.09734, over 4269709.29 frames. ], batch size: 118, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:17:47,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=751374.0, ans=0.0 2023-06-20 16:18:28,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=751494.0, ans=0.125 2023-06-20 16:18:29,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=751494.0, ans=0.1 2023-06-20 16:19:02,276 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.071e+02 3.454e+02 4.020e+02 6.852e+02, threshold=6.907e+02, percent-clipped=1.0 2023-06-20 16:19:23,822 INFO [train.py:996] (2/4) Epoch 5, batch 3300, loss[loss=0.2627, simple_loss=0.3496, pruned_loss=0.08784, over 21759.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3315, pruned_loss=0.09561, over 4260789.22 frames. ], batch size: 351, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:19:41,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.97 vs. limit=15.0 2023-06-20 16:20:05,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=751794.0, ans=0.2 2023-06-20 16:21:09,037 INFO [train.py:996] (2/4) Epoch 5, batch 3350, loss[loss=0.282, simple_loss=0.3467, pruned_loss=0.1086, over 21791.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3339, pruned_loss=0.09654, over 4264261.35 frames. 
], batch size: 414, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:22:31,290 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.245e+02 3.953e+02 4.970e+02 1.057e+03, threshold=7.906e+02, percent-clipped=6.0 2023-06-20 16:22:49,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=752214.0, ans=0.05 2023-06-20 16:22:57,038 INFO [train.py:996] (2/4) Epoch 5, batch 3400, loss[loss=0.2631, simple_loss=0.3264, pruned_loss=0.09987, over 21633.00 frames. ], tot_loss[loss=0.2653, simple_loss=0.3349, pruned_loss=0.09778, over 4274068.70 frames. ], batch size: 332, lr: 6.52e-03, grad_scale: 16.0 2023-06-20 16:22:59,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=752274.0, ans=0.2 2023-06-20 16:22:59,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-20 16:23:23,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=752334.0, ans=0.125 2023-06-20 16:23:47,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.11 vs. limit=15.0 2023-06-20 16:24:41,925 INFO [train.py:996] (2/4) Epoch 5, batch 3450, loss[loss=0.4452, simple_loss=0.4856, pruned_loss=0.2024, over 21426.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3306, pruned_loss=0.09719, over 4279013.74 frames. ], batch size: 471, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:25:08,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.26 vs. limit=10.0 2023-06-20 16:25:22,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=752634.0, ans=0.125 2023-06-20 16:26:05,740 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.288e+02 3.795e+02 4.947e+02 8.128e+02, threshold=7.589e+02, percent-clipped=1.0 2023-06-20 16:26:11,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=752814.0, ans=0.125 2023-06-20 16:26:27,186 INFO [train.py:996] (2/4) Epoch 5, batch 3500, loss[loss=0.3014, simple_loss=0.36, pruned_loss=0.1214, over 21246.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3374, pruned_loss=0.1001, over 4280310.85 frames. ], batch size: 159, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:26:31,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=752874.0, ans=0.125 2023-06-20 16:27:01,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=752934.0, ans=0.125 2023-06-20 16:28:10,890 INFO [train.py:996] (2/4) Epoch 5, batch 3550, loss[loss=0.241, simple_loss=0.2955, pruned_loss=0.09324, over 21374.00 frames. ], tot_loss[loss=0.2708, simple_loss=0.3393, pruned_loss=0.1011, over 4280288.90 frames. 
], batch size: 194, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:28:38,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=753234.0, ans=0.125 2023-06-20 16:28:57,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-20 16:28:57,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-20 16:29:35,084 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.272e+02 2.975e+02 3.456e+02 4.245e+02 7.529e+02, threshold=6.912e+02, percent-clipped=0.0 2023-06-20 16:29:49,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=753414.0, ans=0.125 2023-06-20 16:29:49,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=753414.0, ans=0.125 2023-06-20 16:30:01,844 INFO [train.py:996] (2/4) Epoch 5, batch 3600, loss[loss=0.2699, simple_loss=0.3265, pruned_loss=0.1066, over 21660.00 frames. ], tot_loss[loss=0.2677, simple_loss=0.3331, pruned_loss=0.1011, over 4281725.00 frames. ], batch size: 298, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:30:09,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=753474.0, ans=0.125 2023-06-20 16:31:15,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=753654.0, ans=0.1 2023-06-20 16:31:46,503 INFO [train.py:996] (2/4) Epoch 5, batch 3650, loss[loss=0.2323, simple_loss=0.2905, pruned_loss=0.08706, over 21913.00 frames. ], tot_loss[loss=0.2682, simple_loss=0.3343, pruned_loss=0.101, over 4282207.09 frames. ], batch size: 107, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:32:13,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=753834.0, ans=0.1 2023-06-20 16:32:28,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=753894.0, ans=0.125 2023-06-20 16:32:29,480 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.89 vs. 
limit=10.0 2023-06-20 16:32:30,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=753894.0, ans=10.0 2023-06-20 16:32:33,556 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:32:56,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=753954.0, ans=0.04949747468305833 2023-06-20 16:33:02,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 3.073e+02 3.466e+02 4.352e+02 7.872e+02, threshold=6.931e+02, percent-clipped=1.0 2023-06-20 16:33:06,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=754014.0, ans=0.0 2023-06-20 16:33:29,235 INFO [train.py:996] (2/4) Epoch 5, batch 3700, loss[loss=0.2624, simple_loss=0.3317, pruned_loss=0.09653, over 21861.00 frames. ], tot_loss[loss=0.2656, simple_loss=0.3331, pruned_loss=0.0991, over 4286650.87 frames. ], batch size: 332, lr: 6.51e-03, grad_scale: 32.0 2023-06-20 16:33:42,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-20 16:33:58,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=754134.0, ans=0.0 2023-06-20 16:34:04,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-20 16:34:09,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=754134.0, ans=0.0 2023-06-20 16:34:13,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=754194.0, ans=0.125 2023-06-20 16:34:54,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=754314.0, ans=0.125 2023-06-20 16:35:14,142 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:35:18,258 INFO [train.py:996] (2/4) Epoch 5, batch 3750, loss[loss=0.2052, simple_loss=0.2738, pruned_loss=0.06827, over 21872.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3297, pruned_loss=0.0974, over 4287088.17 frames. ], batch size: 124, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:36:13,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=754494.0, ans=0.125 2023-06-20 16:36:36,448 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.776e+02 3.273e+02 3.853e+02 7.611e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-20 16:36:54,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=754614.0, ans=0.125 2023-06-20 16:37:07,432 INFO [train.py:996] (2/4) Epoch 5, batch 3800, loss[loss=0.274, simple_loss=0.3347, pruned_loss=0.1067, over 21824.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.329, pruned_loss=0.09616, over 4283748.58 frames. 
], batch size: 247, lr: 6.51e-03, grad_scale: 16.0 2023-06-20 16:37:26,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=754734.0, ans=0.125 2023-06-20 16:38:28,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-20 16:38:44,998 INFO [train.py:996] (2/4) Epoch 5, batch 3850, loss[loss=0.2663, simple_loss=0.3197, pruned_loss=0.1064, over 21848.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3297, pruned_loss=0.09793, over 4280906.85 frames. ], batch size: 118, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:39:25,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=755034.0, ans=0.1 2023-06-20 16:39:34,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=755094.0, ans=0.2 2023-06-20 16:39:49,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=755154.0, ans=0.0 2023-06-20 16:40:01,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.004e+02 3.563e+02 4.477e+02 7.369e+02, threshold=7.126e+02, percent-clipped=2.0 2023-06-20 16:40:19,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=755214.0, ans=0.1 2023-06-20 16:40:27,659 INFO [train.py:996] (2/4) Epoch 5, batch 3900, loss[loss=0.2401, simple_loss=0.2985, pruned_loss=0.09088, over 21313.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3243, pruned_loss=0.09718, over 4285964.87 frames. ], batch size: 159, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:41:31,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=12.0 2023-06-20 16:41:45,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=755454.0, ans=0.125 2023-06-20 16:42:10,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.60 vs. limit=15.0 2023-06-20 16:42:11,412 INFO [train.py:996] (2/4) Epoch 5, batch 3950, loss[loss=0.2862, simple_loss=0.369, pruned_loss=0.1016, over 20702.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3256, pruned_loss=0.09657, over 4287004.92 frames. ], batch size: 607, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:42:47,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.90 vs. 
limit=15.0 2023-06-20 16:42:49,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=755634.0, ans=0.125 2023-06-20 16:43:06,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=755694.0, ans=0.125 2023-06-20 16:43:22,837 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:43:33,691 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.896e+02 3.603e+02 4.962e+02 8.484e+02, threshold=7.206e+02, percent-clipped=4.0 2023-06-20 16:43:44,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=22.5 2023-06-20 16:43:50,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=755814.0, ans=0.125 2023-06-20 16:43:52,926 INFO [train.py:996] (2/4) Epoch 5, batch 4000, loss[loss=0.2697, simple_loss=0.337, pruned_loss=0.1012, over 20192.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.319, pruned_loss=0.09385, over 4271484.18 frames. ], batch size: 703, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:44:02,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=755874.0, ans=0.125 2023-06-20 16:44:02,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=755874.0, ans=0.0 2023-06-20 16:44:05,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=755874.0, ans=0.1 2023-06-20 16:44:50,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=755994.0, ans=0.0 2023-06-20 16:44:52,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=756054.0, ans=0.0 2023-06-20 16:45:41,101 INFO [train.py:996] (2/4) Epoch 5, batch 4050, loss[loss=0.2336, simple_loss=0.3165, pruned_loss=0.07532, over 21647.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3169, pruned_loss=0.09034, over 4276950.38 frames. ], batch size: 263, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:46:15,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-20 16:46:22,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756294.0, ans=0.1 2023-06-20 16:46:22,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=756294.0, ans=0.0 2023-06-20 16:46:57,483 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.019e+02 2.627e+02 3.094e+02 3.740e+02 6.411e+02, threshold=6.189e+02, percent-clipped=0.0 2023-06-20 16:47:22,992 INFO [train.py:996] (2/4) Epoch 5, batch 4100, loss[loss=0.2862, simple_loss=0.3568, pruned_loss=0.1078, over 21790.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3178, pruned_loss=0.09028, over 4280534.38 frames. ], batch size: 391, lr: 6.50e-03, grad_scale: 32.0 2023-06-20 16:47:24,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.66 vs. 
limit=12.0 2023-06-20 16:48:05,980 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=22.5 2023-06-20 16:48:30,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=756654.0, ans=0.1 2023-06-20 16:48:35,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=756654.0, ans=0.1 2023-06-20 16:49:03,705 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 16:49:06,419 INFO [train.py:996] (2/4) Epoch 5, batch 4150, loss[loss=0.2245, simple_loss=0.3081, pruned_loss=0.07049, over 21576.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3182, pruned_loss=0.08705, over 4264429.19 frames. ], batch size: 230, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:49:15,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=756774.0, ans=0.0 2023-06-20 16:50:26,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.036e+02 2.812e+02 3.283e+02 4.437e+02 7.520e+02, threshold=6.566e+02, percent-clipped=5.0 2023-06-20 16:50:55,601 INFO [train.py:996] (2/4) Epoch 5, batch 4200, loss[loss=0.2271, simple_loss=0.2948, pruned_loss=0.0797, over 21466.00 frames. ], tot_loss[loss=0.246, simple_loss=0.318, pruned_loss=0.087, over 4263208.72 frames. ], batch size: 212, lr: 6.50e-03, grad_scale: 16.0 2023-06-20 16:52:14,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=757254.0, ans=0.0 2023-06-20 16:52:37,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=757314.0, ans=10.0 2023-06-20 16:52:40,376 INFO [train.py:996] (2/4) Epoch 5, batch 4250, loss[loss=0.3075, simple_loss=0.3621, pruned_loss=0.1265, over 21283.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3245, pruned_loss=0.09057, over 4253263.33 frames. ], batch size: 159, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:52:46,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=757374.0, ans=0.125 2023-06-20 16:53:55,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=757554.0, ans=0.2 2023-06-20 16:54:03,355 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.181e+02 2.999e+02 3.547e+02 4.279e+02 1.014e+03, threshold=7.094e+02, percent-clipped=7.0 2023-06-20 16:54:22,753 INFO [train.py:996] (2/4) Epoch 5, batch 4300, loss[loss=0.2267, simple_loss=0.3202, pruned_loss=0.06659, over 21812.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3335, pruned_loss=0.0946, over 4260850.22 frames. ], batch size: 282, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:54:30,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=757674.0, ans=0.0 2023-06-20 16:54:55,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.13 vs. 
limit=5.0 2023-06-20 16:55:42,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=757854.0, ans=0.125 2023-06-20 16:55:49,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=757914.0, ans=0.125 2023-06-20 16:56:17,269 INFO [train.py:996] (2/4) Epoch 5, batch 4350, loss[loss=0.2505, simple_loss=0.3574, pruned_loss=0.07179, over 19911.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3302, pruned_loss=0.09288, over 4261102.33 frames. ], batch size: 702, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 16:57:06,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=758094.0, ans=0.125 2023-06-20 16:57:31,426 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.864e+02 3.148e+02 3.712e+02 7.836e+02, threshold=6.297e+02, percent-clipped=1.0 2023-06-20 16:57:54,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=758214.0, ans=0.2 2023-06-20 16:57:57,338 INFO [train.py:996] (2/4) Epoch 5, batch 4400, loss[loss=0.2222, simple_loss=0.2962, pruned_loss=0.07408, over 21203.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3263, pruned_loss=0.09184, over 4257430.09 frames. ], batch size: 143, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 16:59:02,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=758454.0, ans=0.125 2023-06-20 16:59:16,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=758454.0, ans=0.2 2023-06-20 16:59:39,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.98 vs. limit=10.0 2023-06-20 16:59:41,593 INFO [train.py:996] (2/4) Epoch 5, batch 4450, loss[loss=0.3123, simple_loss=0.3862, pruned_loss=0.1192, over 21865.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3347, pruned_loss=0.09358, over 4259065.13 frames. ], batch size: 371, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 16:59:54,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=758574.0, ans=0.125 2023-06-20 17:00:37,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=758694.0, ans=0.0 2023-06-20 17:01:08,065 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 2.910e+02 3.386e+02 4.171e+02 6.417e+02, threshold=6.772e+02, percent-clipped=2.0 2023-06-20 17:01:32,517 INFO [train.py:996] (2/4) Epoch 5, batch 4500, loss[loss=0.2828, simple_loss=0.3816, pruned_loss=0.09199, over 21255.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3358, pruned_loss=0.09576, over 4258094.20 frames. 
], batch size: 548, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 17:01:46,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=758874.0, ans=0.125 2023-06-20 17:02:14,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=758994.0, ans=0.2 2023-06-20 17:02:36,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=759054.0, ans=0.125 2023-06-20 17:03:10,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=759114.0, ans=0.1 2023-06-20 17:03:17,249 INFO [train.py:996] (2/4) Epoch 5, batch 4550, loss[loss=0.2603, simple_loss=0.3241, pruned_loss=0.09829, over 21316.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3411, pruned_loss=0.09687, over 4266307.28 frames. ], batch size: 551, lr: 6.49e-03, grad_scale: 32.0 2023-06-20 17:04:34,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.512e+02 3.034e+02 3.831e+02 5.015e+02 1.154e+03, threshold=7.663e+02, percent-clipped=6.0 2023-06-20 17:05:00,169 INFO [train.py:996] (2/4) Epoch 5, batch 4600, loss[loss=0.2473, simple_loss=0.3116, pruned_loss=0.09147, over 21290.00 frames. ], tot_loss[loss=0.2699, simple_loss=0.3425, pruned_loss=0.09859, over 4273202.48 frames. ], batch size: 143, lr: 6.49e-03, grad_scale: 16.0 2023-06-20 17:05:42,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=759594.0, ans=0.125 2023-06-20 17:05:44,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=759594.0, ans=0.0 2023-06-20 17:05:48,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=759594.0, ans=0.0 2023-06-20 17:06:37,015 INFO [train.py:996] (2/4) Epoch 5, batch 4650, loss[loss=0.207, simple_loss=0.2734, pruned_loss=0.07029, over 21273.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3351, pruned_loss=0.09599, over 4281690.48 frames. ], batch size: 159, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:07:00,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=759834.0, ans=0.125 2023-06-20 17:07:29,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.52 vs. limit=15.0 2023-06-20 17:07:57,689 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.721e+02 3.118e+02 3.617e+02 7.093e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-20 17:08:14,599 INFO [train.py:996] (2/4) Epoch 5, batch 4700, loss[loss=0.2076, simple_loss=0.2705, pruned_loss=0.0724, over 21525.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3259, pruned_loss=0.09391, over 4280340.94 frames. 
], batch size: 230, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:08:19,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=760074.0, ans=0.1 2023-06-20 17:08:34,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=760134.0, ans=0.125 2023-06-20 17:08:58,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=760194.0, ans=0.0 2023-06-20 17:09:28,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=760254.0, ans=0.125 2023-06-20 17:09:29,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=760254.0, ans=0.125 2023-06-20 17:09:56,849 INFO [train.py:996] (2/4) Epoch 5, batch 4750, loss[loss=0.2712, simple_loss=0.3287, pruned_loss=0.1069, over 21687.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3223, pruned_loss=0.09428, over 4279596.90 frames. ], batch size: 389, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:10:35,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=760494.0, ans=0.125 2023-06-20 17:11:18,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.439e+02 2.865e+02 3.322e+02 3.733e+02 5.818e+02, threshold=6.645e+02, percent-clipped=0.0 2023-06-20 17:11:20,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=15.0 2023-06-20 17:11:34,510 INFO [train.py:996] (2/4) Epoch 5, batch 4800, loss[loss=0.2618, simple_loss=0.3494, pruned_loss=0.08712, over 21786.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3226, pruned_loss=0.09371, over 4287864.32 frames. ], batch size: 282, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:12:24,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=760794.0, ans=0.125 2023-06-20 17:12:32,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=760854.0, ans=0.125 2023-06-20 17:12:34,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=760854.0, ans=0.0 2023-06-20 17:13:06,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=760914.0, ans=0.0 2023-06-20 17:13:15,583 INFO [train.py:996] (2/4) Epoch 5, batch 4850, loss[loss=0.2369, simple_loss=0.3058, pruned_loss=0.08396, over 21284.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3221, pruned_loss=0.09355, over 4289601.07 frames. 
], batch size: 159, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:13:59,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=761094.0, ans=0.1 2023-06-20 17:14:04,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=761094.0, ans=0.2 2023-06-20 17:14:07,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=761094.0, ans=0.0 2023-06-20 17:14:41,977 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.918e+02 2.733e+02 3.099e+02 3.561e+02 5.577e+02, threshold=6.198e+02, percent-clipped=0.0 2023-06-20 17:14:58,675 INFO [train.py:996] (2/4) Epoch 5, batch 4900, loss[loss=0.3232, simple_loss=0.3898, pruned_loss=0.1283, over 21481.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3233, pruned_loss=0.0937, over 4291259.88 frames. ], batch size: 471, lr: 6.48e-03, grad_scale: 32.0 2023-06-20 17:15:03,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=761274.0, ans=0.125 2023-06-20 17:15:11,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-20 17:16:35,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=761514.0, ans=0.125 2023-06-20 17:16:41,623 INFO [train.py:996] (2/4) Epoch 5, batch 4950, loss[loss=0.2183, simple_loss=0.3072, pruned_loss=0.0647, over 21724.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3261, pruned_loss=0.09155, over 4276948.05 frames. ], batch size: 351, lr: 6.48e-03, grad_scale: 16.0 2023-06-20 17:16:43,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=761574.0, ans=0.5 2023-06-20 17:17:14,685 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=30.50 vs. limit=22.5 2023-06-20 17:17:41,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-20 17:17:55,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=761754.0, ans=0.0 2023-06-20 17:18:08,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.944e+02 2.799e+02 3.225e+02 3.689e+02 6.231e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-20 17:18:10,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=761814.0, ans=0.125 2023-06-20 17:18:22,800 INFO [train.py:996] (2/4) Epoch 5, batch 5000, loss[loss=0.3107, simple_loss=0.3585, pruned_loss=0.1314, over 21798.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3256, pruned_loss=0.08908, over 4278933.83 frames. ], batch size: 508, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:19:30,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=762054.0, ans=0.025 2023-06-20 17:20:03,401 INFO [train.py:996] (2/4) Epoch 5, batch 5050, loss[loss=0.2635, simple_loss=0.3203, pruned_loss=0.1033, over 21591.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3268, pruned_loss=0.09152, over 4287022.16 frames. 
], batch size: 212, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:20:12,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=15.0 2023-06-20 17:20:55,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=762294.0, ans=0.0 2023-06-20 17:21:31,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.896e+02 3.588e+02 4.285e+02 7.263e+02, threshold=7.176e+02, percent-clipped=2.0 2023-06-20 17:21:44,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=762474.0, ans=0.0 2023-06-20 17:21:45,598 INFO [train.py:996] (2/4) Epoch 5, batch 5100, loss[loss=0.2412, simple_loss=0.3063, pruned_loss=0.08804, over 21579.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3269, pruned_loss=0.09218, over 4287849.31 frames. ], batch size: 212, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:21:49,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=762474.0, ans=0.0 2023-06-20 17:22:05,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=762534.0, ans=0.125 2023-06-20 17:22:14,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=762534.0, ans=0.125 2023-06-20 17:22:37,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=762594.0, ans=0.0 2023-06-20 17:23:29,430 INFO [train.py:996] (2/4) Epoch 5, batch 5150, loss[loss=0.2827, simple_loss=0.3373, pruned_loss=0.114, over 21776.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3246, pruned_loss=0.09321, over 4290681.54 frames. ], batch size: 441, lr: 6.47e-03, grad_scale: 16.0 2023-06-20 17:24:19,264 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.89 vs. limit=15.0 2023-06-20 17:24:57,421 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.048e+02 2.960e+02 3.348e+02 3.858e+02 5.752e+02, threshold=6.696e+02, percent-clipped=0.0 2023-06-20 17:25:09,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=763014.0, ans=0.2 2023-06-20 17:25:13,096 INFO [train.py:996] (2/4) Epoch 5, batch 5200, loss[loss=0.2601, simple_loss=0.3238, pruned_loss=0.09821, over 21083.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3245, pruned_loss=0.09292, over 4287286.40 frames. ], batch size: 607, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:25:14,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.48 vs. 
limit=15.0 2023-06-20 17:25:49,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=763134.0, ans=0.1 2023-06-20 17:26:23,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=763254.0, ans=0.2 2023-06-20 17:26:35,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=763254.0, ans=0.1 2023-06-20 17:26:50,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=763314.0, ans=0.1 2023-06-20 17:26:54,761 INFO [train.py:996] (2/4) Epoch 5, batch 5250, loss[loss=0.258, simple_loss=0.3324, pruned_loss=0.0918, over 21649.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3273, pruned_loss=0.09102, over 4285739.27 frames. ], batch size: 263, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:26:55,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.34 vs. limit=12.0 2023-06-20 17:27:07,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-20 17:28:04,666 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:28:08,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-20 17:28:19,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=763614.0, ans=0.0 2023-06-20 17:28:21,662 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.952e+02 3.364e+02 4.524e+02 6.907e+02, threshold=6.729e+02, percent-clipped=2.0 2023-06-20 17:28:23,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=763614.0, ans=0.0 2023-06-20 17:28:36,605 INFO [train.py:996] (2/4) Epoch 5, batch 5300, loss[loss=0.2379, simple_loss=0.3093, pruned_loss=0.08329, over 21522.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3271, pruned_loss=0.09258, over 4297592.40 frames. ], batch size: 548, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:29:12,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=763734.0, ans=0.0 2023-06-20 17:29:24,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=763794.0, ans=0.0 2023-06-20 17:29:50,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=763854.0, ans=0.125 2023-06-20 17:30:08,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=763914.0, ans=0.0 2023-06-20 17:30:13,193 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:30:13,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. 
limit=6.0 2023-06-20 17:30:22,056 INFO [train.py:996] (2/4) Epoch 5, batch 5350, loss[loss=0.2772, simple_loss=0.3382, pruned_loss=0.1082, over 21293.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3271, pruned_loss=0.09514, over 4302943.84 frames. ], batch size: 143, lr: 6.47e-03, grad_scale: 32.0 2023-06-20 17:30:34,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=763974.0, ans=0.07 2023-06-20 17:30:57,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=764034.0, ans=0.125 2023-06-20 17:30:59,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=764094.0, ans=15.0 2023-06-20 17:31:27,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.17 vs. limit=12.0 2023-06-20 17:31:43,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=764214.0, ans=0.0 2023-06-20 17:31:44,361 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.105e+02 3.554e+02 4.280e+02 7.043e+02, threshold=7.109e+02, percent-clipped=1.0 2023-06-20 17:31:59,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=764214.0, ans=0.1 2023-06-20 17:32:03,841 INFO [train.py:996] (2/4) Epoch 5, batch 5400, loss[loss=0.2525, simple_loss=0.3046, pruned_loss=0.1002, over 21458.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3272, pruned_loss=0.09573, over 4302378.26 frames. ], batch size: 211, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:32:11,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=764274.0, ans=0.125 2023-06-20 17:32:23,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=764274.0, ans=0.125 2023-06-20 17:32:40,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=764394.0, ans=0.125 2023-06-20 17:32:53,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=764394.0, ans=0.125 2023-06-20 17:33:21,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=764454.0, ans=0.125 2023-06-20 17:33:35,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=764514.0, ans=0.1 2023-06-20 17:33:44,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=764574.0, ans=0.0 2023-06-20 17:33:45,598 INFO [train.py:996] (2/4) Epoch 5, batch 5450, loss[loss=0.2766, simple_loss=0.3792, pruned_loss=0.087, over 21645.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3277, pruned_loss=0.09363, over 4305504.89 frames. 
], batch size: 389, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:34:03,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=764574.0, ans=0.0 2023-06-20 17:34:03,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=764574.0, ans=0.0 2023-06-20 17:34:13,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.16 vs. limit=15.0 2023-06-20 17:34:14,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=764634.0, ans=0.0 2023-06-20 17:34:21,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.46 vs. limit=22.5 2023-06-20 17:34:48,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=764694.0, ans=0.125 2023-06-20 17:35:13,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.778e+02 2.553e+02 3.012e+02 3.713e+02 8.478e+02, threshold=6.025e+02, percent-clipped=4.0 2023-06-20 17:35:34,673 INFO [train.py:996] (2/4) Epoch 5, batch 5500, loss[loss=0.2145, simple_loss=0.3138, pruned_loss=0.05762, over 21799.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3317, pruned_loss=0.09029, over 4305223.05 frames. ], batch size: 282, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:36:25,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=15.0 2023-06-20 17:36:57,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=765114.0, ans=0.035 2023-06-20 17:37:17,270 INFO [train.py:996] (2/4) Epoch 5, batch 5550, loss[loss=0.2008, simple_loss=0.2894, pruned_loss=0.05606, over 21664.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3291, pruned_loss=0.08651, over 4303643.92 frames. ], batch size: 247, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:37:31,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=765174.0, ans=0.1 2023-06-20 17:37:52,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-20 17:38:48,570 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.848e+02 2.754e+02 3.445e+02 4.644e+02 7.344e+02, threshold=6.889e+02, percent-clipped=6.0 2023-06-20 17:39:13,785 INFO [train.py:996] (2/4) Epoch 5, batch 5600, loss[loss=0.2439, simple_loss=0.3196, pruned_loss=0.08414, over 21479.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3275, pruned_loss=0.08421, over 4292516.62 frames. 
], batch size: 548, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:39:50,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=765594.0, ans=0.125 2023-06-20 17:40:22,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=765654.0, ans=0.2 2023-06-20 17:40:25,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=765654.0, ans=0.0 2023-06-20 17:40:55,626 INFO [train.py:996] (2/4) Epoch 5, batch 5650, loss[loss=0.2379, simple_loss=0.3081, pruned_loss=0.0838, over 21932.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3337, pruned_loss=0.08774, over 4285104.27 frames. ], batch size: 316, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:40:59,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=765774.0, ans=0.0 2023-06-20 17:41:09,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=765774.0, ans=0.125 2023-06-20 17:41:31,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=765894.0, ans=0.0 2023-06-20 17:41:33,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=765894.0, ans=0.125 2023-06-20 17:42:02,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-20 17:42:16,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=766014.0, ans=0.2 2023-06-20 17:42:17,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-20 17:42:17,861 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.160e+02 3.200e+02 3.767e+02 5.001e+02 8.912e+02, threshold=7.534e+02, percent-clipped=5.0 2023-06-20 17:42:31,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=15.0 2023-06-20 17:42:38,879 INFO [train.py:996] (2/4) Epoch 5, batch 5700, loss[loss=0.2333, simple_loss=0.3007, pruned_loss=0.083, over 21219.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3306, pruned_loss=0.08878, over 4287842.41 frames. ], batch size: 608, lr: 6.46e-03, grad_scale: 32.0 2023-06-20 17:42:42,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=766074.0, ans=0.0 2023-06-20 17:43:04,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=766134.0, ans=0.0 2023-06-20 17:44:28,611 INFO [train.py:996] (2/4) Epoch 5, batch 5750, loss[loss=0.3735, simple_loss=0.4684, pruned_loss=0.1393, over 21185.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3284, pruned_loss=0.08625, over 4291468.84 frames. 
], batch size: 548, lr: 6.46e-03, grad_scale: 16.0 2023-06-20 17:44:54,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=766434.0, ans=0.1 2023-06-20 17:45:21,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=766494.0, ans=0.0 2023-06-20 17:45:23,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=766494.0, ans=0.125 2023-06-20 17:45:47,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=766554.0, ans=0.125 2023-06-20 17:45:53,256 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.737e+02 3.307e+02 4.353e+02 7.537e+02, threshold=6.613e+02, percent-clipped=1.0 2023-06-20 17:46:07,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=766614.0, ans=0.125 2023-06-20 17:46:11,490 INFO [train.py:996] (2/4) Epoch 5, batch 5800, loss[loss=0.2699, simple_loss=0.3679, pruned_loss=0.08596, over 21642.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3258, pruned_loss=0.08407, over 4282243.00 frames. ], batch size: 389, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:46:16,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.86 vs. limit=15.0 2023-06-20 17:47:24,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=15.0 2023-06-20 17:47:53,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=766974.0, ans=0.09899494936611666 2023-06-20 17:47:54,123 INFO [train.py:996] (2/4) Epoch 5, batch 5850, loss[loss=0.187, simple_loss=0.2921, pruned_loss=0.04092, over 21768.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3218, pruned_loss=0.07885, over 4282207.84 frames. ], batch size: 282, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:48:41,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=767094.0, ans=0.2 2023-06-20 17:48:50,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=767094.0, ans=0.125 2023-06-20 17:48:59,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=767154.0, ans=0.025 2023-06-20 17:49:21,580 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.541e+02 2.196e+02 2.438e+02 2.861e+02 4.189e+02, threshold=4.877e+02, percent-clipped=0.0 2023-06-20 17:49:29,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=767214.0, ans=0.05 2023-06-20 17:49:34,154 INFO [train.py:996] (2/4) Epoch 5, batch 5900, loss[loss=0.1253, simple_loss=0.1835, pruned_loss=0.03352, over 16622.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3151, pruned_loss=0.07392, over 4277482.27 frames. 
], batch size: 60, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:49:36,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=767274.0, ans=0.125 2023-06-20 17:50:58,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=767514.0, ans=0.125 2023-06-20 17:51:09,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=767514.0, ans=0.0 2023-06-20 17:51:14,332 INFO [train.py:996] (2/4) Epoch 5, batch 5950, loss[loss=0.2319, simple_loss=0.2837, pruned_loss=0.09003, over 20241.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.315, pruned_loss=0.07814, over 4280164.83 frames. ], batch size: 703, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:51:56,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-20 17:52:17,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=767754.0, ans=0.0 2023-06-20 17:52:42,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 3.088e+02 3.712e+02 4.428e+02 7.411e+02, threshold=7.424e+02, percent-clipped=12.0 2023-06-20 17:53:01,228 INFO [train.py:996] (2/4) Epoch 5, batch 6000, loss[loss=0.2035, simple_loss=0.2647, pruned_loss=0.0711, over 21629.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3117, pruned_loss=0.08166, over 4277468.79 frames. ], batch size: 264, lr: 6.45e-03, grad_scale: 32.0 2023-06-20 17:53:01,229 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 17:53:19,518 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2687, simple_loss=0.3621, pruned_loss=0.08766, over 1796401.00 frames. 2023-06-20 17:53:19,519 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 17:53:34,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=767874.0, ans=0.1 2023-06-20 17:53:41,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=767934.0, ans=0.125 2023-06-20 17:53:53,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=767934.0, ans=0.125 2023-06-20 17:53:55,170 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=22.5 2023-06-20 17:54:52,075 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-20 17:55:11,123 INFO [train.py:996] (2/4) Epoch 5, batch 6050, loss[loss=0.1969, simple_loss=0.2635, pruned_loss=0.06512, over 21478.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3064, pruned_loss=0.08287, over 4280257.09 frames. 
], batch size: 212, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:55:14,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=768174.0, ans=0.0 2023-06-20 17:55:51,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=768294.0, ans=0.125 2023-06-20 17:55:54,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=768294.0, ans=0.2 2023-06-20 17:56:11,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=768354.0, ans=0.09899494936611666 2023-06-20 17:56:30,091 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.566e+02 3.006e+02 3.910e+02 6.691e+02, threshold=6.013e+02, percent-clipped=0.0 2023-06-20 17:56:46,432 INFO [train.py:996] (2/4) Epoch 5, batch 6100, loss[loss=0.2498, simple_loss=0.3178, pruned_loss=0.09093, over 21793.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3053, pruned_loss=0.0817, over 4264244.28 frames. ], batch size: 282, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:56:46,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=768474.0, ans=0.0 2023-06-20 17:56:48,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=768474.0, ans=0.1 2023-06-20 17:56:53,200 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 17:57:16,012 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=15.0 2023-06-20 17:57:34,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-20 17:57:36,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=768594.0, ans=0.5 2023-06-20 17:58:04,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.69 vs. limit=10.0 2023-06-20 17:58:20,841 INFO [train.py:996] (2/4) Epoch 5, batch 6150, loss[loss=0.2223, simple_loss=0.3012, pruned_loss=0.07165, over 21730.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3108, pruned_loss=0.08567, over 4266504.62 frames. ], batch size: 282, lr: 6.45e-03, grad_scale: 16.0 2023-06-20 17:58:38,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=768774.0, ans=0.1 2023-06-20 17:59:19,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. 
limit=15.0 2023-06-20 17:59:51,303 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.097e+02 2.776e+02 3.188e+02 3.842e+02 5.972e+02, threshold=6.377e+02, percent-clipped=0.0 2023-06-20 17:59:53,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=769014.0, ans=0.0 2023-06-20 17:59:55,501 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:00:08,579 INFO [train.py:996] (2/4) Epoch 5, batch 6200, loss[loss=0.258, simple_loss=0.3297, pruned_loss=0.09321, over 21505.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3137, pruned_loss=0.08624, over 4268152.38 frames. ], batch size: 441, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:00:12,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.68 vs. limit=15.0 2023-06-20 18:00:19,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=769074.0, ans=22.5 2023-06-20 18:01:29,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=769314.0, ans=0.0 2023-06-20 18:01:30,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.31 vs. limit=15.0 2023-06-20 18:01:47,620 INFO [train.py:996] (2/4) Epoch 5, batch 6250, loss[loss=0.2535, simple_loss=0.3403, pruned_loss=0.08331, over 21392.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3192, pruned_loss=0.08578, over 4276801.88 frames. ], batch size: 194, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:03:11,655 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.664e+02 3.120e+02 3.845e+02 7.013e+02, threshold=6.240e+02, percent-clipped=3.0 2023-06-20 18:03:25,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=769614.0, ans=0.2 2023-06-20 18:03:28,186 INFO [train.py:996] (2/4) Epoch 5, batch 6300, loss[loss=0.2689, simple_loss=0.3289, pruned_loss=0.1045, over 21902.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3222, pruned_loss=0.08435, over 4282061.12 frames. ], batch size: 107, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:05:06,365 INFO [train.py:996] (2/4) Epoch 5, batch 6350, loss[loss=0.2879, simple_loss=0.3388, pruned_loss=0.1185, over 21581.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3262, pruned_loss=0.0886, over 4284099.64 frames. ], batch size: 548, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:05:23,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=769974.0, ans=0.125 2023-06-20 18:05:30,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.38 vs. limit=15.0 2023-06-20 18:06:34,765 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 2.998e+02 3.510e+02 4.011e+02 7.678e+02, threshold=7.020e+02, percent-clipped=3.0 2023-06-20 18:06:51,049 INFO [train.py:996] (2/4) Epoch 5, batch 6400, loss[loss=0.2909, simple_loss=0.356, pruned_loss=0.113, over 21352.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3344, pruned_loss=0.09367, over 4280791.67 frames. 
], batch size: 131, lr: 6.44e-03, grad_scale: 32.0 2023-06-20 18:07:15,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=770334.0, ans=0.1 2023-06-20 18:07:33,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=770394.0, ans=0.125 2023-06-20 18:07:40,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.13 vs. limit=15.0 2023-06-20 18:07:41,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=770394.0, ans=0.07 2023-06-20 18:07:54,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=770454.0, ans=0.05 2023-06-20 18:08:02,221 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-20 18:08:11,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=770514.0, ans=0.125 2023-06-20 18:08:17,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=770514.0, ans=0.0 2023-06-20 18:08:33,242 INFO [train.py:996] (2/4) Epoch 5, batch 6450, loss[loss=0.2119, simple_loss=0.2983, pruned_loss=0.06275, over 21602.00 frames. ], tot_loss[loss=0.2622, simple_loss=0.3371, pruned_loss=0.09366, over 4281354.89 frames. ], batch size: 230, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:08:58,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=770634.0, ans=0.1 2023-06-20 18:09:05,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.43 vs. limit=22.5 2023-06-20 18:09:06,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=770634.0, ans=0.1 2023-06-20 18:09:54,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=770814.0, ans=0.0 2023-06-20 18:10:05,743 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.819e+02 3.379e+02 4.009e+02 7.496e+02, threshold=6.759e+02, percent-clipped=3.0 2023-06-20 18:10:13,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=770814.0, ans=0.125 2023-06-20 18:10:15,979 INFO [train.py:996] (2/4) Epoch 5, batch 6500, loss[loss=0.2308, simple_loss=0.3219, pruned_loss=0.06987, over 21577.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3304, pruned_loss=0.09255, over 4277150.58 frames. ], batch size: 389, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:10:40,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=770934.0, ans=0.0 2023-06-20 18:10:41,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.12 vs. 
limit=10.0 2023-06-20 18:10:43,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=770934.0, ans=0.1 2023-06-20 18:10:47,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=770934.0, ans=0.125 2023-06-20 18:11:09,353 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-20 18:11:11,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=22.5 2023-06-20 18:11:19,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=771054.0, ans=0.125 2023-06-20 18:11:46,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=771114.0, ans=0.0 2023-06-20 18:12:02,389 INFO [train.py:996] (2/4) Epoch 5, batch 6550, loss[loss=0.238, simple_loss=0.3027, pruned_loss=0.08663, over 21249.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3293, pruned_loss=0.0922, over 4279657.21 frames. ], batch size: 176, lr: 6.44e-03, grad_scale: 16.0 2023-06-20 18:12:08,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=771174.0, ans=0.125 2023-06-20 18:12:16,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=771174.0, ans=0.125 2023-06-20 18:12:26,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=771234.0, ans=0.125 2023-06-20 18:12:58,098 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.96 vs. limit=15.0 2023-06-20 18:13:25,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=771414.0, ans=0.125 2023-06-20 18:13:31,237 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.995e+02 2.772e+02 3.430e+02 4.140e+02 7.576e+02, threshold=6.860e+02, percent-clipped=2.0 2023-06-20 18:13:49,233 INFO [train.py:996] (2/4) Epoch 5, batch 6600, loss[loss=0.2709, simple_loss=0.3789, pruned_loss=0.08147, over 19905.00 frames. ], tot_loss[loss=0.254, simple_loss=0.324, pruned_loss=0.092, over 4277261.41 frames. ], batch size: 703, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:15:31,884 INFO [train.py:996] (2/4) Epoch 5, batch 6650, loss[loss=0.2066, simple_loss=0.2582, pruned_loss=0.07744, over 21176.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3156, pruned_loss=0.08909, over 4267534.87 frames. 
], batch size: 548, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:16:10,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=771894.0, ans=0.0 2023-06-20 18:16:12,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=771894.0, ans=0.04949747468305833 2023-06-20 18:17:04,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.116e+02 2.627e+02 3.141e+02 4.437e+02 8.167e+02, threshold=6.282e+02, percent-clipped=6.0 2023-06-20 18:17:12,619 INFO [train.py:996] (2/4) Epoch 5, batch 6700, loss[loss=0.2121, simple_loss=0.2763, pruned_loss=0.07394, over 21499.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3095, pruned_loss=0.08737, over 4268469.70 frames. ], batch size: 195, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:17:18,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=22.5 2023-06-20 18:17:42,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.36 vs. limit=15.0 2023-06-20 18:17:43,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=772134.0, ans=0.1 2023-06-20 18:18:15,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=772254.0, ans=0.0 2023-06-20 18:18:52,293 INFO [train.py:996] (2/4) Epoch 5, batch 6750, loss[loss=0.2669, simple_loss=0.3174, pruned_loss=0.1082, over 21650.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3086, pruned_loss=0.08835, over 4262920.99 frames. ], batch size: 247, lr: 6.43e-03, grad_scale: 8.0 2023-06-20 18:19:04,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-20 18:19:12,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=772434.0, ans=0.1 2023-06-20 18:19:48,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=772554.0, ans=0.125 2023-06-20 18:20:15,320 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 2.981e+02 3.524e+02 4.393e+02 7.808e+02, threshold=7.048e+02, percent-clipped=4.0 2023-06-20 18:20:33,915 INFO [train.py:996] (2/4) Epoch 5, batch 6800, loss[loss=0.277, simple_loss=0.3333, pruned_loss=0.1104, over 21410.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3118, pruned_loss=0.09146, over 4274156.68 frames. ], batch size: 548, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:20:38,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-20 18:20:55,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=772734.0, ans=0.0 2023-06-20 18:22:04,526 INFO [train.py:996] (2/4) Epoch 5, batch 6850, loss[loss=0.2339, simple_loss=0.2926, pruned_loss=0.08762, over 21279.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3082, pruned_loss=0.0918, over 4274603.22 frames. 
], batch size: 144, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:22:21,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=772974.0, ans=0.1 2023-06-20 18:22:58,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=773094.0, ans=0.125 2023-06-20 18:23:27,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=773214.0, ans=0.0 2023-06-20 18:23:32,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=773214.0, ans=0.0 2023-06-20 18:23:32,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=773214.0, ans=0.1 2023-06-20 18:23:35,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.254e+02 2.843e+02 3.245e+02 3.952e+02 6.473e+02, threshold=6.489e+02, percent-clipped=0.0 2023-06-20 18:23:35,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=773214.0, ans=0.1 2023-06-20 18:23:52,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=773274.0, ans=0.0 2023-06-20 18:23:53,367 INFO [train.py:996] (2/4) Epoch 5, batch 6900, loss[loss=0.3252, simple_loss=0.4447, pruned_loss=0.1029, over 19810.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3116, pruned_loss=0.09156, over 4277243.67 frames. ], batch size: 702, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:24:09,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.78 vs. limit=15.0 2023-06-20 18:24:30,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=773394.0, ans=0.2 2023-06-20 18:24:57,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=773454.0, ans=0.125 2023-06-20 18:25:31,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=773514.0, ans=0.125 2023-06-20 18:25:37,592 INFO [train.py:996] (2/4) Epoch 5, batch 6950, loss[loss=0.2761, simple_loss=0.3468, pruned_loss=0.1027, over 21867.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3116, pruned_loss=0.08746, over 4285042.89 frames. ], batch size: 371, lr: 6.43e-03, grad_scale: 16.0 2023-06-20 18:25:38,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=773574.0, ans=0.0 2023-06-20 18:26:24,769 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.97 vs. limit=22.5 2023-06-20 18:26:30,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=773754.0, ans=0.0 2023-06-20 18:26:34,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=13.70 vs. 
limit=15.0 2023-06-20 18:27:11,063 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.901e+02 2.952e+02 3.293e+02 4.286e+02 8.056e+02, threshold=6.585e+02, percent-clipped=5.0 2023-06-20 18:27:16,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=773814.0, ans=0.1 2023-06-20 18:27:19,081 INFO [train.py:996] (2/4) Epoch 5, batch 7000, loss[loss=0.2661, simple_loss=0.3106, pruned_loss=0.1108, over 21518.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3154, pruned_loss=0.09062, over 4283908.97 frames. ], batch size: 441, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:27:32,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-20 18:27:38,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=773874.0, ans=0.125 2023-06-20 18:27:52,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=773934.0, ans=0.025 2023-06-20 18:28:08,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=773994.0, ans=0.0 2023-06-20 18:28:50,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=774114.0, ans=0.2 2023-06-20 18:29:08,823 INFO [train.py:996] (2/4) Epoch 5, batch 7050, loss[loss=0.2366, simple_loss=0.3063, pruned_loss=0.08349, over 21482.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3125, pruned_loss=0.088, over 4280370.56 frames. ], batch size: 389, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:29:19,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=774174.0, ans=0.125 2023-06-20 18:29:25,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=774234.0, ans=0.125 2023-06-20 18:29:37,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=774234.0, ans=0.2 2023-06-20 18:30:00,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=774294.0, ans=0.2 2023-06-20 18:30:28,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=774354.0, ans=0.0 2023-06-20 18:30:31,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=774414.0, ans=0.0 2023-06-20 18:30:41,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=774414.0, ans=0.1 2023-06-20 18:30:44,368 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.057e+02 2.864e+02 3.382e+02 4.312e+02 8.915e+02, threshold=6.764e+02, percent-clipped=3.0 2023-06-20 18:30:52,682 INFO [train.py:996] (2/4) Epoch 5, batch 7100, loss[loss=0.2887, simple_loss=0.3547, pruned_loss=0.1113, over 21291.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3163, pruned_loss=0.0893, over 4280412.34 frames. 
], batch size: 143, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:31:01,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=774474.0, ans=0.125 2023-06-20 18:31:28,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=774594.0, ans=0.125 2023-06-20 18:32:17,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-20 18:32:31,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=774714.0, ans=0.2 2023-06-20 18:32:34,562 INFO [train.py:996] (2/4) Epoch 5, batch 7150, loss[loss=0.2538, simple_loss=0.3288, pruned_loss=0.08943, over 21748.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3129, pruned_loss=0.08678, over 4268457.43 frames. ], batch size: 298, lr: 6.42e-03, grad_scale: 16.0 2023-06-20 18:32:46,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=774774.0, ans=0.125 2023-06-20 18:33:04,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=774834.0, ans=0.0 2023-06-20 18:33:06,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=774834.0, ans=0.125 2023-06-20 18:34:04,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-20 18:34:08,289 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.943e+02 2.836e+02 3.355e+02 3.927e+02 6.037e+02, threshold=6.711e+02, percent-clipped=0.0 2023-06-20 18:34:16,164 INFO [train.py:996] (2/4) Epoch 5, batch 7200, loss[loss=0.2465, simple_loss=0.3456, pruned_loss=0.0737, over 20742.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.317, pruned_loss=0.09002, over 4268933.74 frames. ], batch size: 607, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:34:23,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=775074.0, ans=0.0 2023-06-20 18:35:18,012 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.359e-02 2023-06-20 18:35:41,619 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:35:49,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=775314.0, ans=0.125 2023-06-20 18:35:58,274 INFO [train.py:996] (2/4) Epoch 5, batch 7250, loss[loss=0.2278, simple_loss=0.2859, pruned_loss=0.0849, over 21890.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.313, pruned_loss=0.09031, over 4271896.13 frames. 
], batch size: 373, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:36:00,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=775374.0, ans=0.125 2023-06-20 18:36:22,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=775434.0, ans=0.0 2023-06-20 18:36:26,611 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:37:01,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=775554.0, ans=0.125 2023-06-20 18:37:13,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=775554.0, ans=0.0 2023-06-20 18:37:14,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=775554.0, ans=0.125 2023-06-20 18:37:27,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=775614.0, ans=0.125 2023-06-20 18:37:29,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=775614.0, ans=0.2 2023-06-20 18:37:32,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 2.761e+02 3.389e+02 4.055e+02 6.932e+02, threshold=6.778e+02, percent-clipped=1.0 2023-06-20 18:37:40,436 INFO [train.py:996] (2/4) Epoch 5, batch 7300, loss[loss=0.233, simple_loss=0.2924, pruned_loss=0.08685, over 21900.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3077, pruned_loss=0.089, over 4259991.51 frames. ], batch size: 125, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:38:01,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=775734.0, ans=0.125 2023-06-20 18:38:21,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=775734.0, ans=0.125 2023-06-20 18:38:56,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=775854.0, ans=0.0 2023-06-20 18:39:13,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=775914.0, ans=0.125 2023-06-20 18:39:17,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=14.04 vs. limit=22.5 2023-06-20 18:39:25,186 INFO [train.py:996] (2/4) Epoch 5, batch 7350, loss[loss=0.2885, simple_loss=0.3535, pruned_loss=0.1118, over 21485.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3055, pruned_loss=0.09007, over 4262647.78 frames. ], batch size: 131, lr: 6.42e-03, grad_scale: 32.0 2023-06-20 18:39:42,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=775974.0, ans=0.2 2023-06-20 18:40:20,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.85 vs. 
limit=22.5 2023-06-20 18:40:52,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=776214.0, ans=0.125 2023-06-20 18:41:01,636 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 3.040e+02 3.793e+02 4.434e+02 6.655e+02, threshold=7.586e+02, percent-clipped=0.0 2023-06-20 18:41:06,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.09 vs. limit=22.5 2023-06-20 18:41:09,301 INFO [train.py:996] (2/4) Epoch 5, batch 7400, loss[loss=0.2806, simple_loss=0.3478, pruned_loss=0.1067, over 21764.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3111, pruned_loss=0.09238, over 4261640.44 frames. ], batch size: 441, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:41:17,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=776274.0, ans=0.125 2023-06-20 18:41:31,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=776274.0, ans=0.2 2023-06-20 18:42:08,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.06 vs. limit=15.0 2023-06-20 18:42:16,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-20 18:42:22,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=776454.0, ans=0.0 2023-06-20 18:42:37,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=776514.0, ans=0.2 2023-06-20 18:42:58,419 INFO [train.py:996] (2/4) Epoch 5, batch 7450, loss[loss=0.2508, simple_loss=0.345, pruned_loss=0.07831, over 21557.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3101, pruned_loss=0.09071, over 4263153.24 frames. ], batch size: 441, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:43:27,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=776634.0, ans=0.125 2023-06-20 18:44:03,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=776754.0, ans=0.04949747468305833 2023-06-20 18:44:07,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=776754.0, ans=0.125 2023-06-20 18:44:34,835 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.318e+02 2.956e+02 3.297e+02 4.311e+02 7.109e+02, threshold=6.593e+02, percent-clipped=0.0 2023-06-20 18:44:43,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.52 vs. limit=22.5 2023-06-20 18:44:48,787 INFO [train.py:996] (2/4) Epoch 5, batch 7500, loss[loss=0.2507, simple_loss=0.3295, pruned_loss=0.08595, over 21235.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3134, pruned_loss=0.09145, over 4269528.56 frames. 
], batch size: 159, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:44:57,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=776874.0, ans=0.1 2023-06-20 18:45:04,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=776874.0, ans=0.125 2023-06-20 18:45:08,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=776874.0, ans=0.2 2023-06-20 18:45:58,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=777054.0, ans=0.125 2023-06-20 18:46:05,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=777054.0, ans=0.1 2023-06-20 18:46:27,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=777114.0, ans=0.125 2023-06-20 18:46:31,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=777114.0, ans=0.0 2023-06-20 18:46:33,794 INFO [train.py:996] (2/4) Epoch 5, batch 7550, loss[loss=0.2038, simple_loss=0.2965, pruned_loss=0.05559, over 21651.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3222, pruned_loss=0.09089, over 4260105.26 frames. ], batch size: 263, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:48:01,295 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.164e+02 2.804e+02 3.228e+02 4.067e+02 8.299e+02, threshold=6.455e+02, percent-clipped=1.0 2023-06-20 18:48:11,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=777414.0, ans=0.0 2023-06-20 18:48:14,636 INFO [train.py:996] (2/4) Epoch 5, batch 7600, loss[loss=0.2585, simple_loss=0.3246, pruned_loss=0.09624, over 21895.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3212, pruned_loss=0.08922, over 4270214.43 frames. ], batch size: 332, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:48:17,083 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 18:48:48,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.52 vs. limit=15.0 2023-06-20 18:50:01,086 INFO [train.py:996] (2/4) Epoch 5, batch 7650, loss[loss=0.2698, simple_loss=0.3326, pruned_loss=0.1035, over 21917.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3208, pruned_loss=0.09203, over 4280160.77 frames. ], batch size: 107, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:50:17,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=15.0 2023-06-20 18:50:30,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=777834.0, ans=0.0 2023-06-20 18:51:12,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=777954.0, ans=0.125 2023-06-20 18:51:27,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.17 vs. 
limit=22.5 2023-06-20 18:51:36,502 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 3.076e+02 3.625e+02 4.419e+02 8.627e+02, threshold=7.249e+02, percent-clipped=2.0 2023-06-20 18:51:44,649 INFO [train.py:996] (2/4) Epoch 5, batch 7700, loss[loss=0.2852, simple_loss=0.3532, pruned_loss=0.1086, over 21431.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3244, pruned_loss=0.09538, over 4285364.72 frames. ], batch size: 131, lr: 6.41e-03, grad_scale: 32.0 2023-06-20 18:52:12,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=778134.0, ans=0.025 2023-06-20 18:53:16,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=778314.0, ans=10.0 2023-06-20 18:53:35,141 INFO [train.py:996] (2/4) Epoch 5, batch 7750, loss[loss=0.2676, simple_loss=0.3477, pruned_loss=0.09378, over 21289.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3317, pruned_loss=0.09552, over 4285024.00 frames. ], batch size: 176, lr: 6.41e-03, grad_scale: 16.0 2023-06-20 18:53:37,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=778374.0, ans=0.07 2023-06-20 18:53:54,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=778434.0, ans=0.1 2023-06-20 18:54:16,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=778494.0, ans=0.125 2023-06-20 18:54:35,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=778554.0, ans=0.125 2023-06-20 18:55:12,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=778614.0, ans=0.0 2023-06-20 18:55:13,723 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 2.997e+02 3.414e+02 4.185e+02 6.372e+02, threshold=6.827e+02, percent-clipped=0.0 2023-06-20 18:55:19,860 INFO [train.py:996] (2/4) Epoch 5, batch 7800, loss[loss=0.1957, simple_loss=0.2711, pruned_loss=0.06018, over 21408.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.3323, pruned_loss=0.09506, over 4277685.92 frames. ], batch size: 194, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:55:23,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=778674.0, ans=0.125 2023-06-20 18:56:57,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=778914.0, ans=0.125 2023-06-20 18:57:03,282 INFO [train.py:996] (2/4) Epoch 5, batch 7850, loss[loss=0.2458, simple_loss=0.295, pruned_loss=0.09832, over 21345.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3269, pruned_loss=0.09499, over 4266927.18 frames. ], batch size: 473, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:57:13,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=778974.0, ans=0.0 2023-06-20 18:58:41,653 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 2.807e+02 3.191e+02 4.000e+02 6.084e+02, threshold=6.382e+02, percent-clipped=0.0 2023-06-20 18:58:48,747 INFO [train.py:996] (2/4) Epoch 5, batch 7900, loss[loss=0.2449, simple_loss=0.2975, pruned_loss=0.09614, over 21140.00 frames. 
], tot_loss[loss=0.2555, simple_loss=0.3225, pruned_loss=0.09424, over 4258065.60 frames. ], batch size: 143, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 18:59:04,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=779334.0, ans=0.125 2023-06-20 18:59:17,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=779334.0, ans=0.125 2023-06-20 18:59:40,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=779394.0, ans=0.125 2023-06-20 19:00:04,912 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-20 19:00:13,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=22.5 2023-06-20 19:00:34,084 INFO [train.py:996] (2/4) Epoch 5, batch 7950, loss[loss=0.2616, simple_loss=0.3658, pruned_loss=0.07866, over 21337.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3259, pruned_loss=0.09278, over 4257384.56 frames. ], batch size: 548, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:00:46,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=779574.0, ans=0.1 2023-06-20 19:00:46,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-20 19:01:25,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=15.0 2023-06-20 19:01:39,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=779754.0, ans=0.2 2023-06-20 19:02:04,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-20 19:02:08,272 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 3.095e+02 3.551e+02 4.583e+02 8.567e+02, threshold=7.102e+02, percent-clipped=4.0 2023-06-20 19:02:14,723 INFO [train.py:996] (2/4) Epoch 5, batch 8000, loss[loss=0.2542, simple_loss=0.3102, pruned_loss=0.09915, over 16810.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3267, pruned_loss=0.09428, over 4259453.70 frames. ], batch size: 61, lr: 6.40e-03, grad_scale: 32.0 2023-06-20 19:03:07,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=12.0 2023-06-20 19:03:09,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-20 19:03:41,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=780114.0, ans=0.1 2023-06-20 19:03:51,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=780114.0, ans=0.1 2023-06-20 19:04:08,254 INFO [train.py:996] (2/4) Epoch 5, batch 8050, loss[loss=0.2574, simple_loss=0.3345, pruned_loss=0.0901, over 21863.00 frames. 
], tot_loss[loss=0.2597, simple_loss=0.3302, pruned_loss=0.09463, over 4259710.81 frames. ], batch size: 317, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:04:17,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=780174.0, ans=0.0 2023-06-20 19:04:17,470 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.28 vs. limit=15.0 2023-06-20 19:04:39,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-20 19:04:49,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=780234.0, ans=0.0 2023-06-20 19:04:52,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=780294.0, ans=0.0 2023-06-20 19:05:10,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=780354.0, ans=0.0 2023-06-20 19:05:46,956 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 3.429e+02 3.996e+02 5.275e+02 1.132e+03, threshold=7.992e+02, percent-clipped=3.0 2023-06-20 19:05:52,196 INFO [train.py:996] (2/4) Epoch 5, batch 8100, loss[loss=0.262, simple_loss=0.326, pruned_loss=0.09897, over 21721.00 frames. ], tot_loss[loss=0.261, simple_loss=0.3303, pruned_loss=0.09587, over 4261120.86 frames. ], batch size: 389, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:06:06,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=780474.0, ans=0.04949747468305833 2023-06-20 19:06:31,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=780534.0, ans=0.1 2023-06-20 19:07:49,539 INFO [train.py:996] (2/4) Epoch 5, batch 8150, loss[loss=0.2206, simple_loss=0.298, pruned_loss=0.07157, over 21460.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3386, pruned_loss=0.09698, over 4265316.22 frames. ], batch size: 212, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:08:49,705 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:08:51,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=780954.0, ans=0.125 2023-06-20 19:09:27,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.029e+02 3.506e+02 4.177e+02 7.466e+02, threshold=7.011e+02, percent-clipped=0.0 2023-06-20 19:09:29,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=781014.0, ans=0.125 2023-06-20 19:09:32,501 INFO [train.py:996] (2/4) Epoch 5, batch 8200, loss[loss=0.2777, simple_loss=0.3122, pruned_loss=0.1216, over 21491.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3316, pruned_loss=0.09471, over 4251778.38 frames. 
], batch size: 511, lr: 6.40e-03, grad_scale: 16.0 2023-06-20 19:09:42,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=781074.0, ans=0.0 2023-06-20 19:09:57,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=781134.0, ans=0.125 2023-06-20 19:10:29,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.84 vs. limit=15.0 2023-06-20 19:10:41,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=781254.0, ans=0.125 2023-06-20 19:10:49,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=781254.0, ans=0.2 2023-06-20 19:11:15,888 INFO [train.py:996] (2/4) Epoch 5, batch 8250, loss[loss=0.2182, simple_loss=0.3084, pruned_loss=0.06399, over 21780.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3313, pruned_loss=0.0947, over 4263725.97 frames. ], batch size: 282, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:12:10,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=781494.0, ans=10.0 2023-06-20 19:12:40,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=781554.0, ans=0.125 2023-06-20 19:12:42,516 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:12:56,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.026e+02 2.737e+02 3.159e+02 4.146e+02 7.904e+02, threshold=6.318e+02, percent-clipped=1.0 2023-06-20 19:12:59,868 INFO [train.py:996] (2/4) Epoch 5, batch 8300, loss[loss=0.2481, simple_loss=0.3273, pruned_loss=0.0844, over 21722.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3294, pruned_loss=0.09185, over 4271603.44 frames. ], batch size: 298, lr: 6.39e-03, grad_scale: 8.0 2023-06-20 19:13:03,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=781674.0, ans=0.125 2023-06-20 19:14:00,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=781854.0, ans=0.125 2023-06-20 19:14:23,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-20 19:14:23,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=781854.0, ans=0.1 2023-06-20 19:14:43,859 INFO [train.py:996] (2/4) Epoch 5, batch 8350, loss[loss=0.2394, simple_loss=0.3292, pruned_loss=0.07484, over 20764.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3263, pruned_loss=0.08919, over 4273548.33 frames. ], batch size: 607, lr: 6.39e-03, grad_scale: 8.0 2023-06-20 19:14:46,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.83 vs. 
limit=15.0 2023-06-20 19:14:49,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=781974.0, ans=0.0 2023-06-20 19:14:51,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-20 19:14:59,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=782034.0, ans=0.0 2023-06-20 19:15:23,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=15.0 2023-06-20 19:15:26,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=782094.0, ans=0.0 2023-06-20 19:15:49,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=782154.0, ans=0.125 2023-06-20 19:16:10,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=782214.0, ans=0.2 2023-06-20 19:16:18,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.977e+02 2.671e+02 3.158e+02 4.298e+02 8.367e+02, threshold=6.316e+02, percent-clipped=9.0 2023-06-20 19:16:21,601 INFO [train.py:996] (2/4) Epoch 5, batch 8400, loss[loss=0.183, simple_loss=0.2608, pruned_loss=0.05256, over 21245.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3244, pruned_loss=0.0873, over 4275349.75 frames. ], batch size: 176, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:16:35,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=782274.0, ans=0.1 2023-06-20 19:17:25,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=782454.0, ans=0.1 2023-06-20 19:17:40,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=782454.0, ans=0.0 2023-06-20 19:17:40,463 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:17:40,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=782454.0, ans=0.125 2023-06-20 19:17:52,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.87 vs. limit=15.0 2023-06-20 19:18:01,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=782514.0, ans=0.0 2023-06-20 19:18:06,504 INFO [train.py:996] (2/4) Epoch 5, batch 8450, loss[loss=0.2266, simple_loss=0.2979, pruned_loss=0.07762, over 21818.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3223, pruned_loss=0.08653, over 4284415.69 frames. 
], batch size: 333, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:18:10,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=782574.0, ans=0.125 2023-06-20 19:19:18,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=782754.0, ans=0.2 2023-06-20 19:19:28,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=782754.0, ans=0.0 2023-06-20 19:19:41,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.54 vs. limit=15.0 2023-06-20 19:19:46,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.787e+02 3.267e+02 4.084e+02 6.258e+02, threshold=6.534e+02, percent-clipped=0.0 2023-06-20 19:19:48,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=782874.0, ans=0.2 2023-06-20 19:19:49,896 INFO [train.py:996] (2/4) Epoch 5, batch 8500, loss[loss=0.2261, simple_loss=0.2933, pruned_loss=0.07949, over 21977.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3201, pruned_loss=0.08821, over 4266451.27 frames. ], batch size: 103, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:19:50,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=782874.0, ans=0.2 2023-06-20 19:19:57,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=12.0 2023-06-20 19:20:00,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=782874.0, ans=0.125 2023-06-20 19:20:50,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=782994.0, ans=0.125 2023-06-20 19:20:54,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=783054.0, ans=0.04949747468305833 2023-06-20 19:21:18,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=783114.0, ans=0.125 2023-06-20 19:21:35,051 INFO [train.py:996] (2/4) Epoch 5, batch 8550, loss[loss=0.2411, simple_loss=0.3213, pruned_loss=0.08043, over 21418.00 frames. ], tot_loss[loss=0.252, simple_loss=0.323, pruned_loss=0.09052, over 4269097.88 frames. ], batch size: 194, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:21:36,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=783174.0, ans=0.125 2023-06-20 19:23:16,134 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 2.988e+02 3.421e+02 4.230e+02 6.048e+02, threshold=6.842e+02, percent-clipped=0.0 2023-06-20 19:23:19,480 INFO [train.py:996] (2/4) Epoch 5, batch 8600, loss[loss=0.2835, simple_loss=0.3604, pruned_loss=0.1033, over 21755.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3309, pruned_loss=0.09267, over 4272381.65 frames. 
], batch size: 332, lr: 6.39e-03, grad_scale: 16.0 2023-06-20 19:23:27,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=783474.0, ans=0.125 2023-06-20 19:23:44,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=783474.0, ans=0.1 2023-06-20 19:23:50,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=783534.0, ans=15.0 2023-06-20 19:23:52,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=783534.0, ans=10.0 2023-06-20 19:23:54,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=783534.0, ans=0.125 2023-06-20 19:23:55,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.73 vs. limit=15.0 2023-06-20 19:24:21,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=783594.0, ans=0.2 2023-06-20 19:24:52,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=783714.0, ans=0.1 2023-06-20 19:24:52,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=783714.0, ans=0.0 2023-06-20 19:25:14,172 INFO [train.py:996] (2/4) Epoch 5, batch 8650, loss[loss=0.1948, simple_loss=0.2967, pruned_loss=0.04652, over 21772.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3364, pruned_loss=0.09247, over 4271605.37 frames. ], batch size: 351, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:25:31,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=783834.0, ans=0.125 2023-06-20 19:26:25,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=783954.0, ans=0.0 2023-06-20 19:26:48,627 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.897e+02 3.428e+02 4.241e+02 7.600e+02, threshold=6.856e+02, percent-clipped=1.0 2023-06-20 19:26:51,762 INFO [train.py:996] (2/4) Epoch 5, batch 8700, loss[loss=0.2145, simple_loss=0.2776, pruned_loss=0.07565, over 21675.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3296, pruned_loss=0.08934, over 4264605.29 frames. ], batch size: 333, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:27:04,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=12.0 2023-06-20 19:27:05,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=784074.0, ans=0.0 2023-06-20 19:27:08,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=784134.0, ans=0.125 2023-06-20 19:28:12,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. limit=6.0 2023-06-20 19:28:34,542 INFO [train.py:996] (2/4) Epoch 5, batch 8750, loss[loss=0.2507, simple_loss=0.3181, pruned_loss=0.09168, over 21923.00 frames. 
], tot_loss[loss=0.2525, simple_loss=0.3249, pruned_loss=0.09008, over 4269530.20 frames. ], batch size: 316, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:29:06,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=784434.0, ans=0.0 2023-06-20 19:29:06,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=784434.0, ans=0.1 2023-06-20 19:30:14,573 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.095e+02 3.629e+02 5.257e+02 8.550e+02, threshold=7.257e+02, percent-clipped=6.0 2023-06-20 19:30:18,157 INFO [train.py:996] (2/4) Epoch 5, batch 8800, loss[loss=0.2681, simple_loss=0.3303, pruned_loss=0.1029, over 19996.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3332, pruned_loss=0.09387, over 4270358.16 frames. ], batch size: 702, lr: 6.38e-03, grad_scale: 32.0 2023-06-20 19:30:23,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=784674.0, ans=0.1 2023-06-20 19:30:49,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=784734.0, ans=0.2 2023-06-20 19:30:52,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=784734.0, ans=0.2 2023-06-20 19:31:14,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=784794.0, ans=0.125 2023-06-20 19:31:19,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=784854.0, ans=0.025 2023-06-20 19:31:27,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=784854.0, ans=0.125 2023-06-20 19:31:39,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-06-20 19:31:55,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=784914.0, ans=0.125 2023-06-20 19:32:01,541 INFO [train.py:996] (2/4) Epoch 5, batch 8850, loss[loss=0.2605, simple_loss=0.3501, pruned_loss=0.0855, over 21582.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3404, pruned_loss=0.09531, over 4274032.66 frames. ], batch size: 389, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:32:02,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=784974.0, ans=0.09899494936611666 2023-06-20 19:33:25,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=785214.0, ans=0.0 2023-06-20 19:33:30,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=785214.0, ans=0.025 2023-06-20 19:33:45,540 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 3.064e+02 3.643e+02 4.711e+02 6.430e+02, threshold=7.286e+02, percent-clipped=0.0 2023-06-20 19:33:46,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. 
limit=15.0 2023-06-20 19:33:47,244 INFO [train.py:996] (2/4) Epoch 5, batch 8900, loss[loss=0.2349, simple_loss=0.3039, pruned_loss=0.08299, over 21586.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3379, pruned_loss=0.0953, over 4275071.10 frames. ], batch size: 263, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:34:11,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=785334.0, ans=0.2 2023-06-20 19:34:54,390 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=15.0 2023-06-20 19:35:37,886 INFO [train.py:996] (2/4) Epoch 5, batch 8950, loss[loss=0.2539, simple_loss=0.3293, pruned_loss=0.08926, over 21552.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3345, pruned_loss=0.09355, over 4271473.12 frames. ], batch size: 389, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:35:48,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=785574.0, ans=0.125 2023-06-20 19:35:59,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=785634.0, ans=0.2 2023-06-20 19:36:08,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=785634.0, ans=0.125 2023-06-20 19:36:30,543 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-20 19:37:18,313 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.114e+02 3.139e+02 3.697e+02 4.422e+02 7.989e+02, threshold=7.395e+02, percent-clipped=1.0 2023-06-20 19:37:19,869 INFO [train.py:996] (2/4) Epoch 5, batch 9000, loss[loss=0.2209, simple_loss=0.2946, pruned_loss=0.0736, over 21564.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.3295, pruned_loss=0.09365, over 4272459.70 frames. ], batch size: 263, lr: 6.38e-03, grad_scale: 16.0 2023-06-20 19:37:19,870 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 19:37:36,471 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2656, simple_loss=0.3627, pruned_loss=0.0843, over 1796401.00 frames. 2023-06-20 19:37:36,472 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 19:37:40,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=785874.0, ans=0.125 2023-06-20 19:38:21,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=785994.0, ans=0.125 2023-06-20 19:39:08,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=786114.0, ans=6.0 2023-06-20 19:39:11,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=786114.0, ans=0.125 2023-06-20 19:39:20,782 INFO [train.py:996] (2/4) Epoch 5, batch 9050, loss[loss=0.2472, simple_loss=0.3311, pruned_loss=0.08167, over 20807.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3259, pruned_loss=0.09098, over 4275390.09 frames. 
], batch size: 607, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:39:55,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=786234.0, ans=0.125 2023-06-20 19:40:58,940 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.870e+02 2.876e+02 3.189e+02 3.875e+02 6.604e+02, threshold=6.378e+02, percent-clipped=0.0 2023-06-20 19:41:00,733 INFO [train.py:996] (2/4) Epoch 5, batch 9100, loss[loss=0.2553, simple_loss=0.3451, pruned_loss=0.08271, over 21792.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3313, pruned_loss=0.0933, over 4275673.49 frames. ], batch size: 316, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:41:48,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=786594.0, ans=0.1 2023-06-20 19:42:19,531 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:42:20,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.43 vs. limit=12.0 2023-06-20 19:42:34,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-20 19:42:45,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=22.5 2023-06-20 19:42:45,937 INFO [train.py:996] (2/4) Epoch 5, batch 9150, loss[loss=0.2378, simple_loss=0.3183, pruned_loss=0.07863, over 21714.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3319, pruned_loss=0.09031, over 4272328.39 frames. ], batch size: 247, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:43:11,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-20 19:43:17,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-20 19:44:35,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=787014.0, ans=0.125 2023-06-20 19:44:38,235 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.853e+02 3.445e+02 4.216e+02 8.485e+02, threshold=6.890e+02, percent-clipped=4.0 2023-06-20 19:44:39,958 INFO [train.py:996] (2/4) Epoch 5, batch 9200, loss[loss=0.3028, simple_loss=0.4056, pruned_loss=0.1, over 20804.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3348, pruned_loss=0.09, over 4271180.81 frames. ], batch size: 608, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:46:23,473 INFO [train.py:996] (2/4) Epoch 5, batch 9250, loss[loss=0.2587, simple_loss=0.3129, pruned_loss=0.1023, over 21827.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3359, pruned_loss=0.09289, over 4276853.55 frames. ], batch size: 372, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:46:49,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. 
limit=6.0 2023-06-20 19:47:09,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=787494.0, ans=0.125 2023-06-20 19:47:49,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=787614.0, ans=0.125 2023-06-20 19:48:03,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-20 19:48:05,899 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.152e+02 3.844e+02 5.008e+02 7.995e+02, threshold=7.688e+02, percent-clipped=5.0 2023-06-20 19:48:12,347 INFO [train.py:996] (2/4) Epoch 5, batch 9300, loss[loss=0.2639, simple_loss=0.3704, pruned_loss=0.07873, over 21250.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3314, pruned_loss=0.09257, over 4269791.29 frames. ], batch size: 549, lr: 6.37e-03, grad_scale: 32.0 2023-06-20 19:48:43,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=787734.0, ans=0.125 2023-06-20 19:49:13,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=787854.0, ans=0.1 2023-06-20 19:49:52,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=787914.0, ans=0.125 2023-06-20 19:49:57,398 INFO [train.py:996] (2/4) Epoch 5, batch 9350, loss[loss=0.2883, simple_loss=0.3626, pruned_loss=0.107, over 21437.00 frames. ], tot_loss[loss=0.2612, simple_loss=0.335, pruned_loss=0.09369, over 4266779.89 frames. ], batch size: 131, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:50:15,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=787974.0, ans=0.2 2023-06-20 19:50:40,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=788094.0, ans=10.0 2023-06-20 19:50:46,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=788094.0, ans=0.2 2023-06-20 19:50:46,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=788094.0, ans=0.125 2023-06-20 19:51:09,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=788154.0, ans=0.2 2023-06-20 19:51:41,582 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.197e+02 2.821e+02 3.195e+02 3.693e+02 6.028e+02, threshold=6.390e+02, percent-clipped=0.0 2023-06-20 19:51:41,606 INFO [train.py:996] (2/4) Epoch 5, batch 9400, loss[loss=0.2458, simple_loss=0.3068, pruned_loss=0.09241, over 21602.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3371, pruned_loss=0.09472, over 4262701.37 frames. 
], batch size: 247, lr: 6.37e-03, grad_scale: 16.0 2023-06-20 19:51:42,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=788274.0, ans=0.0 2023-06-20 19:51:57,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=788274.0, ans=0.1 2023-06-20 19:53:03,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=788454.0, ans=0.015 2023-06-20 19:53:21,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=788514.0, ans=0.05 2023-06-20 19:53:31,438 INFO [train.py:996] (2/4) Epoch 5, batch 9450, loss[loss=0.2594, simple_loss=0.3582, pruned_loss=0.08028, over 20721.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3302, pruned_loss=0.09325, over 4267508.48 frames. ], batch size: 607, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:54:06,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=788694.0, ans=0.125 2023-06-20 19:54:50,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.32 vs. limit=12.0 2023-06-20 19:54:59,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=788814.0, ans=0.125 2023-06-20 19:55:14,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=788874.0, ans=0.2 2023-06-20 19:55:15,212 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.307e+02 4.000e+02 4.833e+02 8.255e+02, threshold=8.000e+02, percent-clipped=8.0 2023-06-20 19:55:15,242 INFO [train.py:996] (2/4) Epoch 5, batch 9500, loss[loss=0.2155, simple_loss=0.2933, pruned_loss=0.06883, over 21711.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3218, pruned_loss=0.09081, over 4259359.54 frames. ], batch size: 298, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:56:12,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=788994.0, ans=0.125 2023-06-20 19:56:58,678 INFO [train.py:996] (2/4) Epoch 5, batch 9550, loss[loss=0.2685, simple_loss=0.3438, pruned_loss=0.09665, over 21763.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3265, pruned_loss=0.09379, over 4267532.71 frames. 
], batch size: 124, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 19:57:06,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=789174.0, ans=0.125 2023-06-20 19:57:13,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=789234.0, ans=0.125 2023-06-20 19:58:15,560 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 19:58:24,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=789414.0, ans=0.125 2023-06-20 19:58:42,217 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 2.917e+02 3.406e+02 3.917e+02 8.592e+02, threshold=6.812e+02, percent-clipped=1.0 2023-06-20 19:58:42,246 INFO [train.py:996] (2/4) Epoch 5, batch 9600, loss[loss=0.2223, simple_loss=0.2951, pruned_loss=0.07475, over 21432.00 frames. ], tot_loss[loss=0.2591, simple_loss=0.3284, pruned_loss=0.09485, over 4278280.07 frames. ], batch size: 194, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 19:58:57,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=789534.0, ans=0.125 2023-06-20 19:59:01,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=789534.0, ans=0.125 2023-06-20 19:59:56,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=789654.0, ans=0.5 2023-06-20 20:00:20,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=789714.0, ans=0.1 2023-06-20 20:00:20,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=789714.0, ans=0.125 2023-06-20 20:00:28,516 INFO [train.py:996] (2/4) Epoch 5, batch 9650, loss[loss=0.2962, simple_loss=0.3663, pruned_loss=0.113, over 21416.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3299, pruned_loss=0.09535, over 4274872.28 frames. ], batch size: 131, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 20:00:40,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=789774.0, ans=0.0 2023-06-20 20:01:06,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=789894.0, ans=0.0 2023-06-20 20:01:57,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=790014.0, ans=0.0 2023-06-20 20:02:02,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=790014.0, ans=0.1 2023-06-20 20:02:13,493 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.238e+02 2.873e+02 3.424e+02 4.153e+02 6.817e+02, threshold=6.847e+02, percent-clipped=1.0 2023-06-20 20:02:13,518 INFO [train.py:996] (2/4) Epoch 5, batch 9700, loss[loss=0.2765, simple_loss=0.3598, pruned_loss=0.09657, over 20784.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3333, pruned_loss=0.09617, over 4275743.53 frames. 
], batch size: 609, lr: 6.36e-03, grad_scale: 32.0 2023-06-20 20:02:17,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=790074.0, ans=15.0 2023-06-20 20:02:22,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=790074.0, ans=0.2 2023-06-20 20:02:37,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=790134.0, ans=0.1 2023-06-20 20:03:29,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.17 vs. limit=6.0 2023-06-20 20:03:54,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.27 vs. limit=15.0 2023-06-20 20:03:56,233 INFO [train.py:996] (2/4) Epoch 5, batch 9750, loss[loss=0.2132, simple_loss=0.2715, pruned_loss=0.07747, over 21540.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3267, pruned_loss=0.09419, over 4273303.26 frames. ], batch size: 263, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 20:04:05,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=790374.0, ans=0.125 2023-06-20 20:04:20,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=790434.0, ans=10.0 2023-06-20 20:04:20,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=790434.0, ans=0.2 2023-06-20 20:04:55,425 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-20 20:05:08,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-20 20:05:12,819 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:05:34,134 INFO [train.py:996] (2/4) Epoch 5, batch 9800, loss[loss=0.2485, simple_loss=0.3318, pruned_loss=0.08256, over 21829.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3264, pruned_loss=0.09486, over 4274558.57 frames. ], batch size: 124, lr: 6.36e-03, grad_scale: 16.0 2023-06-20 20:05:35,776 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.031e+02 3.637e+02 4.540e+02 9.363e+02, threshold=7.274e+02, percent-clipped=7.0 2023-06-20 20:05:38,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=790674.0, ans=0.125 2023-06-20 20:06:58,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=790914.0, ans=0.0 2023-06-20 20:07:11,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.01 vs. limit=5.0 2023-06-20 20:07:12,194 INFO [train.py:996] (2/4) Epoch 5, batch 9850, loss[loss=0.2053, simple_loss=0.265, pruned_loss=0.07283, over 21651.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3235, pruned_loss=0.09407, over 4254139.00 frames. 
], batch size: 247, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:07:14,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=790974.0, ans=0.125 2023-06-20 20:07:58,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=791094.0, ans=0.0 2023-06-20 20:08:51,623 INFO [train.py:996] (2/4) Epoch 5, batch 9900, loss[loss=0.2875, simple_loss=0.3529, pruned_loss=0.1111, over 21661.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3198, pruned_loss=0.09313, over 4255254.03 frames. ], batch size: 441, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:08:53,116 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.008e+02 2.770e+02 3.187e+02 3.744e+02 7.656e+02, threshold=6.375e+02, percent-clipped=1.0 2023-06-20 20:09:07,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=791334.0, ans=0.125 2023-06-20 20:09:18,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=791334.0, ans=0.125 2023-06-20 20:09:40,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=791394.0, ans=0.0 2023-06-20 20:10:27,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=12.0 2023-06-20 20:10:29,360 INFO [train.py:996] (2/4) Epoch 5, batch 9950, loss[loss=0.2489, simple_loss=0.2977, pruned_loss=0.1, over 21319.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3221, pruned_loss=0.09544, over 4256105.90 frames. ], batch size: 211, lr: 6.35e-03, grad_scale: 16.0 2023-06-20 20:10:54,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=791634.0, ans=0.125 2023-06-20 20:10:56,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=791634.0, ans=0.0 2023-06-20 20:10:56,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=791634.0, ans=0.1 2023-06-20 20:10:59,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=791634.0, ans=0.2 2023-06-20 20:11:27,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=791694.0, ans=0.2 2023-06-20 20:11:29,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=791694.0, ans=0.125 2023-06-20 20:11:29,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.28 vs. 
limit=10.0 2023-06-20 20:11:42,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=791754.0, ans=0.0 2023-06-20 20:11:44,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=791754.0, ans=0.125 2023-06-20 20:11:46,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=791754.0, ans=0.1 2023-06-20 20:11:46,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.74 vs. limit=22.5 2023-06-20 20:11:50,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.36 vs. limit=15.0 2023-06-20 20:12:08,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=791814.0, ans=0.0 2023-06-20 20:12:11,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=791874.0, ans=0.0 2023-06-20 20:12:13,183 INFO [train.py:996] (2/4) Epoch 5, batch 10000, loss[loss=0.1859, simple_loss=0.2543, pruned_loss=0.05877, over 21082.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3173, pruned_loss=0.09367, over 4257085.47 frames. ], batch size: 143, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:12:14,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 2.860e+02 3.256e+02 3.834e+02 6.756e+02, threshold=6.512e+02, percent-clipped=1.0 2023-06-20 20:12:31,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=791874.0, ans=0.2 2023-06-20 20:12:47,966 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.375e-03 2023-06-20 20:14:04,475 INFO [train.py:996] (2/4) Epoch 5, batch 10050, loss[loss=0.211, simple_loss=0.2814, pruned_loss=0.07024, over 21437.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3188, pruned_loss=0.09361, over 4257473.57 frames. ], batch size: 194, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:14:15,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.53 vs. limit=10.0 2023-06-20 20:14:29,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=792234.0, ans=0.125 2023-06-20 20:15:48,188 INFO [train.py:996] (2/4) Epoch 5, batch 10100, loss[loss=0.25, simple_loss=0.3457, pruned_loss=0.07713, over 21243.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3145, pruned_loss=0.09033, over 4266673.87 frames. ], batch size: 548, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:15:49,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.949e+02 3.459e+02 3.930e+02 6.580e+02, threshold=6.918e+02, percent-clipped=2.0 2023-06-20 20:16:32,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=792594.0, ans=0.04949747468305833 2023-06-20 20:17:09,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.91 vs. 
limit=15.0 2023-06-20 20:17:23,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=792714.0, ans=0.0 2023-06-20 20:17:27,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=792714.0, ans=0.125 2023-06-20 20:17:30,464 INFO [train.py:996] (2/4) Epoch 5, batch 10150, loss[loss=0.2796, simple_loss=0.3427, pruned_loss=0.1082, over 21567.00 frames. ], tot_loss[loss=0.255, simple_loss=0.322, pruned_loss=0.09399, over 4264292.52 frames. ], batch size: 414, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:18:20,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=12.0 2023-06-20 20:18:47,353 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.08 vs. limit=6.0 2023-06-20 20:18:57,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=793014.0, ans=0.04949747468305833 2023-06-20 20:19:02,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=793014.0, ans=0.125 2023-06-20 20:19:03,178 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-20 20:19:19,880 INFO [train.py:996] (2/4) Epoch 5, batch 10200, loss[loss=0.192, simple_loss=0.2582, pruned_loss=0.06295, over 16329.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.322, pruned_loss=0.09254, over 4263584.31 frames. ], batch size: 63, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:19:21,489 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.868e+02 3.276e+02 4.054e+02 7.472e+02, threshold=6.552e+02, percent-clipped=1.0 2023-06-20 20:19:53,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=793134.0, ans=0.05 2023-06-20 20:20:57,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-20 20:20:58,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=793314.0, ans=0.1 2023-06-20 20:21:02,693 INFO [train.py:996] (2/4) Epoch 5, batch 10250, loss[loss=0.202, simple_loss=0.291, pruned_loss=0.05647, over 21336.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3166, pruned_loss=0.08644, over 4255820.31 frames. ], batch size: 194, lr: 6.35e-03, grad_scale: 32.0 2023-06-20 20:21:03,306 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:21:28,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=793434.0, ans=0.0 2023-06-20 20:21:44,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.88 vs. 
limit=15.0 2023-06-20 20:21:58,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=793494.0, ans=0.0 2023-06-20 20:22:50,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=793614.0, ans=0.125 2023-06-20 20:22:53,301 INFO [train.py:996] (2/4) Epoch 5, batch 10300, loss[loss=0.2389, simple_loss=0.3115, pruned_loss=0.08309, over 21265.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3185, pruned_loss=0.08799, over 4256501.98 frames. ], batch size: 176, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:22:54,772 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.579e+02 2.621e+02 3.179e+02 4.486e+02 7.082e+02, threshold=6.359e+02, percent-clipped=5.0 2023-06-20 20:23:08,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=793734.0, ans=0.2 2023-06-20 20:23:13,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=793734.0, ans=0.125 2023-06-20 20:23:43,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=793794.0, ans=0.125 2023-06-20 20:24:13,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=793914.0, ans=0.125 2023-06-20 20:24:37,254 INFO [train.py:996] (2/4) Epoch 5, batch 10350, loss[loss=0.2193, simple_loss=0.2877, pruned_loss=0.07542, over 21680.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3212, pruned_loss=0.08818, over 4258102.36 frames. ], batch size: 298, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:25:06,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=794034.0, ans=0.125 2023-06-20 20:25:53,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=794154.0, ans=0.1 2023-06-20 20:26:22,033 INFO [train.py:996] (2/4) Epoch 5, batch 10400, loss[loss=0.2111, simple_loss=0.2914, pruned_loss=0.06545, over 21914.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3134, pruned_loss=0.08588, over 4261360.99 frames. ], batch size: 373, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:26:23,728 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.112e+02 3.947e+02 5.007e+02 1.010e+03, threshold=7.895e+02, percent-clipped=9.0 2023-06-20 20:26:31,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-06-20 20:26:39,836 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-20 20:27:43,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=794454.0, ans=0.2 2023-06-20 20:28:13,690 INFO [train.py:996] (2/4) Epoch 5, batch 10450, loss[loss=0.2959, simple_loss=0.3581, pruned_loss=0.1168, over 21829.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3183, pruned_loss=0.08966, over 4257643.12 frames. ], batch size: 124, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:28:16,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. 
limit=15.0 2023-06-20 20:28:31,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=794634.0, ans=0.125 2023-06-20 20:28:33,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=794634.0, ans=0.0 2023-06-20 20:28:56,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=794694.0, ans=0.0 2023-06-20 20:29:04,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-20 20:29:39,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=794814.0, ans=0.1 2023-06-20 20:29:57,613 INFO [train.py:996] (2/4) Epoch 5, batch 10500, loss[loss=0.2102, simple_loss=0.2754, pruned_loss=0.07245, over 21248.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.317, pruned_loss=0.08782, over 4261194.09 frames. ], batch size: 159, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:29:58,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=794874.0, ans=0.0 2023-06-20 20:29:59,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.969e+02 2.789e+02 3.342e+02 3.915e+02 9.640e+02, threshold=6.684e+02, percent-clipped=1.0 2023-06-20 20:29:59,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=794874.0, ans=0.0 2023-06-20 20:30:02,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=794874.0, ans=0.125 2023-06-20 20:30:08,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=794874.0, ans=0.125 2023-06-20 20:30:18,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=794934.0, ans=0.125 2023-06-20 20:30:27,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=794934.0, ans=0.125 2023-06-20 20:30:55,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=794994.0, ans=0.05 2023-06-20 20:31:42,463 INFO [train.py:996] (2/4) Epoch 5, batch 10550, loss[loss=0.2061, simple_loss=0.2655, pruned_loss=0.07332, over 21416.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.311, pruned_loss=0.08688, over 4251659.08 frames. ], batch size: 212, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:32:28,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=795294.0, ans=0.125 2023-06-20 20:32:33,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=795294.0, ans=0.125 2023-06-20 20:32:34,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.11 vs. 
limit=22.5 2023-06-20 20:32:52,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=795354.0, ans=0.125 2023-06-20 20:33:23,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=795414.0, ans=0.0 2023-06-20 20:33:28,223 INFO [train.py:996] (2/4) Epoch 5, batch 10600, loss[loss=0.204, simple_loss=0.2909, pruned_loss=0.05859, over 21719.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3066, pruned_loss=0.08526, over 4253990.88 frames. ], batch size: 247, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:33:29,831 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.099e+02 3.993e+02 4.741e+02 9.586e+02, threshold=7.985e+02, percent-clipped=4.0 2023-06-20 20:34:02,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=795534.0, ans=0.0 2023-06-20 20:34:47,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=795654.0, ans=0.125 2023-06-20 20:34:47,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=795654.0, ans=0.125 2023-06-20 20:34:49,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-20 20:34:54,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=795654.0, ans=0.125 2023-06-20 20:35:03,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=795714.0, ans=0.2 2023-06-20 20:35:23,429 INFO [train.py:996] (2/4) Epoch 5, batch 10650, loss[loss=0.2175, simple_loss=0.3031, pruned_loss=0.06594, over 21576.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3107, pruned_loss=0.08433, over 4253249.25 frames. ], batch size: 389, lr: 6.34e-03, grad_scale: 32.0 2023-06-20 20:35:23,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=795774.0, ans=0.0 2023-06-20 20:35:38,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=795834.0, ans=0.125 2023-06-20 20:35:41,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=795834.0, ans=0.1 2023-06-20 20:36:14,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=795894.0, ans=0.0 2023-06-20 20:36:44,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=796014.0, ans=0.0 2023-06-20 20:36:47,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=796014.0, ans=0.2 2023-06-20 20:37:05,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=796074.0, ans=0.125 2023-06-20 20:37:06,879 INFO [train.py:996] (2/4) Epoch 5, batch 10700, loss[loss=0.2564, simple_loss=0.326, pruned_loss=0.09341, over 21716.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3098, pruned_loss=0.08455, over 4252154.66 frames. 
], batch size: 298, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:37:08,786 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.900e+02 2.769e+02 3.121e+02 4.008e+02 5.487e+02, threshold=6.241e+02, percent-clipped=0.0 2023-06-20 20:37:42,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=796134.0, ans=0.0 2023-06-20 20:37:47,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=796194.0, ans=0.125 2023-06-20 20:37:48,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-20 20:38:33,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=796314.0, ans=0.125 2023-06-20 20:38:46,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=796314.0, ans=0.125 2023-06-20 20:38:53,072 INFO [train.py:996] (2/4) Epoch 5, batch 10750, loss[loss=0.289, simple_loss=0.391, pruned_loss=0.0935, over 21556.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3223, pruned_loss=0.08942, over 4256158.43 frames. ], batch size: 471, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:39:50,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0 2023-06-20 20:40:44,438 INFO [train.py:996] (2/4) Epoch 5, batch 10800, loss[loss=0.2797, simple_loss=0.3451, pruned_loss=0.1072, over 21568.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3274, pruned_loss=0.08976, over 4259849.37 frames. ], batch size: 230, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:40:47,905 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.174e+02 3.089e+02 3.673e+02 4.196e+02 7.308e+02, threshold=7.346e+02, percent-clipped=3.0 2023-06-20 20:40:57,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-20 20:41:01,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=796734.0, ans=0.0 2023-06-20 20:42:29,750 INFO [train.py:996] (2/4) Epoch 5, batch 10850, loss[loss=0.2608, simple_loss=0.3196, pruned_loss=0.101, over 21999.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3264, pruned_loss=0.08987, over 4265611.74 frames. 
], batch size: 103, lr: 6.33e-03, grad_scale: 32.0 2023-06-20 20:42:44,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=797034.0, ans=0.5 2023-06-20 20:43:06,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=797094.0, ans=0.0 2023-06-20 20:43:32,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=797154.0, ans=0.0 2023-06-20 20:43:32,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=797154.0, ans=0.2 2023-06-20 20:43:52,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=797214.0, ans=0.125 2023-06-20 20:44:07,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=797214.0, ans=0.125 2023-06-20 20:44:11,615 INFO [train.py:996] (2/4) Epoch 5, batch 10900, loss[loss=0.2344, simple_loss=0.312, pruned_loss=0.07842, over 21376.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3202, pruned_loss=0.08771, over 4267182.58 frames. ], batch size: 211, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:44:16,178 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.440e+02 2.886e+02 3.394e+02 4.121e+02 7.095e+02, threshold=6.789e+02, percent-clipped=0.0 2023-06-20 20:44:23,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=797274.0, ans=0.0 2023-06-20 20:44:26,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=797334.0, ans=0.0 2023-06-20 20:44:37,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-20 20:45:08,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=797394.0, ans=0.0 2023-06-20 20:45:24,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=797454.0, ans=0.1 2023-06-20 20:45:50,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=797514.0, ans=0.125 2023-06-20 20:45:53,509 INFO [train.py:996] (2/4) Epoch 5, batch 10950, loss[loss=0.2143, simple_loss=0.2768, pruned_loss=0.07586, over 21414.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3162, pruned_loss=0.08624, over 4266194.75 frames. 
], batch size: 194, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:46:03,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=797574.0, ans=0.07 2023-06-20 20:46:25,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=797634.0, ans=0.2 2023-06-20 20:46:52,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=797754.0, ans=0.125 2023-06-20 20:47:08,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=797754.0, ans=0.125 2023-06-20 20:47:24,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=797814.0, ans=0.05 2023-06-20 20:47:35,116 INFO [train.py:996] (2/4) Epoch 5, batch 11000, loss[loss=0.3056, simple_loss=0.3478, pruned_loss=0.1317, over 21728.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3149, pruned_loss=0.08817, over 4267967.57 frames. ], batch size: 508, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:47:39,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.006e+02 2.761e+02 3.277e+02 3.770e+02 5.855e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-20 20:47:50,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.54 vs. limit=15.0 2023-06-20 20:48:05,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=797934.0, ans=0.125 2023-06-20 20:49:15,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.15 vs. limit=15.0 2023-06-20 20:49:17,467 INFO [train.py:996] (2/4) Epoch 5, batch 11050, loss[loss=0.2685, simple_loss=0.3194, pruned_loss=0.1088, over 14964.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3134, pruned_loss=0.08994, over 4269200.79 frames. ], batch size: 61, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:49:51,599 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:49:55,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=798294.0, ans=0.2 2023-06-20 20:50:59,780 INFO [train.py:996] (2/4) Epoch 5, batch 11100, loss[loss=0.2342, simple_loss=0.304, pruned_loss=0.08224, over 21755.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3138, pruned_loss=0.09075, over 4269142.37 frames. ], batch size: 371, lr: 6.33e-03, grad_scale: 16.0 2023-06-20 20:51:00,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=798474.0, ans=0.0 2023-06-20 20:51:04,516 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 2.972e+02 3.396e+02 4.031e+02 6.791e+02, threshold=6.791e+02, percent-clipped=1.0 2023-06-20 20:51:07,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. 
limit=10.0 2023-06-20 20:51:08,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=798474.0, ans=0.0 2023-06-20 20:51:17,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.61 vs. limit=6.0 2023-06-20 20:51:50,543 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 20:51:52,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=798594.0, ans=0.1 2023-06-20 20:51:57,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=798594.0, ans=0.125 2023-06-20 20:52:33,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-20 20:52:43,901 INFO [train.py:996] (2/4) Epoch 5, batch 11150, loss[loss=0.2603, simple_loss=0.3217, pruned_loss=0.09945, over 21174.00 frames. ], tot_loss[loss=0.247, simple_loss=0.313, pruned_loss=0.09043, over 4255410.50 frames. ], batch size: 548, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 20:52:46,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=798774.0, ans=0.1 2023-06-20 20:52:54,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=798774.0, ans=0.1 2023-06-20 20:54:15,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=799014.0, ans=0.07 2023-06-20 20:54:27,933 INFO [train.py:996] (2/4) Epoch 5, batch 11200, loss[loss=0.2342, simple_loss=0.2984, pruned_loss=0.08505, over 21538.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3115, pruned_loss=0.09095, over 4262120.40 frames. ], batch size: 414, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:54:29,492 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.92 vs. limit=8.0 2023-06-20 20:54:33,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.645e+02 2.968e+02 3.525e+02 6.155e+02, threshold=5.936e+02, percent-clipped=0.0 2023-06-20 20:55:26,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=799254.0, ans=0.125 2023-06-20 20:56:11,037 INFO [train.py:996] (2/4) Epoch 5, batch 11250, loss[loss=0.2427, simple_loss=0.3089, pruned_loss=0.0883, over 21787.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.311, pruned_loss=0.09056, over 4247695.14 frames. ], batch size: 317, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:57:01,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-20 20:57:13,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=799554.0, ans=0.125 2023-06-20 20:57:32,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=15.0 2023-06-20 20:57:52,806 INFO [train.py:996] (2/4) Epoch 5, batch 11300, loss[loss=0.2211, simple_loss=0.2977, pruned_loss=0.07228, over 21797.00 frames. 
], tot_loss[loss=0.2457, simple_loss=0.3114, pruned_loss=0.08999, over 4256171.52 frames. ], batch size: 332, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 20:57:57,539 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.794e+02 3.076e+02 3.460e+02 4.900e+02, threshold=6.152e+02, percent-clipped=0.0 2023-06-20 20:58:01,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=799674.0, ans=0.125 2023-06-20 20:58:49,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=799794.0, ans=0.125 2023-06-20 20:59:21,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=799914.0, ans=0.125 2023-06-20 20:59:38,221 INFO [train.py:996] (2/4) Epoch 5, batch 11350, loss[loss=0.2737, simple_loss=0.3412, pruned_loss=0.1031, over 21690.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3125, pruned_loss=0.08876, over 4256892.35 frames. ], batch size: 298, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 21:00:01,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=800034.0, ans=0.1 2023-06-20 21:00:11,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=800034.0, ans=0.125 2023-06-20 21:00:21,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=15.0 2023-06-20 21:00:29,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=800094.0, ans=0.125 2023-06-20 21:00:35,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=800094.0, ans=0.125 2023-06-20 21:00:35,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=800094.0, ans=0.0 2023-06-20 21:00:54,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=800154.0, ans=0.125 2023-06-20 21:00:56,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=800154.0, ans=0.125 2023-06-20 21:01:21,809 INFO [train.py:996] (2/4) Epoch 5, batch 11400, loss[loss=0.2652, simple_loss=0.3524, pruned_loss=0.08902, over 21758.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3187, pruned_loss=0.09138, over 4257141.13 frames. 
], batch size: 332, lr: 6.32e-03, grad_scale: 32.0 2023-06-20 21:01:26,751 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.351e+02 3.159e+02 3.655e+02 4.619e+02 8.867e+02, threshold=7.309e+02, percent-clipped=8.0 2023-06-20 21:01:39,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=800274.0, ans=0.125 2023-06-20 21:01:42,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=800334.0, ans=0.125 2023-06-20 21:01:52,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=800334.0, ans=0.125 2023-06-20 21:02:13,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-20 21:02:46,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=800514.0, ans=0.125 2023-06-20 21:02:54,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.34 vs. limit=22.5 2023-06-20 21:02:58,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=800514.0, ans=0.1 2023-06-20 21:03:10,022 INFO [train.py:996] (2/4) Epoch 5, batch 11450, loss[loss=0.2365, simple_loss=0.3146, pruned_loss=0.07914, over 21724.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3177, pruned_loss=0.08937, over 4256880.02 frames. ], batch size: 332, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 21:03:18,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=800574.0, ans=0.025 2023-06-20 21:03:59,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.27 vs. limit=22.5 2023-06-20 21:04:11,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=800754.0, ans=0.0 2023-06-20 21:04:23,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.11 vs. limit=10.0 2023-06-20 21:04:50,152 INFO [train.py:996] (2/4) Epoch 5, batch 11500, loss[loss=0.2482, simple_loss=0.3387, pruned_loss=0.07884, over 21672.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3207, pruned_loss=0.09055, over 4257930.40 frames. ], batch size: 389, lr: 6.32e-03, grad_scale: 16.0 2023-06-20 21:04:56,807 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.797e+02 3.137e+02 3.643e+02 6.427e+02, threshold=6.273e+02, percent-clipped=0.0 2023-06-20 21:05:17,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=800934.0, ans=0.125 2023-06-20 21:05:57,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=801054.0, ans=0.1 2023-06-20 21:05:58,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. 
limit=15.0 2023-06-20 21:06:28,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=801114.0, ans=0.125 2023-06-20 21:06:39,671 INFO [train.py:996] (2/4) Epoch 5, batch 11550, loss[loss=0.458, simple_loss=0.5243, pruned_loss=0.1958, over 21452.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.328, pruned_loss=0.09071, over 4262354.42 frames. ], batch size: 508, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:06:48,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=801174.0, ans=0.0 2023-06-20 21:07:16,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.77 vs. limit=10.0 2023-06-20 21:07:43,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=801354.0, ans=0.0 2023-06-20 21:07:50,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=15.0 2023-06-20 21:08:13,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=801414.0, ans=10.0 2023-06-20 21:08:24,264 INFO [train.py:996] (2/4) Epoch 5, batch 11600, loss[loss=0.2816, simple_loss=0.3805, pruned_loss=0.09137, over 21765.00 frames. ], tot_loss[loss=0.264, simple_loss=0.342, pruned_loss=0.09298, over 4261230.01 frames. ], batch size: 332, lr: 6.31e-03, grad_scale: 32.0 2023-06-20 21:08:30,829 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.937e+02 3.426e+02 4.222e+02 6.279e+02, threshold=6.853e+02, percent-clipped=1.0 2023-06-20 21:09:17,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=801594.0, ans=0.0 2023-06-20 21:09:44,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=801654.0, ans=0.1 2023-06-20 21:10:04,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-20 21:10:08,656 INFO [train.py:996] (2/4) Epoch 5, batch 11650, loss[loss=0.2581, simple_loss=0.3204, pruned_loss=0.09787, over 21818.00 frames. ], tot_loss[loss=0.2685, simple_loss=0.3487, pruned_loss=0.09411, over 4267295.96 frames. ], batch size: 107, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:10:39,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=801834.0, ans=0.125 2023-06-20 21:10:46,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-20 21:10:50,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=801894.0, ans=0.125 2023-06-20 21:11:07,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=801954.0, ans=0.2 2023-06-20 21:11:51,273 INFO [train.py:996] (2/4) Epoch 5, batch 11700, loss[loss=0.2072, simple_loss=0.2722, pruned_loss=0.07108, over 21590.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3384, pruned_loss=0.09331, over 4272796.82 frames. 
], batch size: 298, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:11:59,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=802074.0, ans=0.0 2023-06-20 21:12:03,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.808e+02 3.203e+02 3.779e+02 6.898e+02, threshold=6.406e+02, percent-clipped=1.0 2023-06-20 21:13:22,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.00 vs. limit=6.0 2023-06-20 21:13:28,462 INFO [train.py:996] (2/4) Epoch 5, batch 11750, loss[loss=0.2558, simple_loss=0.3209, pruned_loss=0.0953, over 21868.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3292, pruned_loss=0.09265, over 4277786.89 frames. ], batch size: 317, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:13:49,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=802434.0, ans=0.125 2023-06-20 21:13:55,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=802434.0, ans=0.125 2023-06-20 21:14:03,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=802434.0, ans=0.125 2023-06-20 21:15:18,432 INFO [train.py:996] (2/4) Epoch 5, batch 11800, loss[loss=0.2766, simple_loss=0.3575, pruned_loss=0.09784, over 21629.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3307, pruned_loss=0.09416, over 4277837.60 frames. ], batch size: 389, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:15:25,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=802674.0, ans=0.1 2023-06-20 21:15:25,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=802674.0, ans=0.125 2023-06-20 21:15:26,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 2.905e+02 3.538e+02 4.499e+02 8.498e+02, threshold=7.075e+02, percent-clipped=3.0 2023-06-20 21:15:45,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=12.0 2023-06-20 21:15:54,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=802794.0, ans=0.125 2023-06-20 21:16:16,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=802854.0, ans=0.0 2023-06-20 21:16:23,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.49 vs. limit=12.0 2023-06-20 21:16:35,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=802914.0, ans=0.1 2023-06-20 21:16:37,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=802914.0, ans=0.5 2023-06-20 21:16:53,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=16.32 vs. 
limit=15.0 2023-06-20 21:16:53,714 INFO [train.py:996] (2/4) Epoch 5, batch 11850, loss[loss=0.2991, simple_loss=0.3687, pruned_loss=0.1147, over 21766.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3332, pruned_loss=0.09377, over 4274513.01 frames. ], batch size: 414, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:17:20,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=803034.0, ans=0.125 2023-06-20 21:17:48,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=803094.0, ans=0.125 2023-06-20 21:18:01,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-20 21:18:02,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.30 vs. limit=15.0 2023-06-20 21:18:07,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=803154.0, ans=0.125 2023-06-20 21:18:40,293 INFO [train.py:996] (2/4) Epoch 5, batch 11900, loss[loss=0.2613, simple_loss=0.3167, pruned_loss=0.103, over 21175.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3331, pruned_loss=0.09114, over 4273118.14 frames. ], batch size: 159, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:18:48,882 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.635e+02 2.949e+02 3.429e+02 6.903e+02, threshold=5.899e+02, percent-clipped=0.0 2023-06-20 21:19:17,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=803334.0, ans=0.125 2023-06-20 21:20:16,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-20 21:20:25,273 INFO [train.py:996] (2/4) Epoch 5, batch 11950, loss[loss=0.1712, simple_loss=0.2575, pruned_loss=0.04248, over 21555.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3336, pruned_loss=0.08857, over 4276136.45 frames. ], batch size: 230, lr: 6.31e-03, grad_scale: 16.0 2023-06-20 21:20:32,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-20 21:21:03,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=803634.0, ans=0.125 2023-06-20 21:21:14,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=803694.0, ans=0.0 2023-06-20 21:21:14,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. limit=6.0 2023-06-20 21:21:23,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=803754.0, ans=0.1 2023-06-20 21:21:35,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=803754.0, ans=0.5 2023-06-20 21:22:08,858 INFO [train.py:996] (2/4) Epoch 5, batch 12000, loss[loss=0.2506, simple_loss=0.3001, pruned_loss=0.1006, over 21196.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3284, pruned_loss=0.08636, over 4273052.85 frames. 
], batch size: 176, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:22:08,859 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 21:22:26,111 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2641, simple_loss=0.3594, pruned_loss=0.08443, over 1796401.00 frames. 2023-06-20 21:22:26,112 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 21:22:34,440 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.073e+02 3.012e+02 3.779e+02 4.599e+02 7.953e+02, threshold=7.557e+02, percent-clipped=8.0 2023-06-20 21:24:07,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=804114.0, ans=0.125 2023-06-20 21:24:10,203 INFO [train.py:996] (2/4) Epoch 5, batch 12050, loss[loss=0.2569, simple_loss=0.3144, pruned_loss=0.09967, over 21412.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3248, pruned_loss=0.08898, over 4279920.16 frames. ], batch size: 177, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:24:15,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=804174.0, ans=0.125 2023-06-20 21:24:39,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=804234.0, ans=0.0 2023-06-20 21:24:40,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-20 21:26:00,554 INFO [train.py:996] (2/4) Epoch 5, batch 12100, loss[loss=0.2978, simple_loss=0.3616, pruned_loss=0.117, over 21162.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3304, pruned_loss=0.09374, over 4281391.48 frames. ], batch size: 143, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:26:14,367 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.887e+02 3.241e+02 3.794e+02 5.961e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 21:26:46,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=804594.0, ans=0.125 2023-06-20 21:27:07,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=804654.0, ans=0.125 2023-06-20 21:27:17,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=804654.0, ans=0.1 2023-06-20 21:27:27,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=804654.0, ans=0.0 2023-06-20 21:27:54,343 INFO [train.py:996] (2/4) Epoch 5, batch 12150, loss[loss=0.2366, simple_loss=0.31, pruned_loss=0.08163, over 21230.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3344, pruned_loss=0.09333, over 4280160.78 frames. ], batch size: 176, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:28:37,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=804894.0, ans=0.125 2023-06-20 21:29:38,007 INFO [train.py:996] (2/4) Epoch 5, batch 12200, loss[loss=0.2589, simple_loss=0.3042, pruned_loss=0.1069, over 21227.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3314, pruned_loss=0.09167, over 4281238.91 frames. 
], batch size: 471, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:29:51,307 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 2.953e+02 3.466e+02 4.601e+02 9.385e+02, threshold=6.933e+02, percent-clipped=9.0 2023-06-20 21:30:26,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=805194.0, ans=0.1 2023-06-20 21:30:30,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=805194.0, ans=0.0 2023-06-20 21:31:00,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=805254.0, ans=0.0 2023-06-20 21:31:22,571 INFO [train.py:996] (2/4) Epoch 5, batch 12250, loss[loss=0.2416, simple_loss=0.3215, pruned_loss=0.08085, over 21669.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3227, pruned_loss=0.08861, over 4289174.04 frames. ], batch size: 391, lr: 6.30e-03, grad_scale: 32.0 2023-06-20 21:31:39,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=805374.0, ans=0.1 2023-06-20 21:31:41,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=805374.0, ans=0.125 2023-06-20 21:33:06,434 INFO [train.py:996] (2/4) Epoch 5, batch 12300, loss[loss=0.167, simple_loss=0.2448, pruned_loss=0.04461, over 21773.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3142, pruned_loss=0.08318, over 4286661.54 frames. ], batch size: 124, lr: 6.30e-03, grad_scale: 16.0 2023-06-20 21:33:20,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.563e+02 2.636e+02 3.177e+02 4.024e+02 7.253e+02, threshold=6.354e+02, percent-clipped=1.0 2023-06-20 21:33:35,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=805734.0, ans=0.0 2023-06-20 21:34:41,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=805914.0, ans=0.2 2023-06-20 21:34:45,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=22.5 2023-06-20 21:34:49,578 INFO [train.py:996] (2/4) Epoch 5, batch 12350, loss[loss=0.2404, simple_loss=0.3146, pruned_loss=0.08306, over 21888.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3183, pruned_loss=0.08409, over 4286301.98 frames. 
], batch size: 118, lr: 6.30e-03, grad_scale: 16.0 2023-06-20 21:35:02,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=805974.0, ans=0.125 2023-06-20 21:35:03,578 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 21:36:04,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=806154.0, ans=0.015 2023-06-20 21:36:06,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=806154.0, ans=0.0 2023-06-20 21:36:15,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=806154.0, ans=0.125 2023-06-20 21:36:17,894 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.46 vs. limit=15.0 2023-06-20 21:36:32,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=806214.0, ans=10.0 2023-06-20 21:36:34,911 INFO [train.py:996] (2/4) Epoch 5, batch 12400, loss[loss=0.2476, simple_loss=0.3199, pruned_loss=0.08762, over 21847.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3209, pruned_loss=0.08672, over 4284056.34 frames. ], batch size: 371, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:36:49,183 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.846e+02 2.789e+02 3.195e+02 3.703e+02 5.558e+02, threshold=6.391e+02, percent-clipped=0.0 2023-06-20 21:37:08,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=806334.0, ans=15.0 2023-06-20 21:37:33,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=806394.0, ans=0.125 2023-06-20 21:37:33,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=806394.0, ans=0.0 2023-06-20 21:38:25,352 INFO [train.py:996] (2/4) Epoch 5, batch 12450, loss[loss=0.3238, simple_loss=0.3852, pruned_loss=0.1312, over 21379.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3251, pruned_loss=0.09023, over 4284318.37 frames. ], batch size: 131, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:38:56,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=806634.0, ans=0.125 2023-06-20 21:38:58,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=806634.0, ans=0.125 2023-06-20 21:40:17,355 INFO [train.py:996] (2/4) Epoch 5, batch 12500, loss[loss=0.3063, simple_loss=0.3853, pruned_loss=0.1137, over 21483.00 frames. ], tot_loss[loss=0.2618, simple_loss=0.3363, pruned_loss=0.09362, over 4281464.28 frames. 
], batch size: 194, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:40:24,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=806874.0, ans=0.125 2023-06-20 21:40:27,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.481e+02 2.818e+02 3.353e+02 4.175e+02 5.969e+02, threshold=6.707e+02, percent-clipped=0.0 2023-06-20 21:41:12,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=806994.0, ans=0.125 2023-06-20 21:41:58,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-20 21:42:04,544 INFO [train.py:996] (2/4) Epoch 5, batch 12550, loss[loss=0.2568, simple_loss=0.3438, pruned_loss=0.08486, over 21730.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3411, pruned_loss=0.09667, over 4283754.15 frames. ], batch size: 332, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:42:42,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=807234.0, ans=0.1 2023-06-20 21:43:03,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=807294.0, ans=0.125 2023-06-20 21:43:06,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=807294.0, ans=0.2 2023-06-20 21:43:53,045 INFO [train.py:996] (2/4) Epoch 5, batch 12600, loss[loss=0.2827, simple_loss=0.3911, pruned_loss=0.08715, over 20787.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3399, pruned_loss=0.09413, over 4282536.06 frames. ], batch size: 608, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:44:08,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.858e+02 3.271e+02 3.869e+02 6.376e+02, threshold=6.541e+02, percent-clipped=0.0 2023-06-20 21:44:14,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=807534.0, ans=0.125 2023-06-20 21:44:32,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=807594.0, ans=0.125 2023-06-20 21:45:18,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=807714.0, ans=0.0 2023-06-20 21:45:19,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-20 21:45:24,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=807714.0, ans=0.04949747468305833 2023-06-20 21:45:28,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-20 21:45:35,602 INFO [train.py:996] (2/4) Epoch 5, batch 12650, loss[loss=0.2362, simple_loss=0.3033, pruned_loss=0.08453, over 21312.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3308, pruned_loss=0.08991, over 4275129.75 frames. ], batch size: 176, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:46:44,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.49 vs. 
limit=15.0 2023-06-20 21:47:10,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=808014.0, ans=0.125 2023-06-20 21:47:20,044 INFO [train.py:996] (2/4) Epoch 5, batch 12700, loss[loss=0.224, simple_loss=0.2971, pruned_loss=0.07545, over 21068.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3296, pruned_loss=0.09241, over 4278149.51 frames. ], batch size: 608, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:47:35,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 2.923e+02 3.430e+02 4.123e+02 8.274e+02, threshold=6.860e+02, percent-clipped=2.0 2023-06-20 21:47:41,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=808134.0, ans=0.125 2023-06-20 21:47:43,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=808134.0, ans=0.125 2023-06-20 21:48:02,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=808194.0, ans=0.125 2023-06-20 21:48:34,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-20 21:49:03,369 INFO [train.py:996] (2/4) Epoch 5, batch 12750, loss[loss=0.2815, simple_loss=0.3431, pruned_loss=0.1099, over 21872.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.3322, pruned_loss=0.09379, over 4278781.75 frames. ], batch size: 107, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:49:24,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=808434.0, ans=0.125 2023-06-20 21:49:39,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=808434.0, ans=0.125 2023-06-20 21:49:42,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=808434.0, ans=0.125 2023-06-20 21:50:25,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=808554.0, ans=0.1 2023-06-20 21:50:37,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=808614.0, ans=0.125 2023-06-20 21:50:52,091 INFO [train.py:996] (2/4) Epoch 5, batch 12800, loss[loss=0.2776, simple_loss=0.334, pruned_loss=0.1106, over 21558.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3316, pruned_loss=0.09454, over 4282884.10 frames. 
], batch size: 548, lr: 6.29e-03, grad_scale: 32.0 2023-06-20 21:50:52,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=808674.0, ans=0.125 2023-06-20 21:50:54,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=808674.0, ans=0.1 2023-06-20 21:51:03,861 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.761e+02 3.177e+02 3.732e+02 6.852e+02, threshold=6.353e+02, percent-clipped=0.0 2023-06-20 21:51:14,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=808734.0, ans=0.2 2023-06-20 21:51:17,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=808734.0, ans=0.0 2023-06-20 21:51:24,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=808734.0, ans=0.125 2023-06-20 21:51:55,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=808854.0, ans=0.125 2023-06-20 21:52:20,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=808914.0, ans=0.0 2023-06-20 21:52:37,061 INFO [train.py:996] (2/4) Epoch 5, batch 12850, loss[loss=0.2244, simple_loss=0.3212, pruned_loss=0.06375, over 21729.00 frames. ], tot_loss[loss=0.2628, simple_loss=0.3345, pruned_loss=0.09551, over 4277967.38 frames. ], batch size: 351, lr: 6.28e-03, grad_scale: 32.0 2023-06-20 21:52:47,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=808974.0, ans=0.07 2023-06-20 21:53:08,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=809034.0, ans=0.125 2023-06-20 21:53:08,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=809034.0, ans=0.0 2023-06-20 21:53:11,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=809034.0, ans=0.0 2023-06-20 21:53:23,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=809094.0, ans=0.2 2023-06-20 21:53:33,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=809094.0, ans=0.125 2023-06-20 21:53:38,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=809094.0, ans=0.125 2023-06-20 21:53:52,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809154.0, ans=0.1 2023-06-20 21:54:20,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=809214.0, ans=0.1 2023-06-20 21:54:27,293 INFO [train.py:996] (2/4) Epoch 5, batch 12900, loss[loss=0.2294, simple_loss=0.2986, pruned_loss=0.08009, over 21567.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.3315, pruned_loss=0.09194, over 4281575.66 frames. 
], batch size: 212, lr: 6.28e-03, grad_scale: 32.0 2023-06-20 21:54:45,414 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.796e+02 3.210e+02 3.770e+02 8.746e+02, threshold=6.419e+02, percent-clipped=1.0 2023-06-20 21:55:56,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=809514.0, ans=0.05 2023-06-20 21:56:09,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-20 21:56:16,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=809574.0, ans=0.0 2023-06-20 21:56:17,473 INFO [train.py:996] (2/4) Epoch 5, batch 12950, loss[loss=0.2686, simple_loss=0.347, pruned_loss=0.09512, over 21719.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3305, pruned_loss=0.08989, over 4278553.03 frames. ], batch size: 298, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 21:57:08,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=809694.0, ans=0.125 2023-06-20 21:57:09,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=809694.0, ans=0.0 2023-06-20 21:57:33,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-20 21:57:45,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=809814.0, ans=0.0 2023-06-20 21:57:57,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=809814.0, ans=0.125 2023-06-20 21:58:00,377 INFO [train.py:996] (2/4) Epoch 5, batch 13000, loss[loss=0.2022, simple_loss=0.2679, pruned_loss=0.06821, over 21143.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3303, pruned_loss=0.08967, over 4278373.49 frames. ], batch size: 143, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 21:58:15,216 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.939e+02 3.382e+02 4.149e+02 6.832e+02, threshold=6.764e+02, percent-clipped=3.0 2023-06-20 21:58:20,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=809934.0, ans=0.0 2023-06-20 21:58:27,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=809934.0, ans=0.1 2023-06-20 21:59:45,280 INFO [train.py:996] (2/4) Epoch 5, batch 13050, loss[loss=0.2607, simple_loss=0.3244, pruned_loss=0.09849, over 21757.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.328, pruned_loss=0.08736, over 4261299.72 frames. 
], batch size: 389, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:00:33,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=810294.0, ans=0.125 2023-06-20 22:00:53,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=810354.0, ans=0.125 2023-06-20 22:01:16,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=810414.0, ans=0.125 2023-06-20 22:01:17,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=810414.0, ans=0.125 2023-06-20 22:01:22,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=810414.0, ans=0.125 2023-06-20 22:01:27,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.81 vs. limit=15.0 2023-06-20 22:01:29,090 INFO [train.py:996] (2/4) Epoch 5, batch 13100, loss[loss=0.2206, simple_loss=0.3072, pruned_loss=0.067, over 21729.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.329, pruned_loss=0.08759, over 4270162.80 frames. ], batch size: 247, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:01:31,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=810474.0, ans=0.0 2023-06-20 22:01:49,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.828e+02 3.544e+02 4.567e+02 8.084e+02, threshold=7.089e+02, percent-clipped=1.0 2023-06-20 22:03:16,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=810714.0, ans=0.125 2023-06-20 22:03:19,527 INFO [train.py:996] (2/4) Epoch 5, batch 13150, loss[loss=0.2119, simple_loss=0.2912, pruned_loss=0.06631, over 21833.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3288, pruned_loss=0.08977, over 4275163.68 frames. ], batch size: 317, lr: 6.28e-03, grad_scale: 8.0 2023-06-20 22:03:57,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=810894.0, ans=0.0 2023-06-20 22:04:34,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=810954.0, ans=0.125 2023-06-20 22:04:36,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.53 vs. limit=15.0 2023-06-20 22:05:03,354 INFO [train.py:996] (2/4) Epoch 5, batch 13200, loss[loss=0.2676, simple_loss=0.3321, pruned_loss=0.1016, over 21948.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3295, pruned_loss=0.09127, over 4277157.97 frames. 
], batch size: 372, lr: 6.28e-03, grad_scale: 16.0 2023-06-20 22:05:10,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=811074.0, ans=0.125 2023-06-20 22:05:18,457 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.901e+02 3.344e+02 4.017e+02 7.205e+02, threshold=6.688e+02, percent-clipped=1.0 2023-06-20 22:05:27,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=811134.0, ans=0.125 2023-06-20 22:05:40,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-20 22:06:10,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=811254.0, ans=0.125 2023-06-20 22:06:48,155 INFO [train.py:996] (2/4) Epoch 5, batch 13250, loss[loss=0.2493, simple_loss=0.3153, pruned_loss=0.09163, over 21508.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3287, pruned_loss=0.09267, over 4271343.39 frames. ], batch size: 548, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:06:53,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=811374.0, ans=0.125 2023-06-20 22:07:43,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=811494.0, ans=0.125 2023-06-20 22:08:17,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=811614.0, ans=0.125 2023-06-20 22:08:39,112 INFO [train.py:996] (2/4) Epoch 5, batch 13300, loss[loss=0.2841, simple_loss=0.3453, pruned_loss=0.1115, over 21828.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3322, pruned_loss=0.09239, over 4269507.68 frames. ], batch size: 298, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:08:59,103 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 2.849e+02 3.375e+02 4.056e+02 7.431e+02, threshold=6.749e+02, percent-clipped=1.0 2023-06-20 22:09:03,894 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-06-20 22:09:52,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=811854.0, ans=0.125 2023-06-20 22:09:58,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=811854.0, ans=0.2 2023-06-20 22:10:13,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=22.5 2023-06-20 22:10:21,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=811914.0, ans=0.0 2023-06-20 22:10:28,594 INFO [train.py:996] (2/4) Epoch 5, batch 13350, loss[loss=0.2734, simple_loss=0.3493, pruned_loss=0.09875, over 21630.00 frames. ], tot_loss[loss=0.2649, simple_loss=0.337, pruned_loss=0.09638, over 4273205.50 frames. 
], batch size: 230, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:10:29,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=811974.0, ans=0.0 2023-06-20 22:10:46,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=812034.0, ans=15.0 2023-06-20 22:11:25,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=812094.0, ans=0.1 2023-06-20 22:11:36,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=812154.0, ans=0.0 2023-06-20 22:11:38,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=812154.0, ans=0.0 2023-06-20 22:11:52,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=812214.0, ans=0.125 2023-06-20 22:11:53,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=812214.0, ans=0.0 2023-06-20 22:11:53,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=812214.0, ans=0.0 2023-06-20 22:12:08,362 INFO [train.py:996] (2/4) Epoch 5, batch 13400, loss[loss=0.3286, simple_loss=0.3735, pruned_loss=0.1419, over 21507.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.337, pruned_loss=0.09822, over 4277444.16 frames. ], batch size: 507, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:12:13,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=812274.0, ans=0.0 2023-06-20 22:12:22,372 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.312e+02 2.994e+02 3.475e+02 4.105e+02 5.675e+02, threshold=6.951e+02, percent-clipped=0.0 2023-06-20 22:12:29,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=812334.0, ans=0.0 2023-06-20 22:13:40,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=812514.0, ans=0.125 2023-06-20 22:13:46,632 INFO [train.py:996] (2/4) Epoch 5, batch 13450, loss[loss=0.2358, simple_loss=0.2936, pruned_loss=0.08906, over 21469.00 frames. ], tot_loss[loss=0.271, simple_loss=0.3392, pruned_loss=0.1014, over 4273964.11 frames. ], batch size: 194, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:13:58,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=812574.0, ans=0.2 2023-06-20 22:14:15,463 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:14:30,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=812694.0, ans=0.0 2023-06-20 22:14:52,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. 
limit=15.0 2023-06-20 22:14:53,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=812754.0, ans=0.125 2023-06-20 22:15:00,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=812754.0, ans=0.125 2023-06-20 22:15:11,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=812814.0, ans=0.2 2023-06-20 22:15:11,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=812814.0, ans=0.2 2023-06-20 22:15:13,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-20 22:15:14,935 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=22.5 2023-06-20 22:15:33,805 INFO [train.py:996] (2/4) Epoch 5, batch 13500, loss[loss=0.246, simple_loss=0.302, pruned_loss=0.09497, over 21365.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3304, pruned_loss=0.09844, over 4268529.38 frames. ], batch size: 211, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:15:53,687 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.173e+02 3.616e+02 4.505e+02 8.152e+02, threshold=7.232e+02, percent-clipped=1.0 2023-06-20 22:15:57,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=812934.0, ans=0.125 2023-06-20 22:16:39,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=813054.0, ans=0.0 2023-06-20 22:16:54,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=813114.0, ans=0.2 2023-06-20 22:17:15,901 INFO [train.py:996] (2/4) Epoch 5, batch 13550, loss[loss=0.2425, simple_loss=0.3329, pruned_loss=0.07606, over 21400.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3326, pruned_loss=0.09692, over 4267303.66 frames. ], batch size: 194, lr: 6.27e-03, grad_scale: 8.0 2023-06-20 22:17:46,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=813234.0, ans=0.125 2023-06-20 22:18:00,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=813294.0, ans=0.0 2023-06-20 22:18:26,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=813354.0, ans=0.0 2023-06-20 22:18:38,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-20 22:18:55,146 INFO [train.py:996] (2/4) Epoch 5, batch 13600, loss[loss=0.2261, simple_loss=0.2973, pruned_loss=0.07743, over 21491.00 frames. ], tot_loss[loss=0.2629, simple_loss=0.3341, pruned_loss=0.09586, over 4272632.75 frames. 
], batch size: 194, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:19:13,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=813474.0, ans=0.125 2023-06-20 22:19:16,175 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.068e+02 2.895e+02 3.506e+02 4.425e+02 7.285e+02, threshold=7.012e+02, percent-clipped=2.0 2023-06-20 22:19:18,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=813534.0, ans=0.125 2023-06-20 22:20:39,916 INFO [train.py:996] (2/4) Epoch 5, batch 13650, loss[loss=0.2357, simple_loss=0.2906, pruned_loss=0.09042, over 21737.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.33, pruned_loss=0.09236, over 4266463.08 frames. ], batch size: 112, lr: 6.27e-03, grad_scale: 16.0 2023-06-20 22:20:53,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=813774.0, ans=0.125 2023-06-20 22:20:56,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=22.5 2023-06-20 22:21:06,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=813834.0, ans=0.0 2023-06-20 22:21:20,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=813894.0, ans=0.2 2023-06-20 22:21:32,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=813954.0, ans=0.125 2023-06-20 22:22:19,153 INFO [train.py:996] (2/4) Epoch 5, batch 13700, loss[loss=0.2633, simple_loss=0.3385, pruned_loss=0.09409, over 21791.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3255, pruned_loss=0.09188, over 4264439.52 frames. ], batch size: 332, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:22:35,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=814134.0, ans=0.125 2023-06-20 22:22:41,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.894e+02 2.806e+02 3.342e+02 4.306e+02 8.545e+02, threshold=6.684e+02, percent-clipped=2.0 2023-06-20 22:22:47,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=814134.0, ans=0.125 2023-06-20 22:23:36,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=12.0 2023-06-20 22:23:45,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=814314.0, ans=0.0 2023-06-20 22:24:01,305 INFO [train.py:996] (2/4) Epoch 5, batch 13750, loss[loss=0.2708, simple_loss=0.3283, pruned_loss=0.1067, over 20269.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3248, pruned_loss=0.09193, over 4266118.46 frames. ], batch size: 703, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:24:40,206 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:24:58,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.21 vs. 
limit=15.0 2023-06-20 22:25:10,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=814554.0, ans=0.2 2023-06-20 22:25:23,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=814554.0, ans=0.0 2023-06-20 22:25:36,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=814614.0, ans=0.125 2023-06-20 22:25:49,713 INFO [train.py:996] (2/4) Epoch 5, batch 13800, loss[loss=0.237, simple_loss=0.3276, pruned_loss=0.07321, over 20767.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3271, pruned_loss=0.09023, over 4261763.64 frames. ], batch size: 608, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:26:06,020 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 2.993e+02 3.321e+02 4.024e+02 5.976e+02, threshold=6.643e+02, percent-clipped=0.0 2023-06-20 22:26:42,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=814854.0, ans=0.0 2023-06-20 22:27:23,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.90 vs. limit=22.5 2023-06-20 22:27:26,026 INFO [train.py:996] (2/4) Epoch 5, batch 13850, loss[loss=0.3312, simple_loss=0.404, pruned_loss=0.1292, over 21322.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3342, pruned_loss=0.09241, over 4264495.55 frames. ], batch size: 548, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:27:54,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=815034.0, ans=0.0 2023-06-20 22:27:56,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=815034.0, ans=0.0 2023-06-20 22:28:01,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-20 22:28:02,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=815094.0, ans=0.125 2023-06-20 22:29:01,723 INFO [train.py:996] (2/4) Epoch 5, batch 13900, loss[loss=0.3547, simple_loss=0.3855, pruned_loss=0.1619, over 21620.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.3388, pruned_loss=0.09611, over 4272137.73 frames. 
], batch size: 507, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:29:27,674 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 2.909e+02 3.378e+02 3.977e+02 7.082e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-20 22:29:37,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=815334.0, ans=0.1 2023-06-20 22:29:45,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=815394.0, ans=0.0 2023-06-20 22:30:10,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=815454.0, ans=0.0 2023-06-20 22:30:25,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=815514.0, ans=0.0 2023-06-20 22:30:41,776 INFO [train.py:996] (2/4) Epoch 5, batch 13950, loss[loss=0.3557, simple_loss=0.395, pruned_loss=0.1582, over 21588.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.338, pruned_loss=0.09727, over 4277538.51 frames. ], batch size: 471, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:30:49,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-20 22:31:19,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=815634.0, ans=0.125 2023-06-20 22:31:27,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=815694.0, ans=0.1 2023-06-20 22:31:42,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=815754.0, ans=0.125 2023-06-20 22:32:05,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-20 22:32:24,856 INFO [train.py:996] (2/4) Epoch 5, batch 14000, loss[loss=0.2066, simple_loss=0.3182, pruned_loss=0.0475, over 19769.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3342, pruned_loss=0.09453, over 4274831.04 frames. ], batch size: 702, lr: 6.26e-03, grad_scale: 32.0 2023-06-20 22:32:45,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.619e+02 3.139e+02 3.881e+02 6.690e+02, threshold=6.278e+02, percent-clipped=0.0 2023-06-20 22:32:46,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=815934.0, ans=0.0 2023-06-20 22:33:02,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=815994.0, ans=0.1 2023-06-20 22:33:32,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=816054.0, ans=0.125 2023-06-20 22:33:56,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=816114.0, ans=0.2 2023-06-20 22:34:05,240 INFO [train.py:996] (2/4) Epoch 5, batch 14050, loss[loss=0.1972, simple_loss=0.2626, pruned_loss=0.06594, over 21481.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.328, pruned_loss=0.09051, over 4279112.94 frames. 
], batch size: 230, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:34:05,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=816174.0, ans=0.1 2023-06-20 22:34:15,796 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.05 vs. limit=12.0 2023-06-20 22:35:44,107 INFO [train.py:996] (2/4) Epoch 5, batch 14100, loss[loss=0.2666, simple_loss=0.3269, pruned_loss=0.1032, over 21691.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3218, pruned_loss=0.09058, over 4272758.87 frames. ], batch size: 298, lr: 6.26e-03, grad_scale: 16.0 2023-06-20 22:36:07,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.084e+02 2.771e+02 3.154e+02 4.028e+02 6.108e+02, threshold=6.308e+02, percent-clipped=0.0 2023-06-20 22:36:22,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=816594.0, ans=22.5 2023-06-20 22:36:22,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-20 22:36:25,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=15.0 2023-06-20 22:37:06,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=816714.0, ans=0.0 2023-06-20 22:37:18,622 INFO [train.py:996] (2/4) Epoch 5, batch 14150, loss[loss=0.2549, simple_loss=0.3344, pruned_loss=0.08772, over 21878.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3265, pruned_loss=0.09176, over 4279167.74 frames. ], batch size: 98, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:37:26,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=816774.0, ans=0.1 2023-06-20 22:38:17,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=816954.0, ans=0.125 2023-06-20 22:38:20,723 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.73 vs. limit=12.0 2023-06-20 22:38:31,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=816954.0, ans=0.0 2023-06-20 22:38:35,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=817014.0, ans=0.1 2023-06-20 22:38:57,651 INFO [train.py:996] (2/4) Epoch 5, batch 14200, loss[loss=0.2475, simple_loss=0.3092, pruned_loss=0.0929, over 21637.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3255, pruned_loss=0.09012, over 4267110.43 frames. 
], batch size: 414, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:39:17,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=817134.0, ans=0.0 2023-06-20 22:39:19,849 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.484e+02 2.963e+02 3.702e+02 8.044e+02, threshold=5.927e+02, percent-clipped=3.0 2023-06-20 22:40:20,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=817314.0, ans=0.0 2023-06-20 22:40:36,170 INFO [train.py:996] (2/4) Epoch 5, batch 14250, loss[loss=0.206, simple_loss=0.2623, pruned_loss=0.07489, over 20276.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3192, pruned_loss=0.0902, over 4261385.84 frames. ], batch size: 703, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:40:38,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=22.5 2023-06-20 22:41:34,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=817554.0, ans=0.125 2023-06-20 22:41:37,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=817554.0, ans=0.0 2023-06-20 22:41:42,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=817554.0, ans=0.0 2023-06-20 22:42:17,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=817674.0, ans=0.2 2023-06-20 22:42:22,715 INFO [train.py:996] (2/4) Epoch 5, batch 14300, loss[loss=0.3877, simple_loss=0.4656, pruned_loss=0.1549, over 21700.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3161, pruned_loss=0.0881, over 4241915.35 frames. ], batch size: 414, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:42:23,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=817674.0, ans=0.0 2023-06-20 22:42:33,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=817674.0, ans=0.2 2023-06-20 22:42:46,153 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 3.147e+02 3.840e+02 5.009e+02 9.347e+02, threshold=7.680e+02, percent-clipped=16.0 2023-06-20 22:43:04,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=817794.0, ans=0.125 2023-06-20 22:43:29,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=817854.0, ans=0.0 2023-06-20 22:43:52,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-20 22:43:55,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=817914.0, ans=0.1 2023-06-20 22:44:02,691 INFO [train.py:996] (2/4) Epoch 5, batch 14350, loss[loss=0.2467, simple_loss=0.3154, pruned_loss=0.08904, over 21882.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3248, pruned_loss=0.0902, over 4252928.03 frames. 
], batch size: 316, lr: 6.25e-03, grad_scale: 16.0 2023-06-20 22:44:06,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=817974.0, ans=0.125 2023-06-20 22:44:06,777 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-20 22:44:17,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=818034.0, ans=0.0 2023-06-20 22:44:59,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=12.0 2023-06-20 22:45:03,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=818154.0, ans=0.2 2023-06-20 22:45:40,634 INFO [train.py:996] (2/4) Epoch 5, batch 14400, loss[loss=0.2578, simple_loss=0.3169, pruned_loss=0.0994, over 21827.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.323, pruned_loss=0.0906, over 4266427.53 frames. ], batch size: 351, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:45:58,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.046e+02 2.774e+02 3.108e+02 3.689e+02 4.790e+02, threshold=6.217e+02, percent-clipped=0.0 2023-06-20 22:46:07,480 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.04 vs. limit=6.0 2023-06-20 22:46:14,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=818334.0, ans=0.0 2023-06-20 22:46:16,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=818394.0, ans=0.0 2023-06-20 22:46:45,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=818454.0, ans=0.2 2023-06-20 22:47:20,542 INFO [train.py:996] (2/4) Epoch 5, batch 14450, loss[loss=0.2472, simple_loss=0.3148, pruned_loss=0.08978, over 21490.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3181, pruned_loss=0.09064, over 4263000.98 frames. ], batch size: 131, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:47:23,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=818574.0, ans=0.125 2023-06-20 22:48:03,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=818694.0, ans=0.07 2023-06-20 22:48:41,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=818814.0, ans=0.1 2023-06-20 22:48:49,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=818814.0, ans=0.0 2023-06-20 22:48:58,551 INFO [train.py:996] (2/4) Epoch 5, batch 14500, loss[loss=0.2159, simple_loss=0.2807, pruned_loss=0.07559, over 21789.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3141, pruned_loss=0.09032, over 4269234.10 frames. 
], batch size: 316, lr: 6.25e-03, grad_scale: 32.0 2023-06-20 22:49:16,440 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 2.793e+02 3.259e+02 3.991e+02 5.427e+02, threshold=6.518e+02, percent-clipped=0.0 2023-06-20 22:49:52,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=818994.0, ans=0.125 2023-06-20 22:50:27,615 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:50:37,013 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-20 22:50:40,052 INFO [train.py:996] (2/4) Epoch 5, batch 14550, loss[loss=0.2404, simple_loss=0.3234, pruned_loss=0.0787, over 19996.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3195, pruned_loss=0.09202, over 4269956.52 frames. ], batch size: 703, lr: 6.24e-03, grad_scale: 32.0 2023-06-20 22:50:43,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=819174.0, ans=0.125 2023-06-20 22:50:55,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=819174.0, ans=0.1 2023-06-20 22:51:04,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=819234.0, ans=0.125 2023-06-20 22:51:30,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=819294.0, ans=0.125 2023-06-20 22:52:06,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=819414.0, ans=0.2 2023-06-20 22:52:20,173 INFO [train.py:996] (2/4) Epoch 5, batch 14600, loss[loss=0.2543, simple_loss=0.3422, pruned_loss=0.08324, over 21666.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3289, pruned_loss=0.0951, over 4274809.17 frames. ], batch size: 263, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:52:44,007 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.146e+02 3.577e+02 4.655e+02 8.854e+02, threshold=7.154e+02, percent-clipped=8.0 2023-06-20 22:53:47,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=819714.0, ans=0.07 2023-06-20 22:54:00,101 INFO [train.py:996] (2/4) Epoch 5, batch 14650, loss[loss=0.2408, simple_loss=0.3177, pruned_loss=0.08194, over 21687.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3314, pruned_loss=0.09436, over 4268598.99 frames. ], batch size: 263, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:54:21,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=22.5 2023-06-20 22:55:00,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-20 22:55:36,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-06-20 22:55:41,428 INFO [train.py:996] (2/4) Epoch 5, batch 14700, loss[loss=0.2072, simple_loss=0.288, pruned_loss=0.06324, over 21346.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3226, pruned_loss=0.08797, over 4255830.56 frames. 
], batch size: 176, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:56:11,181 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.600e+02 2.525e+02 2.974e+02 3.979e+02 6.680e+02, threshold=5.948e+02, percent-clipped=0.0 2023-06-20 22:56:16,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=820134.0, ans=0.125 2023-06-20 22:57:24,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=820314.0, ans=0.125 2023-06-20 22:57:29,235 INFO [train.py:996] (2/4) Epoch 5, batch 14750, loss[loss=0.2962, simple_loss=0.3566, pruned_loss=0.1179, over 21491.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3272, pruned_loss=0.09047, over 4261913.07 frames. ], batch size: 131, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 22:57:40,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=820374.0, ans=15.0 2023-06-20 22:57:51,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=820434.0, ans=0.0 2023-06-20 22:57:51,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=820434.0, ans=0.2 2023-06-20 22:58:15,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=820494.0, ans=0.0 2023-06-20 22:58:18,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=820494.0, ans=0.125 2023-06-20 22:59:11,145 INFO [train.py:996] (2/4) Epoch 5, batch 14800, loss[loss=0.2616, simple_loss=0.3721, pruned_loss=0.07553, over 19958.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3386, pruned_loss=0.09644, over 4264499.40 frames. ], batch size: 702, lr: 6.24e-03, grad_scale: 32.0 2023-06-20 22:59:23,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=820674.0, ans=0.125 2023-06-20 22:59:28,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=820734.0, ans=0.2 2023-06-20 22:59:30,587 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.410e+02 3.152e+02 3.633e+02 4.425e+02 1.058e+03, threshold=7.266e+02, percent-clipped=3.0 2023-06-20 22:59:45,579 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-20 23:00:44,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=820914.0, ans=0.0 2023-06-20 23:00:44,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=820914.0, ans=0.125 2023-06-20 23:00:55,538 INFO [train.py:996] (2/4) Epoch 5, batch 14850, loss[loss=0.2358, simple_loss=0.2933, pruned_loss=0.08918, over 21823.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3324, pruned_loss=0.09578, over 4263815.43 frames. 
], batch size: 352, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:01:36,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=821094.0, ans=0.1 2023-06-20 23:01:37,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-20 23:02:20,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=821214.0, ans=0.125 2023-06-20 23:02:22,135 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-20 23:02:22,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=821214.0, ans=0.125 2023-06-20 23:02:37,278 INFO [train.py:996] (2/4) Epoch 5, batch 14900, loss[loss=0.3147, simple_loss=0.3642, pruned_loss=0.1326, over 21243.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.335, pruned_loss=0.09798, over 4264374.38 frames. ], batch size: 143, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:03:08,767 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.285e+02 3.108e+02 3.722e+02 4.348e+02 7.688e+02, threshold=7.444e+02, percent-clipped=1.0 2023-06-20 23:03:49,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-20 23:04:14,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=821514.0, ans=0.125 2023-06-20 23:04:28,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=821574.0, ans=0.125 2023-06-20 23:04:29,723 INFO [train.py:996] (2/4) Epoch 5, batch 14950, loss[loss=0.2541, simple_loss=0.3067, pruned_loss=0.1008, over 20131.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3361, pruned_loss=0.09773, over 4262657.92 frames. ], batch size: 703, lr: 6.24e-03, grad_scale: 16.0 2023-06-20 23:04:35,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=821574.0, ans=0.0 2023-06-20 23:04:50,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=821634.0, ans=0.2 2023-06-20 23:05:16,333 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=22.5 2023-06-20 23:06:00,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=821814.0, ans=0.125 2023-06-20 23:06:02,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=821814.0, ans=0.0 2023-06-20 23:06:02,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=821814.0, ans=0.125 2023-06-20 23:06:10,056 INFO [train.py:996] (2/4) Epoch 5, batch 15000, loss[loss=0.3245, simple_loss=0.3824, pruned_loss=0.1333, over 21765.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3381, pruned_loss=0.09922, over 4263960.26 frames. 
], batch size: 441, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:06:10,064 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-20 23:06:22,973 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.3948, 3.0509, 1.8509, 1.4322], device='cuda:2') 2023-06-20 23:06:26,238 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2595, simple_loss=0.3578, pruned_loss=0.08055, over 1796401.00 frames. 2023-06-20 23:06:26,238 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-20 23:06:37,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=821874.0, ans=0.125 2023-06-20 23:07:00,531 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.991e+02 3.617e+02 4.837e+02 7.610e+02, threshold=7.234e+02, percent-clipped=2.0 2023-06-20 23:08:12,427 INFO [train.py:996] (2/4) Epoch 5, batch 15050, loss[loss=0.2392, simple_loss=0.3288, pruned_loss=0.07485, over 21862.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.3396, pruned_loss=0.1005, over 4266457.06 frames. ], batch size: 316, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:08:15,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=822174.0, ans=0.0 2023-06-20 23:09:15,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=822354.0, ans=0.1 2023-06-20 23:09:15,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=822354.0, ans=0.125 2023-06-20 23:09:44,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=822414.0, ans=0.95 2023-06-20 23:09:47,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=822414.0, ans=0.1 2023-06-20 23:09:59,504 INFO [train.py:996] (2/4) Epoch 5, batch 15100, loss[loss=0.3766, simple_loss=0.4201, pruned_loss=0.1666, over 21314.00 frames. ], tot_loss[loss=0.2738, simple_loss=0.3449, pruned_loss=0.1013, over 4271682.52 frames. ], batch size: 507, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:10:21,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=822534.0, ans=0.125 2023-06-20 23:10:22,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=822534.0, ans=0.125 2023-06-20 23:10:25,330 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.218e+02 4.050e+02 5.256e+02 8.500e+02, threshold=8.100e+02, percent-clipped=5.0 2023-06-20 23:10:26,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=822534.0, ans=0.2 2023-06-20 23:11:06,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. 
limit=10.0 2023-06-20 23:11:43,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=822774.0, ans=0.0 2023-06-20 23:11:43,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=822774.0, ans=0.0 2023-06-20 23:11:44,948 INFO [train.py:996] (2/4) Epoch 5, batch 15150, loss[loss=0.2433, simple_loss=0.2934, pruned_loss=0.09663, over 21186.00 frames. ], tot_loss[loss=0.2718, simple_loss=0.3416, pruned_loss=0.1011, over 4266513.05 frames. ], batch size: 548, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:11:45,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=822774.0, ans=0.2 2023-06-20 23:11:58,753 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-20 23:12:05,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=822834.0, ans=0.125 2023-06-20 23:12:23,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=822894.0, ans=0.0 2023-06-20 23:12:49,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.10 vs. limit=15.0 2023-06-20 23:12:51,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=822954.0, ans=0.125 2023-06-20 23:13:24,130 INFO [train.py:996] (2/4) Epoch 5, batch 15200, loss[loss=0.239, simple_loss=0.3208, pruned_loss=0.07863, over 21267.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3319, pruned_loss=0.09653, over 4263160.18 frames. ], batch size: 551, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:13:31,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=823074.0, ans=22.5 2023-06-20 23:13:34,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=823074.0, ans=0.0 2023-06-20 23:13:41,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-20 23:13:45,055 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.736e+02 3.206e+02 4.003e+02 7.087e+02, threshold=6.412e+02, percent-clipped=0.0 2023-06-20 23:13:54,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=823134.0, ans=0.2 2023-06-20 23:14:31,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=823254.0, ans=0.125 2023-06-20 23:15:01,163 INFO [train.py:996] (2/4) Epoch 5, batch 15250, loss[loss=0.2513, simple_loss=0.3076, pruned_loss=0.09754, over 21676.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3243, pruned_loss=0.09406, over 4264732.98 frames. ], batch size: 282, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:16:05,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=823554.0, ans=0.0 2023-06-20 23:16:42,974 INFO [train.py:996] (2/4) Epoch 5, batch 15300, loss[loss=0.2783, simple_loss=0.3464, pruned_loss=0.1051, over 21754.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3283, pruned_loss=0.09749, over 4263012.67 frames. 
], batch size: 332, lr: 6.23e-03, grad_scale: 32.0 2023-06-20 23:17:04,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.998e+02 3.594e+02 4.256e+02 7.669e+02, threshold=7.187e+02, percent-clipped=3.0 2023-06-20 23:17:14,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=823734.0, ans=0.125 2023-06-20 23:18:23,683 INFO [train.py:996] (2/4) Epoch 5, batch 15350, loss[loss=0.2658, simple_loss=0.3488, pruned_loss=0.09139, over 21819.00 frames. ], tot_loss[loss=0.2672, simple_loss=0.3352, pruned_loss=0.09965, over 4263001.08 frames. ], batch size: 118, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:18:38,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=824034.0, ans=0.125 2023-06-20 23:18:40,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=824034.0, ans=0.1 2023-06-20 23:18:45,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=824034.0, ans=0.2 2023-06-20 23:20:03,477 INFO [train.py:996] (2/4) Epoch 5, batch 15400, loss[loss=0.2454, simple_loss=0.3242, pruned_loss=0.08325, over 16318.00 frames. ], tot_loss[loss=0.2657, simple_loss=0.3371, pruned_loss=0.09716, over 4259525.07 frames. ], batch size: 63, lr: 6.23e-03, grad_scale: 16.0 2023-06-20 23:20:25,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.899e+02 3.241e+02 4.047e+02 6.361e+02, threshold=6.483e+02, percent-clipped=0.0 2023-06-20 23:20:55,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.23 vs. limit=15.0 2023-06-20 23:20:56,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=824394.0, ans=0.0 2023-06-20 23:21:39,529 INFO [train.py:996] (2/4) Epoch 5, batch 15450, loss[loss=0.3197, simple_loss=0.3967, pruned_loss=0.1213, over 20005.00 frames. ], tot_loss[loss=0.2636, simple_loss=0.3346, pruned_loss=0.09632, over 4253810.80 frames. ], batch size: 703, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:21:57,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-20 23:22:35,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=824694.0, ans=0.04949747468305833 2023-06-20 23:22:41,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=824754.0, ans=0.2 2023-06-20 23:22:52,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=824754.0, ans=0.2 2023-06-20 23:23:17,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=824814.0, ans=0.5 2023-06-20 23:23:20,727 INFO [train.py:996] (2/4) Epoch 5, batch 15500, loss[loss=0.2337, simple_loss=0.2757, pruned_loss=0.09585, over 20838.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3358, pruned_loss=0.09634, over 4248259.22 frames. 
], batch size: 611, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:23:35,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.20 vs. limit=22.5 2023-06-20 23:23:54,396 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.818e+02 3.290e+02 3.883e+02 6.635e+02, threshold=6.579e+02, percent-clipped=1.0 2023-06-20 23:25:01,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=825174.0, ans=0.125 2023-06-20 23:25:02,125 INFO [train.py:996] (2/4) Epoch 5, batch 15550, loss[loss=0.2045, simple_loss=0.2584, pruned_loss=0.07531, over 17191.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3322, pruned_loss=0.09398, over 4250384.69 frames. ], batch size: 68, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:25:49,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=825294.0, ans=0.1 2023-06-20 23:26:19,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=825354.0, ans=0.0 2023-06-20 23:26:22,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=825354.0, ans=0.2 2023-06-20 23:26:40,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-20 23:26:42,193 INFO [train.py:996] (2/4) Epoch 5, batch 15600, loss[loss=0.2705, simple_loss=0.3364, pruned_loss=0.1023, over 21763.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.324, pruned_loss=0.09249, over 4251109.58 frames. ], batch size: 371, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:27:03,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=825534.0, ans=0.1 2023-06-20 23:27:09,690 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.848e+02 3.319e+02 3.887e+02 5.745e+02, threshold=6.638e+02, percent-clipped=0.0 2023-06-20 23:27:16,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=825534.0, ans=0.125 2023-06-20 23:27:23,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.33 vs. limit=10.0 2023-06-20 23:27:35,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=825594.0, ans=0.0 2023-06-20 23:28:01,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=825714.0, ans=0.1 2023-06-20 23:28:17,191 INFO [train.py:996] (2/4) Epoch 5, batch 15650, loss[loss=0.2658, simple_loss=0.3312, pruned_loss=0.1002, over 21316.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3229, pruned_loss=0.09168, over 4255900.04 frames. ], batch size: 471, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:28:55,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=825834.0, ans=0.05 2023-06-20 23:29:13,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. 
limit=22.5 2023-06-20 23:30:01,354 INFO [train.py:996] (2/4) Epoch 5, batch 15700, loss[loss=0.2325, simple_loss=0.3186, pruned_loss=0.07322, over 21686.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3183, pruned_loss=0.09017, over 4257813.98 frames. ], batch size: 332, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:30:18,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=826074.0, ans=0.0 2023-06-20 23:30:29,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.891e+02 2.764e+02 3.253e+02 4.322e+02 6.346e+02, threshold=6.507e+02, percent-clipped=0.0 2023-06-20 23:31:02,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=826194.0, ans=0.125 2023-06-20 23:31:16,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=826254.0, ans=0.125 2023-06-20 23:31:19,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=826314.0, ans=0.02 2023-06-20 23:31:41,403 INFO [train.py:996] (2/4) Epoch 5, batch 15750, loss[loss=0.2127, simple_loss=0.2761, pruned_loss=0.07463, over 15244.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3154, pruned_loss=0.09025, over 4248426.77 frames. ], batch size: 61, lr: 6.22e-03, grad_scale: 32.0 2023-06-20 23:32:01,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.10 vs. limit=15.0 2023-06-20 23:32:20,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=826494.0, ans=0.05 2023-06-20 23:32:32,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=826494.0, ans=0.0 2023-06-20 23:33:13,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=826614.0, ans=0.125 2023-06-20 23:33:16,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=826614.0, ans=0.0 2023-06-20 23:33:21,188 INFO [train.py:996] (2/4) Epoch 5, batch 15800, loss[loss=0.2885, simple_loss=0.3291, pruned_loss=0.1239, over 21325.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3119, pruned_loss=0.09045, over 4258753.44 frames. ], batch size: 473, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:33:28,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=826674.0, ans=0.125 2023-06-20 23:33:50,790 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.927e+02 3.607e+02 4.746e+02 7.598e+02, threshold=7.214e+02, percent-clipped=2.0 2023-06-20 23:34:38,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=826854.0, ans=0.125 2023-06-20 23:35:01,916 INFO [train.py:996] (2/4) Epoch 5, batch 15850, loss[loss=0.2678, simple_loss=0.3306, pruned_loss=0.1025, over 21474.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3166, pruned_loss=0.09363, over 4261482.55 frames. 
], batch size: 389, lr: 6.22e-03, grad_scale: 16.0 2023-06-20 23:35:49,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=827094.0, ans=0.1 2023-06-20 23:36:13,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=827154.0, ans=0.125 2023-06-20 23:36:27,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=827214.0, ans=0.0 2023-06-20 23:36:41,687 INFO [train.py:996] (2/4) Epoch 5, batch 15900, loss[loss=0.2724, simple_loss=0.3416, pruned_loss=0.1016, over 21596.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.315, pruned_loss=0.09337, over 4241947.69 frames. ], batch size: 441, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:37:11,686 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.862e+02 3.189e+02 4.240e+02 8.969e+02, threshold=6.379e+02, percent-clipped=1.0 2023-06-20 23:37:33,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=8.0 2023-06-20 23:38:10,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=827514.0, ans=0.5 2023-06-20 23:38:18,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=827514.0, ans=0.125 2023-06-20 23:38:19,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-20 23:38:20,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=827514.0, ans=0.07 2023-06-20 23:38:22,891 INFO [train.py:996] (2/4) Epoch 5, batch 15950, loss[loss=0.213, simple_loss=0.2905, pruned_loss=0.06772, over 20693.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3136, pruned_loss=0.0904, over 4250276.38 frames. ], batch size: 608, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:38:25,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.76 vs. limit=22.5 2023-06-20 23:39:05,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=827694.0, ans=0.0 2023-06-20 23:39:18,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.24 vs. limit=22.5 2023-06-20 23:39:23,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-20 23:39:38,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=827754.0, ans=0.07 2023-06-20 23:39:56,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=827874.0, ans=0.0 2023-06-20 23:39:57,226 INFO [train.py:996] (2/4) Epoch 5, batch 16000, loss[loss=0.2565, simple_loss=0.3399, pruned_loss=0.08659, over 21650.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3146, pruned_loss=0.08757, over 4262713.93 frames. 
], batch size: 263, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:40:06,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.00 vs. limit=15.0 2023-06-20 23:40:17,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=827874.0, ans=0.125 2023-06-20 23:40:30,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.522e+02 3.012e+02 3.700e+02 7.317e+02, threshold=6.025e+02, percent-clipped=2.0 2023-06-20 23:40:32,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=9.19 vs. limit=22.5 2023-06-20 23:40:45,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-20 23:41:20,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=828054.0, ans=0.0 2023-06-20 23:41:43,645 INFO [train.py:996] (2/4) Epoch 5, batch 16050, loss[loss=0.195, simple_loss=0.2503, pruned_loss=0.06981, over 16942.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3168, pruned_loss=0.0857, over 4262827.35 frames. ], batch size: 63, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:41:45,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=828174.0, ans=0.125 2023-06-20 23:42:04,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=828234.0, ans=0.95 2023-06-20 23:43:23,974 INFO [train.py:996] (2/4) Epoch 5, batch 16100, loss[loss=0.2639, simple_loss=0.3189, pruned_loss=0.1044, over 21800.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3208, pruned_loss=0.08684, over 4267452.90 frames. ], batch size: 247, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:43:29,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=828474.0, ans=0.1 2023-06-20 23:43:32,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=828474.0, ans=0.125 2023-06-20 23:43:52,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.033e+02 2.759e+02 3.248e+02 4.030e+02 6.532e+02, threshold=6.496e+02, percent-clipped=1.0 2023-06-20 23:44:57,679 INFO [train.py:996] (2/4) Epoch 5, batch 16150, loss[loss=0.2705, simple_loss=0.3302, pruned_loss=0.1054, over 21951.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3202, pruned_loss=0.08968, over 4280433.64 frames. ], batch size: 316, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:45:17,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=828774.0, ans=0.125 2023-06-20 23:46:04,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=828954.0, ans=0.2 2023-06-20 23:46:40,207 INFO [train.py:996] (2/4) Epoch 5, batch 16200, loss[loss=0.2759, simple_loss=0.3505, pruned_loss=0.1007, over 21347.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.325, pruned_loss=0.09188, over 4283508.30 frames. 
], batch size: 548, lr: 6.21e-03, grad_scale: 32.0 2023-06-20 23:47:09,190 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 2.854e+02 3.310e+02 3.979e+02 8.024e+02, threshold=6.619e+02, percent-clipped=1.0 2023-06-20 23:48:19,437 INFO [train.py:996] (2/4) Epoch 5, batch 16250, loss[loss=0.2506, simple_loss=0.3142, pruned_loss=0.09354, over 21356.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3239, pruned_loss=0.0911, over 4277812.02 frames. ], batch size: 471, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:48:36,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=829374.0, ans=0.125 2023-06-20 23:48:40,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=12.0 2023-06-20 23:48:50,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=829434.0, ans=0.0 2023-06-20 23:48:57,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=829494.0, ans=0.125 2023-06-20 23:49:44,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=829614.0, ans=0.1 2023-06-20 23:49:58,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-20 23:50:01,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=829614.0, ans=0.125 2023-06-20 23:50:03,659 INFO [train.py:996] (2/4) Epoch 5, batch 16300, loss[loss=0.2146, simple_loss=0.3193, pruned_loss=0.05493, over 21176.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3198, pruned_loss=0.08682, over 4276230.55 frames. ], batch size: 548, lr: 6.21e-03, grad_scale: 16.0 2023-06-20 23:50:07,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=829674.0, ans=0.125 2023-06-20 23:50:09,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=829674.0, ans=0.1 2023-06-20 23:50:29,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.495e+02 2.799e+02 3.333e+02 5.849e+02, threshold=5.597e+02, percent-clipped=0.0 2023-06-20 23:51:44,334 INFO [train.py:996] (2/4) Epoch 5, batch 16350, loss[loss=0.2642, simple_loss=0.3294, pruned_loss=0.09949, over 21347.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3216, pruned_loss=0.08907, over 4281037.39 frames. ], batch size: 176, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:51:55,190 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=22.5 2023-06-20 23:52:31,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=830094.0, ans=0.125 2023-06-20 23:52:36,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.18 vs. 
limit=12.0 2023-06-20 23:52:45,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=830154.0, ans=0.0 2023-06-20 23:52:57,061 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=17.50 vs. limit=15.0 2023-06-20 23:53:15,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-20 23:53:23,363 INFO [train.py:996] (2/4) Epoch 5, batch 16400, loss[loss=0.2756, simple_loss=0.3366, pruned_loss=0.1073, over 21879.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3276, pruned_loss=0.09215, over 4282769.28 frames. ], batch size: 118, lr: 6.20e-03, grad_scale: 32.0 2023-06-20 23:53:52,854 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.252e+02 2.889e+02 3.302e+02 3.961e+02 7.962e+02, threshold=6.603e+02, percent-clipped=4.0 2023-06-20 23:54:23,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=830454.0, ans=0.0 2023-06-20 23:54:31,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-20 23:54:34,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=830454.0, ans=0.125 2023-06-20 23:54:43,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=830514.0, ans=0.2 2023-06-20 23:55:02,591 INFO [train.py:996] (2/4) Epoch 5, batch 16450, loss[loss=0.2722, simple_loss=0.3311, pruned_loss=0.1066, over 21664.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3262, pruned_loss=0.0925, over 4292297.31 frames. ], batch size: 263, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:56:37,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=830814.0, ans=0.0 2023-06-20 23:56:41,428 INFO [train.py:996] (2/4) Epoch 5, batch 16500, loss[loss=0.2588, simple_loss=0.3233, pruned_loss=0.09712, over 21852.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3255, pruned_loss=0.09296, over 4289681.63 frames. ], batch size: 316, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:56:45,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=830874.0, ans=10.0 2023-06-20 23:57:06,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=830934.0, ans=0.0 2023-06-20 23:57:19,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 3.018e+02 3.661e+02 4.243e+02 1.006e+03, threshold=7.323e+02, percent-clipped=9.0 2023-06-20 23:58:20,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=12.0 2023-06-20 23:58:23,189 INFO [train.py:996] (2/4) Epoch 5, batch 16550, loss[loss=0.2219, simple_loss=0.297, pruned_loss=0.07336, over 21512.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3232, pruned_loss=0.09, over 4285891.10 frames. 
], batch size: 194, lr: 6.20e-03, grad_scale: 16.0 2023-06-20 23:58:34,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=831174.0, ans=0.0 2023-06-20 23:59:37,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=831354.0, ans=0.05 2023-06-20 23:59:39,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=831354.0, ans=0.1 2023-06-21 00:00:14,158 INFO [train.py:996] (2/4) Epoch 5, batch 16600, loss[loss=0.2819, simple_loss=0.3592, pruned_loss=0.1023, over 21294.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3309, pruned_loss=0.09304, over 4281403.03 frames. ], batch size: 548, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:00:19,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.63 vs. limit=6.0 2023-06-21 00:00:42,891 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.262e+02 3.858e+02 4.542e+02 8.769e+02, threshold=7.716e+02, percent-clipped=2.0 2023-06-21 00:01:57,527 INFO [train.py:996] (2/4) Epoch 5, batch 16650, loss[loss=0.2732, simple_loss=0.3434, pruned_loss=0.1014, over 21234.00 frames. ], tot_loss[loss=0.2645, simple_loss=0.3388, pruned_loss=0.09514, over 4276625.59 frames. ], batch size: 143, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:01:59,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=831774.0, ans=0.125 2023-06-21 00:02:03,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=831774.0, ans=0.125 2023-06-21 00:02:05,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=831774.0, ans=0.0 2023-06-21 00:02:11,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=831774.0, ans=0.0 2023-06-21 00:02:41,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=831894.0, ans=0.2 2023-06-21 00:02:42,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.96 vs. limit=10.0 2023-06-21 00:03:33,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-21 00:03:40,747 INFO [train.py:996] (2/4) Epoch 5, batch 16700, loss[loss=0.2031, simple_loss=0.2525, pruned_loss=0.07681, over 21059.00 frames. ], tot_loss[loss=0.2655, simple_loss=0.339, pruned_loss=0.09601, over 4269240.72 frames. ], batch size: 143, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:03:53,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=832074.0, ans=0.0 2023-06-21 00:04:18,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 2.932e+02 3.507e+02 4.315e+02 8.242e+02, threshold=7.013e+02, percent-clipped=1.0 2023-06-21 00:04:26,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. 
limit=15.0 2023-06-21 00:04:53,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=832254.0, ans=0.2 2023-06-21 00:04:56,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=832254.0, ans=0.0 2023-06-21 00:05:22,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=10.0 2023-06-21 00:05:30,390 INFO [train.py:996] (2/4) Epoch 5, batch 16750, loss[loss=0.2587, simple_loss=0.3261, pruned_loss=0.09565, over 21328.00 frames. ], tot_loss[loss=0.2696, simple_loss=0.3429, pruned_loss=0.09813, over 4268823.07 frames. ], batch size: 176, lr: 6.20e-03, grad_scale: 16.0 2023-06-21 00:05:44,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=832374.0, ans=0.125 2023-06-21 00:05:49,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=832374.0, ans=0.125 2023-06-21 00:07:16,842 INFO [train.py:996] (2/4) Epoch 5, batch 16800, loss[loss=0.2513, simple_loss=0.3233, pruned_loss=0.08965, over 21841.00 frames. ], tot_loss[loss=0.2711, simple_loss=0.3457, pruned_loss=0.09825, over 4270891.62 frames. ], batch size: 298, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:07:22,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=832674.0, ans=0.125 2023-06-21 00:07:25,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=832674.0, ans=0.0 2023-06-21 00:07:37,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.30 vs. limit=15.0 2023-06-21 00:07:48,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.643e+02 3.479e+02 3.935e+02 4.857e+02 8.503e+02, threshold=7.870e+02, percent-clipped=2.0 2023-06-21 00:07:57,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=832794.0, ans=0.125 2023-06-21 00:08:03,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=832794.0, ans=0.125 2023-06-21 00:08:25,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-21 00:08:48,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=22.5 2023-06-21 00:08:55,649 INFO [train.py:996] (2/4) Epoch 5, batch 16850, loss[loss=0.2578, simple_loss=0.3178, pruned_loss=0.09889, over 21874.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.341, pruned_loss=0.09833, over 4280557.65 frames. 
], batch size: 351, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:09:04,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=832974.0, ans=0.125 2023-06-21 00:10:01,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=833154.0, ans=0.0 2023-06-21 00:10:01,395 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.290e-02 2023-06-21 00:10:35,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.16 vs. limit=22.5 2023-06-21 00:10:35,990 INFO [train.py:996] (2/4) Epoch 5, batch 16900, loss[loss=0.2908, simple_loss=0.4034, pruned_loss=0.08911, over 20738.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3364, pruned_loss=0.09763, over 4281984.73 frames. ], batch size: 607, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:10:49,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=833274.0, ans=0.125 2023-06-21 00:11:05,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=12.0 2023-06-21 00:11:07,330 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.916e+02 2.951e+02 3.440e+02 4.010e+02 6.855e+02, threshold=6.879e+02, percent-clipped=0.0 2023-06-21 00:12:09,932 INFO [train.py:996] (2/4) Epoch 5, batch 16950, loss[loss=0.2594, simple_loss=0.3175, pruned_loss=0.1006, over 21393.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3315, pruned_loss=0.09672, over 4283659.09 frames. ], batch size: 159, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:12:25,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=833574.0, ans=0.125 2023-06-21 00:12:38,163 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.41 vs. limit=22.5 2023-06-21 00:12:53,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=833694.0, ans=0.125 2023-06-21 00:12:56,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=833694.0, ans=0.125 2023-06-21 00:12:56,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=833694.0, ans=0.125 2023-06-21 00:13:03,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=833694.0, ans=0.125 2023-06-21 00:13:38,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=833814.0, ans=0.125 2023-06-21 00:13:38,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=833814.0, ans=0.07 2023-06-21 00:13:49,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=833814.0, ans=0.125 2023-06-21 00:13:59,437 INFO [train.py:996] (2/4) Epoch 5, batch 17000, loss[loss=0.2462, simple_loss=0.3116, pruned_loss=0.09042, over 21860.00 frames. 
], tot_loss[loss=0.2613, simple_loss=0.329, pruned_loss=0.0968, over 4288328.85 frames. ], batch size: 298, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:14:09,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=833874.0, ans=0.125 2023-06-21 00:14:27,675 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 2.869e+02 3.423e+02 4.013e+02 9.065e+02, threshold=6.846e+02, percent-clipped=1.0 2023-06-21 00:14:55,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=834054.0, ans=6.0 2023-06-21 00:15:36,741 INFO [train.py:996] (2/4) Epoch 5, batch 17050, loss[loss=0.3209, simple_loss=0.3916, pruned_loss=0.1251, over 21390.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3371, pruned_loss=0.09953, over 4296259.52 frames. ], batch size: 548, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:15:45,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-21 00:15:52,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=834234.0, ans=0.125 2023-06-21 00:16:15,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=834294.0, ans=0.0 2023-06-21 00:16:41,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=834354.0, ans=0.1 2023-06-21 00:16:50,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-21 00:16:50,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=12.0 2023-06-21 00:17:14,715 INFO [train.py:996] (2/4) Epoch 5, batch 17100, loss[loss=0.247, simple_loss=0.3141, pruned_loss=0.09001, over 21773.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3357, pruned_loss=0.1001, over 4298290.06 frames. ], batch size: 389, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:17:24,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=834474.0, ans=0.125 2023-06-21 00:17:43,339 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.304e+02 3.091e+02 3.634e+02 4.796e+02 1.009e+03, threshold=7.268e+02, percent-clipped=8.0 2023-06-21 00:18:24,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=834654.0, ans=0.125 2023-06-21 00:18:53,479 INFO [train.py:996] (2/4) Epoch 5, batch 17150, loss[loss=0.2062, simple_loss=0.2779, pruned_loss=0.06723, over 21326.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3298, pruned_loss=0.09722, over 4288432.44 frames. ], batch size: 143, lr: 6.19e-03, grad_scale: 16.0 2023-06-21 00:19:08,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=834834.0, ans=0.125 2023-06-21 00:19:22,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=18.42 vs. 
limit=22.5 2023-06-21 00:19:44,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=834894.0, ans=0.1 2023-06-21 00:19:51,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.89 vs. limit=6.0 2023-06-21 00:20:00,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=834954.0, ans=0.125 2023-06-21 00:20:16,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-21 00:20:29,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=835014.0, ans=0.125 2023-06-21 00:20:33,460 INFO [train.py:996] (2/4) Epoch 5, batch 17200, loss[loss=0.2963, simple_loss=0.3942, pruned_loss=0.09922, over 19741.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3307, pruned_loss=0.09791, over 4293043.79 frames. ], batch size: 703, lr: 6.19e-03, grad_scale: 32.0 2023-06-21 00:21:12,568 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.241e+02 2.764e+02 3.023e+02 3.387e+02 5.035e+02, threshold=6.046e+02, percent-clipped=0.0 2023-06-21 00:21:18,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.12 vs. limit=15.0 2023-06-21 00:21:37,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=835254.0, ans=0.125 2023-06-21 00:21:56,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-06-21 00:22:01,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-21 00:22:08,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=835314.0, ans=0.125 2023-06-21 00:22:19,280 INFO [train.py:996] (2/4) Epoch 5, batch 17250, loss[loss=0.3258, simple_loss=0.3861, pruned_loss=0.1328, over 21316.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.334, pruned_loss=0.09909, over 4287341.71 frames. ], batch size: 507, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:22:35,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=835434.0, ans=0.05 2023-06-21 00:22:49,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=835434.0, ans=0.0 2023-06-21 00:23:31,898 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:23:50,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=835614.0, ans=0.0 2023-06-21 00:23:54,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=835614.0, ans=0.125 2023-06-21 00:24:02,216 INFO [train.py:996] (2/4) Epoch 5, batch 17300, loss[loss=0.2641, simple_loss=0.3441, pruned_loss=0.09203, over 21522.00 frames. 
], tot_loss[loss=0.2729, simple_loss=0.341, pruned_loss=0.1024, over 4284331.79 frames. ], batch size: 112, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:24:06,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=835674.0, ans=0.2 2023-06-21 00:24:41,698 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.738e+02 3.630e+02 4.657e+02 6.212e+02 1.066e+03, threshold=9.314e+02, percent-clipped=26.0 2023-06-21 00:25:48,518 INFO [train.py:996] (2/4) Epoch 5, batch 17350, loss[loss=0.2443, simple_loss=0.35, pruned_loss=0.06928, over 21282.00 frames. ], tot_loss[loss=0.2712, simple_loss=0.3406, pruned_loss=0.1009, over 4285371.97 frames. ], batch size: 548, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:26:27,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=836034.0, ans=0.1 2023-06-21 00:26:32,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=836094.0, ans=0.1 2023-06-21 00:27:29,134 INFO [train.py:996] (2/4) Epoch 5, batch 17400, loss[loss=0.2339, simple_loss=0.3061, pruned_loss=0.08083, over 21710.00 frames. ], tot_loss[loss=0.2651, simple_loss=0.336, pruned_loss=0.09711, over 4275675.39 frames. ], batch size: 247, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:27:29,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=836274.0, ans=0.2 2023-06-21 00:27:42,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=836274.0, ans=0.125 2023-06-21 00:27:57,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=836334.0, ans=0.0 2023-06-21 00:28:09,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=836334.0, ans=0.0 2023-06-21 00:28:10,152 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 2.783e+02 3.227e+02 3.615e+02 5.491e+02, threshold=6.454e+02, percent-clipped=0.0 2023-06-21 00:28:14,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=15.0 2023-06-21 00:28:38,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.54 vs. limit=22.5 2023-06-21 00:28:55,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=836514.0, ans=0.1 2023-06-21 00:29:11,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=836514.0, ans=0.0 2023-06-21 00:29:16,092 INFO [train.py:996] (2/4) Epoch 5, batch 17450, loss[loss=0.1935, simple_loss=0.2824, pruned_loss=0.05235, over 21369.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3335, pruned_loss=0.09464, over 4272650.33 frames. ], batch size: 211, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:29:28,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. 
limit=15.0 2023-06-21 00:29:32,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=836574.0, ans=0.0 2023-06-21 00:29:35,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=836634.0, ans=0.125 2023-06-21 00:30:13,015 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:30:30,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=836754.0, ans=0.125 2023-06-21 00:30:38,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=836814.0, ans=0.015 2023-06-21 00:30:42,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=836814.0, ans=0.07 2023-06-21 00:30:45,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.53 vs. limit=15.0 2023-06-21 00:31:00,469 INFO [train.py:996] (2/4) Epoch 5, batch 17500, loss[loss=0.2697, simple_loss=0.338, pruned_loss=0.1008, over 21871.00 frames. ], tot_loss[loss=0.257, simple_loss=0.33, pruned_loss=0.09203, over 4275443.21 frames. ], batch size: 124, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:31:13,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=836874.0, ans=0.125 2023-06-21 00:31:30,406 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:31:33,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=836934.0, ans=0.1 2023-06-21 00:31:34,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.915e+02 2.759e+02 3.126e+02 4.015e+02 6.726e+02, threshold=6.252e+02, percent-clipped=1.0 2023-06-21 00:31:45,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=836994.0, ans=0.2 2023-06-21 00:32:32,779 INFO [train.py:996] (2/4) Epoch 5, batch 17550, loss[loss=0.2188, simple_loss=0.3166, pruned_loss=0.06053, over 21371.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3278, pruned_loss=0.09049, over 4278111.35 frames. 
], batch size: 131, lr: 6.18e-03, grad_scale: 16.0 2023-06-21 00:33:08,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=837234.0, ans=0.125 2023-06-21 00:33:09,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=837234.0, ans=0.0 2023-06-21 00:33:13,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=837294.0, ans=0.2 2023-06-21 00:33:16,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=837294.0, ans=0.125 2023-06-21 00:33:45,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=837354.0, ans=0.125 2023-06-21 00:33:58,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=837414.0, ans=0.0 2023-06-21 00:33:59,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=837414.0, ans=0.125 2023-06-21 00:34:17,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=837474.0, ans=0.5 2023-06-21 00:34:18,785 INFO [train.py:996] (2/4) Epoch 5, batch 17600, loss[loss=0.248, simple_loss=0.3247, pruned_loss=0.08562, over 21733.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3293, pruned_loss=0.09025, over 4276693.52 frames. ], batch size: 298, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:34:30,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=837474.0, ans=0.95 2023-06-21 00:34:48,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=837534.0, ans=0.125 2023-06-21 00:34:53,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.052e+02 2.862e+02 3.527e+02 4.406e+02 6.176e+02, threshold=7.053e+02, percent-clipped=0.0 2023-06-21 00:35:30,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=837654.0, ans=0.1 2023-06-21 00:35:48,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-21 00:35:48,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.25 vs. limit=10.0 2023-06-21 00:35:59,224 INFO [train.py:996] (2/4) Epoch 5, batch 17650, loss[loss=0.2177, simple_loss=0.299, pruned_loss=0.06823, over 21706.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.328, pruned_loss=0.09055, over 4268393.38 frames. 
], batch size: 391, lr: 6.18e-03, grad_scale: 32.0 2023-06-21 00:36:02,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=837774.0, ans=0.2 2023-06-21 00:36:07,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=837774.0, ans=0.1 2023-06-21 00:37:05,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=837954.0, ans=0.04949747468305833 2023-06-21 00:37:34,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=838014.0, ans=0.125 2023-06-21 00:37:35,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.90 vs. limit=5.0 2023-06-21 00:37:42,016 INFO [train.py:996] (2/4) Epoch 5, batch 17700, loss[loss=0.2628, simple_loss=0.3417, pruned_loss=0.09194, over 20776.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3224, pruned_loss=0.08794, over 4261530.94 frames. ], batch size: 607, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:38:17,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 2.950e+02 3.482e+02 4.668e+02 9.100e+02, threshold=6.963e+02, percent-clipped=4.0 2023-06-21 00:38:59,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.16 vs. limit=6.0 2023-06-21 00:39:00,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=838314.0, ans=0.0 2023-06-21 00:39:10,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=838314.0, ans=0.1 2023-06-21 00:39:21,483 INFO [train.py:996] (2/4) Epoch 5, batch 17750, loss[loss=0.2251, simple_loss=0.3192, pruned_loss=0.06553, over 20733.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3287, pruned_loss=0.09131, over 4259384.96 frames. ], batch size: 607, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:39:23,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=838374.0, ans=0.125 2023-06-21 00:39:44,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=838434.0, ans=0.125 2023-06-21 00:39:46,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=838434.0, ans=0.0 2023-06-21 00:40:40,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=838554.0, ans=0.125 2023-06-21 00:40:48,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=12.0 2023-06-21 00:41:07,673 INFO [train.py:996] (2/4) Epoch 5, batch 17800, loss[loss=0.2052, simple_loss=0.2735, pruned_loss=0.06844, over 21299.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3272, pruned_loss=0.09015, over 4260158.20 frames. 
], batch size: 176, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:41:18,594 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:41:18,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=838674.0, ans=0.125 2023-06-21 00:41:28,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=838734.0, ans=0.0 2023-06-21 00:41:45,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=838734.0, ans=0.125 2023-06-21 00:41:49,222 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 2.927e+02 3.424e+02 3.955e+02 9.585e+02, threshold=6.848e+02, percent-clipped=3.0 2023-06-21 00:41:51,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=838794.0, ans=0.05 2023-06-21 00:41:57,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=838794.0, ans=0.0 2023-06-21 00:42:15,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=838854.0, ans=0.125 2023-06-21 00:42:21,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=838854.0, ans=0.125 2023-06-21 00:42:49,099 INFO [train.py:996] (2/4) Epoch 5, batch 17850, loss[loss=0.2593, simple_loss=0.3501, pruned_loss=0.08424, over 20660.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3303, pruned_loss=0.09196, over 4268921.88 frames. ], batch size: 607, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:43:50,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=839094.0, ans=0.0 2023-06-21 00:43:54,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.89 vs. limit=10.0 2023-06-21 00:44:04,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.46 vs. limit=12.0 2023-06-21 00:44:04,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=839154.0, ans=0.0 2023-06-21 00:44:28,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.56 vs. limit=22.5 2023-06-21 00:44:29,942 INFO [train.py:996] (2/4) Epoch 5, batch 17900, loss[loss=0.2621, simple_loss=0.3514, pruned_loss=0.08641, over 21868.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3353, pruned_loss=0.09401, over 4261525.84 frames. 
], batch size: 316, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:44:46,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=839274.0, ans=0.2 2023-06-21 00:45:19,616 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.205e+02 2.900e+02 3.378e+02 3.906e+02 6.654e+02, threshold=6.756e+02, percent-clipped=0.0 2023-06-21 00:45:45,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=839454.0, ans=0.2 2023-06-21 00:45:53,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=839454.0, ans=0.125 2023-06-21 00:46:21,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=839574.0, ans=0.0 2023-06-21 00:46:22,390 INFO [train.py:996] (2/4) Epoch 5, batch 17950, loss[loss=0.1927, simple_loss=0.2902, pruned_loss=0.04758, over 21795.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3335, pruned_loss=0.09014, over 4262337.62 frames. ], batch size: 351, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:46:35,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=839574.0, ans=0.125 2023-06-21 00:48:00,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=839874.0, ans=0.125 2023-06-21 00:48:01,523 INFO [train.py:996] (2/4) Epoch 5, batch 18000, loss[loss=0.2052, simple_loss=0.2589, pruned_loss=0.07572, over 21317.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.326, pruned_loss=0.08914, over 4265141.86 frames. ], batch size: 551, lr: 6.17e-03, grad_scale: 32.0 2023-06-21 00:48:01,523 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 00:48:09,690 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.5372, 3.3583, 1.8034, 1.8777], device='cuda:2') 2023-06-21 00:48:17,788 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2664, simple_loss=0.3658, pruned_loss=0.08353, over 1796401.00 frames. 2023-06-21 00:48:17,789 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 00:48:46,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=839934.0, ans=0.0 2023-06-21 00:49:02,451 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.602e+02 3.109e+02 3.503e+02 6.028e+02, threshold=6.218e+02, percent-clipped=0.0 2023-06-21 00:49:04,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=839994.0, ans=0.125 2023-06-21 00:49:51,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.11 vs. limit=15.0 2023-06-21 00:49:58,603 INFO [train.py:996] (2/4) Epoch 5, batch 18050, loss[loss=0.2568, simple_loss=0.322, pruned_loss=0.09578, over 21729.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3218, pruned_loss=0.08926, over 4259683.60 frames. 
], batch size: 333, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:50:33,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=840234.0, ans=10.0 2023-06-21 00:51:31,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=840414.0, ans=0.125 2023-06-21 00:51:33,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-21 00:51:39,040 INFO [train.py:996] (2/4) Epoch 5, batch 18100, loss[loss=0.2342, simple_loss=0.3341, pruned_loss=0.06714, over 21678.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3276, pruned_loss=0.09146, over 4269886.30 frames. ], batch size: 247, lr: 6.17e-03, grad_scale: 16.0 2023-06-21 00:52:15,657 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-06-21 00:52:27,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.184e+02 2.902e+02 3.495e+02 4.106e+02 8.308e+02, threshold=6.990e+02, percent-clipped=1.0 2023-06-21 00:52:29,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=840594.0, ans=10.0 2023-06-21 00:53:10,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=840714.0, ans=0.0 2023-06-21 00:53:22,811 INFO [train.py:996] (2/4) Epoch 5, batch 18150, loss[loss=0.2363, simple_loss=0.2919, pruned_loss=0.09034, over 21742.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3274, pruned_loss=0.09066, over 4257615.78 frames. ], batch size: 112, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:53:43,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=840834.0, ans=0.0 2023-06-21 00:53:44,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=840834.0, ans=0.125 2023-06-21 00:53:46,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=840834.0, ans=0.2 2023-06-21 00:53:58,024 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-21 00:54:35,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=840954.0, ans=0.125 2023-06-21 00:54:53,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=841074.0, ans=0.0 2023-06-21 00:54:54,873 INFO [train.py:996] (2/4) Epoch 5, batch 18200, loss[loss=0.2048, simple_loss=0.2704, pruned_loss=0.0696, over 21496.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3209, pruned_loss=0.09034, over 4259629.72 frames. 
], batch size: 230, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:55:34,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=841194.0, ans=0.125 2023-06-21 00:55:37,491 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.042e+02 2.776e+02 3.291e+02 4.569e+02 1.152e+03, threshold=6.583e+02, percent-clipped=3.0 2023-06-21 00:56:16,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=841314.0, ans=0.125 2023-06-21 00:56:32,310 INFO [train.py:996] (2/4) Epoch 5, batch 18250, loss[loss=0.1802, simple_loss=0.2526, pruned_loss=0.05392, over 21518.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3136, pruned_loss=0.08765, over 4260011.52 frames. ], batch size: 212, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:58:08,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=841614.0, ans=0.125 2023-06-21 00:58:11,277 INFO [train.py:996] (2/4) Epoch 5, batch 18300, loss[loss=0.2542, simple_loss=0.3484, pruned_loss=0.08002, over 21433.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3145, pruned_loss=0.0876, over 4263419.94 frames. ], batch size: 211, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 00:58:24,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-21 00:58:50,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=841734.0, ans=0.1 2023-06-21 00:58:52,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=841794.0, ans=0.0 2023-06-21 00:58:54,880 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.755e+02 2.809e+02 3.144e+02 3.817e+02 6.593e+02, threshold=6.288e+02, percent-clipped=1.0 2023-06-21 00:59:38,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=841914.0, ans=0.1 2023-06-21 00:59:39,586 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 00:59:49,953 INFO [train.py:996] (2/4) Epoch 5, batch 18350, loss[loss=0.2501, simple_loss=0.3049, pruned_loss=0.09763, over 21651.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3183, pruned_loss=0.08719, over 4262735.20 frames. ], batch size: 247, lr: 6.16e-03, grad_scale: 16.0 2023-06-21 01:00:42,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=842094.0, ans=0.09899494936611666 2023-06-21 01:01:12,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=842214.0, ans=0.0 2023-06-21 01:01:24,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=842214.0, ans=0.1 2023-06-21 01:01:25,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=842214.0, ans=0.2 2023-06-21 01:01:30,182 INFO [train.py:996] (2/4) Epoch 5, batch 18400, loss[loss=0.2809, simple_loss=0.364, pruned_loss=0.09891, over 21057.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3143, pruned_loss=0.08637, over 4263911.92 frames. 
], batch size: 607, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:01:33,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=842274.0, ans=0.125 2023-06-21 01:01:57,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-21 01:02:07,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-06-21 01:02:14,612 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.055e+02 2.972e+02 3.476e+02 4.424e+02 9.442e+02, threshold=6.951e+02, percent-clipped=6.0 2023-06-21 01:02:46,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.13 vs. limit=15.0 2023-06-21 01:02:56,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=842514.0, ans=0.025 2023-06-21 01:03:10,004 INFO [train.py:996] (2/4) Epoch 5, batch 18450, loss[loss=0.2597, simple_loss=0.3629, pruned_loss=0.07824, over 19863.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3128, pruned_loss=0.08313, over 4265637.71 frames. ], batch size: 702, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:03:45,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=842634.0, ans=0.125 2023-06-21 01:04:06,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=842754.0, ans=0.125 2023-06-21 01:04:17,756 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:04:47,261 INFO [train.py:996] (2/4) Epoch 5, batch 18500, loss[loss=0.2047, simple_loss=0.272, pruned_loss=0.06874, over 21507.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.307, pruned_loss=0.0808, over 4252207.67 frames. ], batch size: 212, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:05:30,379 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.509e+02 2.864e+02 3.266e+02 4.867e+02, threshold=5.728e+02, percent-clipped=0.0 2023-06-21 01:05:40,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=842994.0, ans=0.125 2023-06-21 01:05:50,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=843054.0, ans=0.1 2023-06-21 01:06:26,374 INFO [train.py:996] (2/4) Epoch 5, batch 18550, loss[loss=0.2125, simple_loss=0.289, pruned_loss=0.068, over 21570.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3036, pruned_loss=0.08005, over 4252578.45 frames. ], batch size: 230, lr: 6.16e-03, grad_scale: 32.0 2023-06-21 01:06:28,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=843174.0, ans=0.1 2023-06-21 01:06:53,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=843234.0, ans=0.015 2023-06-21 01:08:06,352 INFO [train.py:996] (2/4) Epoch 5, batch 18600, loss[loss=0.181, simple_loss=0.2492, pruned_loss=0.05637, over 21853.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3013, pruned_loss=0.08066, over 4238648.53 frames. 
], batch size: 107, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:08:36,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=843534.0, ans=0.0 2023-06-21 01:08:49,510 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 2.765e+02 3.271e+02 3.896e+02 6.265e+02, threshold=6.542e+02, percent-clipped=2.0 2023-06-21 01:08:51,728 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:09:22,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=843654.0, ans=0.125 2023-06-21 01:09:25,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=843714.0, ans=0.2 2023-06-21 01:09:39,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=843774.0, ans=0.125 2023-06-21 01:09:40,832 INFO [train.py:996] (2/4) Epoch 5, batch 18650, loss[loss=0.2521, simple_loss=0.3074, pruned_loss=0.09838, over 21287.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3011, pruned_loss=0.08145, over 4235856.38 frames. ], batch size: 131, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:10:22,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=843894.0, ans=0.125 2023-06-21 01:10:49,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=843954.0, ans=0.2 2023-06-21 01:11:13,524 INFO [train.py:996] (2/4) Epoch 5, batch 18700, loss[loss=0.2299, simple_loss=0.2865, pruned_loss=0.08661, over 21475.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.2988, pruned_loss=0.08276, over 4243325.69 frames. ], batch size: 212, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:11:20,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=844074.0, ans=0.1 2023-06-21 01:11:25,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=844074.0, ans=0.1 2023-06-21 01:11:26,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=844074.0, ans=0.1 2023-06-21 01:11:56,982 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.719e+02 3.161e+02 4.088e+02 6.146e+02, threshold=6.321e+02, percent-clipped=0.0 2023-06-21 01:12:52,653 INFO [train.py:996] (2/4) Epoch 5, batch 18750, loss[loss=0.2719, simple_loss=0.3425, pruned_loss=0.1007, over 21901.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3025, pruned_loss=0.086, over 4254115.40 frames. ], batch size: 372, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:13:05,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=844374.0, ans=0.1 2023-06-21 01:13:49,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=844554.0, ans=0.125 2023-06-21 01:14:23,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=844614.0, ans=0.125 2023-06-21 01:14:32,831 INFO [train.py:996] (2/4) Epoch 5, batch 18800, loss[loss=0.3261, simple_loss=0.3968, pruned_loss=0.1277, over 20719.00 frames. 
], tot_loss[loss=0.2436, simple_loss=0.3103, pruned_loss=0.08838, over 4253720.85 frames. ], batch size: 607, lr: 6.15e-03, grad_scale: 32.0 2023-06-21 01:14:45,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-21 01:14:57,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=844734.0, ans=0.125 2023-06-21 01:15:00,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=844734.0, ans=0.125 2023-06-21 01:15:05,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=844734.0, ans=0.125 2023-06-21 01:15:08,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=844794.0, ans=0.04949747468305833 2023-06-21 01:15:11,000 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.138e+02 3.115e+02 3.803e+02 4.953e+02 7.292e+02, threshold=7.607e+02, percent-clipped=7.0 2023-06-21 01:15:12,146 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=12.0 2023-06-21 01:15:15,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=844794.0, ans=0.95 2023-06-21 01:16:07,902 INFO [train.py:996] (2/4) Epoch 5, batch 18850, loss[loss=0.2452, simple_loss=0.3437, pruned_loss=0.07337, over 21269.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3071, pruned_loss=0.08345, over 4259635.84 frames. ], batch size: 548, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:16:08,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=844974.0, ans=0.0 2023-06-21 01:17:00,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-21 01:17:34,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=845214.0, ans=0.125 2023-06-21 01:17:46,578 INFO [train.py:996] (2/4) Epoch 5, batch 18900, loss[loss=0.2071, simple_loss=0.2816, pruned_loss=0.06633, over 21369.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3031, pruned_loss=0.08307, over 4258818.15 frames. ], batch size: 211, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:18:30,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=845394.0, ans=0.2 2023-06-21 01:18:31,689 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 2.559e+02 2.920e+02 3.727e+02 6.054e+02, threshold=5.840e+02, percent-clipped=0.0 2023-06-21 01:18:33,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=845394.0, ans=0.0 2023-06-21 01:18:47,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=845454.0, ans=0.125 2023-06-21 01:19:27,666 INFO [train.py:996] (2/4) Epoch 5, batch 18950, loss[loss=0.3322, simple_loss=0.4152, pruned_loss=0.1247, over 21612.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3051, pruned_loss=0.08542, over 4267679.00 frames. 
], batch size: 508, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:19:33,237 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:20:53,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=845814.0, ans=0.5 2023-06-21 01:21:02,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.66 vs. limit=15.0 2023-06-21 01:21:08,459 INFO [train.py:996] (2/4) Epoch 5, batch 19000, loss[loss=0.2915, simple_loss=0.3599, pruned_loss=0.1116, over 21749.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3143, pruned_loss=0.08749, over 4267294.19 frames. ], batch size: 441, lr: 6.15e-03, grad_scale: 16.0 2023-06-21 01:21:25,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=845874.0, ans=0.125 2023-06-21 01:21:26,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=845874.0, ans=0.0 2023-06-21 01:21:46,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=845934.0, ans=0.2 2023-06-21 01:21:53,552 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.876e+02 3.456e+02 4.188e+02 7.110e+02, threshold=6.912e+02, percent-clipped=2.0 2023-06-21 01:22:47,102 INFO [train.py:996] (2/4) Epoch 5, batch 19050, loss[loss=0.277, simple_loss=0.35, pruned_loss=0.102, over 21801.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3209, pruned_loss=0.09146, over 4281214.27 frames. ], batch size: 414, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:22:49,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-21 01:23:35,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=846294.0, ans=0.1 2023-06-21 01:23:36,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=22.5 2023-06-21 01:23:47,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=846294.0, ans=15.0 2023-06-21 01:23:54,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.05 vs. 
limit=15.0 2023-06-21 01:23:56,743 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:24:00,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=846354.0, ans=0.125 2023-06-21 01:24:09,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=846414.0, ans=0.125 2023-06-21 01:24:20,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=846414.0, ans=0.0 2023-06-21 01:24:24,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=846414.0, ans=0.125 2023-06-21 01:24:31,399 INFO [train.py:996] (2/4) Epoch 5, batch 19100, loss[loss=0.2362, simple_loss=0.2989, pruned_loss=0.08676, over 21988.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3191, pruned_loss=0.09248, over 4275039.48 frames. ], batch size: 103, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:25:22,023 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 2.859e+02 3.382e+02 4.111e+02 6.618e+02, threshold=6.763e+02, percent-clipped=0.0 2023-06-21 01:26:17,778 INFO [train.py:996] (2/4) Epoch 5, batch 19150, loss[loss=0.2524, simple_loss=0.3425, pruned_loss=0.08113, over 21621.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3212, pruned_loss=0.09395, over 4272671.04 frames. ], batch size: 263, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:27:27,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=846954.0, ans=0.0 2023-06-21 01:27:51,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=847014.0, ans=0.125 2023-06-21 01:28:00,638 INFO [train.py:996] (2/4) Epoch 5, batch 19200, loss[loss=0.2659, simple_loss=0.3614, pruned_loss=0.08521, over 21774.00 frames. ], tot_loss[loss=0.2595, simple_loss=0.3305, pruned_loss=0.09429, over 4270484.54 frames. 
], batch size: 351, lr: 6.14e-03, grad_scale: 32.0 2023-06-21 01:28:06,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=847074.0, ans=0.0 2023-06-21 01:28:07,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=847074.0, ans=0.0 2023-06-21 01:28:43,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=847194.0, ans=0.125 2023-06-21 01:28:47,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.827e+02 3.204e+02 4.140e+02 7.071e+02, threshold=6.408e+02, percent-clipped=1.0 2023-06-21 01:28:48,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=847194.0, ans=0.125 2023-06-21 01:29:17,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=847254.0, ans=0.125 2023-06-21 01:29:19,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=847254.0, ans=0.0 2023-06-21 01:29:38,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=847314.0, ans=0.125 2023-06-21 01:29:41,538 INFO [train.py:996] (2/4) Epoch 5, batch 19250, loss[loss=0.2012, simple_loss=0.2858, pruned_loss=0.05827, over 21622.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3322, pruned_loss=0.08939, over 4263164.12 frames. ], batch size: 263, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:29:43,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=847374.0, ans=0.0 2023-06-21 01:30:14,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=847434.0, ans=0.04949747468305833 2023-06-21 01:30:22,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-21 01:31:20,366 INFO [train.py:996] (2/4) Epoch 5, batch 19300, loss[loss=0.2369, simple_loss=0.3024, pruned_loss=0.08566, over 21312.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3285, pruned_loss=0.08766, over 4271542.96 frames. ], batch size: 159, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:32:05,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=847794.0, ans=0.125 2023-06-21 01:32:07,948 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.766e+02 2.704e+02 3.211e+02 3.924e+02 6.818e+02, threshold=6.422e+02, percent-clipped=2.0 2023-06-21 01:32:09,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=847794.0, ans=0.0 2023-06-21 01:33:00,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=847974.0, ans=0.125 2023-06-21 01:33:01,245 INFO [train.py:996] (2/4) Epoch 5, batch 19350, loss[loss=0.2346, simple_loss=0.3227, pruned_loss=0.07331, over 21604.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3238, pruned_loss=0.08561, over 4276193.00 frames. 
], batch size: 389, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:33:50,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=848094.0, ans=0.125 2023-06-21 01:33:55,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=848094.0, ans=0.2 2023-06-21 01:34:29,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=848214.0, ans=0.125 2023-06-21 01:34:39,461 INFO [train.py:996] (2/4) Epoch 5, batch 19400, loss[loss=0.2035, simple_loss=0.275, pruned_loss=0.06599, over 21254.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.321, pruned_loss=0.08459, over 4284155.24 frames. ], batch size: 176, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:34:52,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=848274.0, ans=0.125 2023-06-21 01:35:03,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=848334.0, ans=0.125 2023-06-21 01:35:25,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 2.765e+02 3.064e+02 3.598e+02 5.687e+02, threshold=6.129e+02, percent-clipped=0.0 2023-06-21 01:35:39,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=848394.0, ans=0.125 2023-06-21 01:36:22,718 INFO [train.py:996] (2/4) Epoch 5, batch 19450, loss[loss=0.2029, simple_loss=0.2633, pruned_loss=0.07129, over 21668.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3166, pruned_loss=0.08559, over 4290959.52 frames. ], batch size: 282, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:36:24,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=848574.0, ans=0.1 2023-06-21 01:36:47,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=848634.0, ans=0.125 2023-06-21 01:36:58,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=848694.0, ans=0.125 2023-06-21 01:37:44,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=848814.0, ans=0.125 2023-06-21 01:37:58,172 INFO [train.py:996] (2/4) Epoch 5, batch 19500, loss[loss=0.2757, simple_loss=0.3498, pruned_loss=0.1008, over 21562.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3152, pruned_loss=0.08767, over 4287146.06 frames. ], batch size: 389, lr: 6.14e-03, grad_scale: 16.0 2023-06-21 01:38:44,702 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 2.810e+02 3.330e+02 3.940e+02 7.380e+02, threshold=6.661e+02, percent-clipped=6.0 2023-06-21 01:39:24,944 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5 2023-06-21 01:39:30,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=849114.0, ans=10.0 2023-06-21 01:39:36,268 INFO [train.py:996] (2/4) Epoch 5, batch 19550, loss[loss=0.2319, simple_loss=0.3247, pruned_loss=0.06952, over 21158.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3098, pruned_loss=0.0852, over 4289782.26 frames. 
], batch size: 548, lr: 6.13e-03, grad_scale: 16.0 2023-06-21 01:40:55,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=849354.0, ans=0.125 2023-06-21 01:41:05,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=849414.0, ans=0.0 2023-06-21 01:41:19,074 INFO [train.py:996] (2/4) Epoch 5, batch 19600, loss[loss=0.2507, simple_loss=0.3123, pruned_loss=0.0946, over 21818.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.312, pruned_loss=0.08719, over 4280749.69 frames. ], batch size: 298, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:42:00,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 3.061e+02 3.495e+02 4.046e+02 6.477e+02, threshold=6.990e+02, percent-clipped=0.0 2023-06-21 01:42:18,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=849654.0, ans=0.95 2023-06-21 01:42:18,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=849654.0, ans=0.0 2023-06-21 01:42:57,961 INFO [train.py:996] (2/4) Epoch 5, batch 19650, loss[loss=0.2469, simple_loss=0.3092, pruned_loss=0.09235, over 21496.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3178, pruned_loss=0.091, over 4281781.21 frames. ], batch size: 194, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:43:32,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=849834.0, ans=0.125 2023-06-21 01:44:05,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=849954.0, ans=0.95 2023-06-21 01:44:37,922 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.75 vs. limit=22.5 2023-06-21 01:44:46,874 INFO [train.py:996] (2/4) Epoch 5, batch 19700, loss[loss=0.2052, simple_loss=0.2633, pruned_loss=0.07354, over 21131.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3214, pruned_loss=0.09196, over 4280999.87 frames. ], batch size: 143, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:45:18,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=850134.0, ans=0.125 2023-06-21 01:45:30,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=850194.0, ans=15.0 2023-06-21 01:45:34,556 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 2.950e+02 3.404e+02 4.157e+02 1.102e+03, threshold=6.808e+02, percent-clipped=4.0 2023-06-21 01:45:55,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=850254.0, ans=0.125 2023-06-21 01:46:16,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=850314.0, ans=0.0 2023-06-21 01:46:27,937 INFO [train.py:996] (2/4) Epoch 5, batch 19750, loss[loss=0.3397, simple_loss=0.4015, pruned_loss=0.1389, over 21649.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3304, pruned_loss=0.0926, over 4274816.29 frames. 
], batch size: 507, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:46:44,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=850374.0, ans=0.125 2023-06-21 01:46:49,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=850434.0, ans=0.125 2023-06-21 01:47:10,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=850494.0, ans=0.1 2023-06-21 01:48:06,776 INFO [train.py:996] (2/4) Epoch 5, batch 19800, loss[loss=0.2825, simple_loss=0.3736, pruned_loss=0.09572, over 19944.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3299, pruned_loss=0.09309, over 4272731.89 frames. ], batch size: 702, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:48:54,009 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 3.157e+02 4.058e+02 5.975e+02 1.111e+03, threshold=8.116e+02, percent-clipped=16.0 2023-06-21 01:49:39,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=12.0 2023-06-21 01:49:52,584 INFO [train.py:996] (2/4) Epoch 5, batch 19850, loss[loss=0.2098, simple_loss=0.2866, pruned_loss=0.06654, over 21251.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.321, pruned_loss=0.08703, over 4274103.63 frames. ], batch size: 176, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:51:00,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=851154.0, ans=0.125 2023-06-21 01:51:29,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-21 01:51:32,075 INFO [train.py:996] (2/4) Epoch 5, batch 19900, loss[loss=0.2302, simple_loss=0.2931, pruned_loss=0.08367, over 21843.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3217, pruned_loss=0.08506, over 4274026.96 frames. ], batch size: 107, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:51:33,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=851274.0, ans=0.5 2023-06-21 01:51:53,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=851334.0, ans=0.0 2023-06-21 01:51:55,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=851334.0, ans=0.125 2023-06-21 01:51:56,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=851334.0, ans=0.125 2023-06-21 01:52:04,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.54 vs. limit=15.0 2023-06-21 01:52:19,478 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.788e+02 2.633e+02 2.856e+02 3.289e+02 5.435e+02, threshold=5.712e+02, percent-clipped=0.0 2023-06-21 01:52:23,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. 
limit=15.0 2023-06-21 01:52:27,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=851394.0, ans=0.125 2023-06-21 01:53:08,186 INFO [train.py:996] (2/4) Epoch 5, batch 19950, loss[loss=0.2322, simple_loss=0.2977, pruned_loss=0.08335, over 21702.00 frames. ], tot_loss[loss=0.243, simple_loss=0.316, pruned_loss=0.08505, over 4275971.23 frames. ], batch size: 112, lr: 6.13e-03, grad_scale: 32.0 2023-06-21 01:54:05,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-21 01:54:10,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-21 01:54:15,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=851754.0, ans=0.0 2023-06-21 01:54:22,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-21 01:54:46,770 INFO [train.py:996] (2/4) Epoch 5, batch 20000, loss[loss=0.2415, simple_loss=0.3063, pruned_loss=0.08836, over 21458.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3189, pruned_loss=0.08621, over 4279249.87 frames. ], batch size: 131, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:55:07,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=851934.0, ans=0.0 2023-06-21 01:55:38,164 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.174e+02 3.672e+02 4.869e+02 7.405e+02, threshold=7.343e+02, percent-clipped=12.0 2023-06-21 01:56:25,460 INFO [train.py:996] (2/4) Epoch 5, batch 20050, loss[loss=0.2862, simple_loss=0.3394, pruned_loss=0.1165, over 21562.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3197, pruned_loss=0.08856, over 4282987.27 frames. ], batch size: 548, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:58:06,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=852414.0, ans=0.2 2023-06-21 01:58:10,723 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-21 01:58:11,258 INFO [train.py:996] (2/4) Epoch 5, batch 20100, loss[loss=0.3267, simple_loss=0.4038, pruned_loss=0.1248, over 21681.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3229, pruned_loss=0.09151, over 4289966.99 frames. ], batch size: 389, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 01:58:58,327 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.283e+02 2.825e+02 3.167e+02 3.914e+02 6.858e+02, threshold=6.334e+02, percent-clipped=0.0 2023-06-21 01:59:03,533 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 01:59:51,882 INFO [train.py:996] (2/4) Epoch 5, batch 20150, loss[loss=0.3195, simple_loss=0.3802, pruned_loss=0.1294, over 21758.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3331, pruned_loss=0.0954, over 4288799.14 frames. 
], batch size: 441, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 02:00:09,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=852774.0, ans=0.0 2023-06-21 02:00:14,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=852774.0, ans=0.1 2023-06-21 02:00:17,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=852834.0, ans=0.125 2023-06-21 02:00:19,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=852834.0, ans=0.1 2023-06-21 02:00:29,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=852834.0, ans=0.2 2023-06-21 02:00:35,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=852894.0, ans=10.0 2023-06-21 02:01:14,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=852954.0, ans=0.2 2023-06-21 02:01:17,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=853014.0, ans=0.125 2023-06-21 02:01:18,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=853014.0, ans=0.1 2023-06-21 02:01:38,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=853014.0, ans=0.125 2023-06-21 02:01:44,374 INFO [train.py:996] (2/4) Epoch 5, batch 20200, loss[loss=0.2898, simple_loss=0.3917, pruned_loss=0.09392, over 21238.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3373, pruned_loss=0.09793, over 4282887.98 frames. ], batch size: 549, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:01:59,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=853134.0, ans=0.125 2023-06-21 02:02:01,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=853134.0, ans=0.125 2023-06-21 02:02:22,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=853194.0, ans=0.125 2023-06-21 02:02:23,233 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-21 02:02:28,622 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.103e+02 3.828e+02 4.837e+02 8.948e+02, threshold=7.656e+02, percent-clipped=6.0 2023-06-21 02:02:48,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=853254.0, ans=0.07 2023-06-21 02:03:24,909 INFO [train.py:996] (2/4) Epoch 5, batch 20250, loss[loss=0.2394, simple_loss=0.3132, pruned_loss=0.08276, over 21303.00 frames. ], tot_loss[loss=0.2652, simple_loss=0.3385, pruned_loss=0.09594, over 4283405.34 frames. 
], batch size: 159, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:03:57,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=853434.0, ans=0.95 2023-06-21 02:04:56,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=853614.0, ans=0.125 2023-06-21 02:05:04,420 INFO [train.py:996] (2/4) Epoch 5, batch 20300, loss[loss=0.2141, simple_loss=0.2862, pruned_loss=0.07102, over 21872.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3367, pruned_loss=0.09294, over 4265991.74 frames. ], batch size: 98, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:05:51,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-21 02:05:51,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.734e+02 3.068e+02 3.713e+02 6.256e+02, threshold=6.135e+02, percent-clipped=0.0 2023-06-21 02:06:03,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.69 vs. limit=15.0 2023-06-21 02:06:10,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=853854.0, ans=0.125 2023-06-21 02:06:10,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=853854.0, ans=0.2 2023-06-21 02:06:23,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.28 vs. limit=12.0 2023-06-21 02:06:41,902 INFO [train.py:996] (2/4) Epoch 5, batch 20350, loss[loss=0.303, simple_loss=0.3614, pruned_loss=0.1223, over 21688.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3359, pruned_loss=0.09285, over 4263654.04 frames. ], batch size: 389, lr: 6.12e-03, grad_scale: 16.0 2023-06-21 02:06:55,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=853974.0, ans=0.0 2023-06-21 02:07:51,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=854154.0, ans=0.04949747468305833 2023-06-21 02:08:06,905 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:08:16,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=854274.0, ans=0.125 2023-06-21 02:08:17,288 INFO [train.py:996] (2/4) Epoch 5, batch 20400, loss[loss=0.2963, simple_loss=0.3485, pruned_loss=0.122, over 21168.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3386, pruned_loss=0.09652, over 4255332.74 frames. 
], batch size: 143, lr: 6.12e-03, grad_scale: 32.0 2023-06-21 02:08:43,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=854334.0, ans=0.125 2023-06-21 02:09:05,784 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 3.145e+02 3.695e+02 4.616e+02 6.973e+02, threshold=7.390e+02, percent-clipped=6.0 2023-06-21 02:09:21,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=854454.0, ans=0.125 2023-06-21 02:09:30,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=12.0 2023-06-21 02:09:56,725 INFO [train.py:996] (2/4) Epoch 5, batch 20450, loss[loss=0.2478, simple_loss=0.314, pruned_loss=0.09081, over 21681.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.339, pruned_loss=0.09844, over 4256255.55 frames. ], batch size: 263, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:10:22,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=854634.0, ans=0.035 2023-06-21 02:11:34,432 INFO [train.py:996] (2/4) Epoch 5, batch 20500, loss[loss=0.2614, simple_loss=0.3235, pruned_loss=0.09966, over 21467.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3344, pruned_loss=0.09861, over 4255806.16 frames. ], batch size: 131, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:11:36,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=854874.0, ans=0.125 2023-06-21 02:11:48,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=854934.0, ans=0.125 2023-06-21 02:12:00,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=854934.0, ans=0.125 2023-06-21 02:12:11,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=854994.0, ans=0.0 2023-06-21 02:12:17,448 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.857e+02 3.263e+02 3.907e+02 6.416e+02, threshold=6.525e+02, percent-clipped=0.0 2023-06-21 02:12:35,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=855054.0, ans=0.125 2023-06-21 02:12:37,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=855054.0, ans=0.0 2023-06-21 02:12:42,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-21 02:13:09,823 INFO [train.py:996] (2/4) Epoch 5, batch 20550, loss[loss=0.3096, simple_loss=0.3631, pruned_loss=0.1281, over 21383.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3265, pruned_loss=0.09681, over 4239446.39 frames. 
], batch size: 508, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:13:16,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=855174.0, ans=0.0 2023-06-21 02:13:32,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=855234.0, ans=0.125 2023-06-21 02:13:41,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=855234.0, ans=0.125 2023-06-21 02:13:42,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=855234.0, ans=0.125 2023-06-21 02:13:44,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=855234.0, ans=0.125 2023-06-21 02:13:56,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-06-21 02:14:23,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-21 02:14:32,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=855414.0, ans=0.1 2023-06-21 02:14:46,296 INFO [train.py:996] (2/4) Epoch 5, batch 20600, loss[loss=0.2815, simple_loss=0.3432, pruned_loss=0.1099, over 21764.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3288, pruned_loss=0.09425, over 4240064.75 frames. ], batch size: 112, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:14:54,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=855474.0, ans=0.125 2023-06-21 02:14:54,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=855474.0, ans=0.125 2023-06-21 02:15:12,609 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-21 02:15:26,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=855594.0, ans=0.2 2023-06-21 02:15:35,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.303e+02 2.804e+02 3.280e+02 3.757e+02 7.089e+02, threshold=6.559e+02, percent-clipped=1.0 2023-06-21 02:15:41,912 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-21 02:16:07,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=855654.0, ans=0.125 2023-06-21 02:16:17,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=855714.0, ans=0.125 2023-06-21 02:16:26,183 INFO [train.py:996] (2/4) Epoch 5, batch 20650, loss[loss=0.2102, simple_loss=0.2585, pruned_loss=0.08097, over 20879.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3246, pruned_loss=0.09528, over 4246393.20 frames. 
], batch size: 608, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:16:48,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=855834.0, ans=0.0 2023-06-21 02:17:06,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=855894.0, ans=0.125 2023-06-21 02:17:32,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=855954.0, ans=0.125 2023-06-21 02:18:06,959 INFO [train.py:996] (2/4) Epoch 5, batch 20700, loss[loss=0.2085, simple_loss=0.2829, pruned_loss=0.06709, over 21421.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3181, pruned_loss=0.09195, over 4245544.64 frames. ], batch size: 211, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:18:44,230 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:18:57,428 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.852e+02 2.666e+02 3.123e+02 3.714e+02 6.425e+02, threshold=6.247e+02, percent-clipped=0.0 2023-06-21 02:19:41,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=856314.0, ans=0.125 2023-06-21 02:19:53,123 INFO [train.py:996] (2/4) Epoch 5, batch 20750, loss[loss=0.3931, simple_loss=0.4649, pruned_loss=0.1607, over 21488.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3227, pruned_loss=0.09141, over 4247971.47 frames. ], batch size: 507, lr: 6.11e-03, grad_scale: 16.0 2023-06-21 02:19:58,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=856374.0, ans=0.02 2023-06-21 02:19:58,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=856374.0, ans=0.125 2023-06-21 02:20:17,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=856434.0, ans=0.125 2023-06-21 02:20:29,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=856494.0, ans=0.2 2023-06-21 02:21:03,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=856554.0, ans=0.0 2023-06-21 02:21:29,575 INFO [train.py:996] (2/4) Epoch 5, batch 20800, loss[loss=0.2186, simple_loss=0.281, pruned_loss=0.0781, over 21238.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3246, pruned_loss=0.09207, over 4253438.65 frames. ], batch size: 549, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:21:53,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=856734.0, ans=0.125 2023-06-21 02:22:03,518 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.91 vs. limit=5.0 2023-06-21 02:22:25,033 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.108e+02 3.901e+02 5.599e+02 9.709e+02, threshold=7.803e+02, percent-clipped=19.0 2023-06-21 02:22:29,969 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. 
limit=15.0 2023-06-21 02:22:40,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=856854.0, ans=0.05 2023-06-21 02:23:14,846 INFO [train.py:996] (2/4) Epoch 5, batch 20850, loss[loss=0.2205, simple_loss=0.288, pruned_loss=0.07651, over 21794.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3162, pruned_loss=0.08965, over 4251067.85 frames. ], batch size: 282, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:24:00,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=857094.0, ans=0.125 2023-06-21 02:24:55,588 INFO [train.py:996] (2/4) Epoch 5, batch 20900, loss[loss=0.3186, simple_loss=0.3794, pruned_loss=0.1289, over 21646.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3182, pruned_loss=0.09115, over 4259364.03 frames. ], batch size: 509, lr: 6.11e-03, grad_scale: 32.0 2023-06-21 02:24:59,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=857274.0, ans=0.0 2023-06-21 02:25:16,142 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:25:29,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.42 vs. limit=10.0 2023-06-21 02:25:44,709 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.849e+02 3.550e+02 4.829e+02 8.716e+02, threshold=7.101e+02, percent-clipped=1.0 2023-06-21 02:25:45,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=857394.0, ans=0.0 2023-06-21 02:25:45,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=857394.0, ans=0.1 2023-06-21 02:25:49,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=857394.0, ans=0.025 2023-06-21 02:26:10,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=857514.0, ans=0.0 2023-06-21 02:26:16,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=857514.0, ans=0.125 2023-06-21 02:26:24,236 INFO [train.py:996] (2/4) Epoch 5, batch 20950, loss[loss=0.2093, simple_loss=0.283, pruned_loss=0.06784, over 21810.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3128, pruned_loss=0.08645, over 4252482.76 frames. ], batch size: 316, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:27:57,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=857814.0, ans=0.0 2023-06-21 02:28:02,757 INFO [train.py:996] (2/4) Epoch 5, batch 21000, loss[loss=0.2141, simple_loss=0.2887, pruned_loss=0.06974, over 21084.00 frames. ], tot_loss[loss=0.242, simple_loss=0.311, pruned_loss=0.08652, over 4256540.94 frames. 
], batch size: 608, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:28:02,758 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 02:28:15,856 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.1550, 1.3709, 2.0128, 1.7183, 1.3297, 2.0795, 2.0392, 0.9831], device='cuda:2') 2023-06-21 02:28:23,293 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2707, simple_loss=0.3706, pruned_loss=0.0854, over 1796401.00 frames. 2023-06-21 02:28:23,293 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 02:28:43,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=857934.0, ans=0.025 2023-06-21 02:29:06,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=857994.0, ans=0.1 2023-06-21 02:29:08,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.747e+02 2.446e+02 2.941e+02 3.372e+02 5.847e+02, threshold=5.881e+02, percent-clipped=0.0 2023-06-21 02:29:26,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.79 vs. limit=12.0 2023-06-21 02:29:40,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=858114.0, ans=0.125 2023-06-21 02:29:52,545 INFO [train.py:996] (2/4) Epoch 5, batch 21050, loss[loss=0.2504, simple_loss=0.3174, pruned_loss=0.09175, over 21653.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3094, pruned_loss=0.08753, over 4256923.03 frames. ], batch size: 282, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:30:04,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=858174.0, ans=0.0 2023-06-21 02:30:53,341 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.10 vs. limit=12.0 2023-06-21 02:30:55,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=858354.0, ans=0.0 2023-06-21 02:31:31,898 INFO [train.py:996] (2/4) Epoch 5, batch 21100, loss[loss=0.2292, simple_loss=0.2921, pruned_loss=0.0832, over 21583.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3062, pruned_loss=0.08761, over 4264567.98 frames. ], batch size: 332, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:31:35,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.39 vs. limit=15.0 2023-06-21 02:31:37,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=858474.0, ans=0.0 2023-06-21 02:31:58,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=858534.0, ans=0.125 2023-06-21 02:32:09,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.23 vs. 
limit=15.0 2023-06-21 02:32:23,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.002e+02 2.646e+02 3.104e+02 3.741e+02 7.727e+02, threshold=6.208e+02, percent-clipped=4.0 2023-06-21 02:32:42,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=858654.0, ans=0.125 2023-06-21 02:33:10,727 INFO [train.py:996] (2/4) Epoch 5, batch 21150, loss[loss=0.2125, simple_loss=0.2742, pruned_loss=0.07539, over 21322.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3036, pruned_loss=0.08837, over 4266640.75 frames. ], batch size: 131, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:33:22,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=858774.0, ans=0.125 2023-06-21 02:33:58,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=858894.0, ans=0.1 2023-06-21 02:34:10,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=858954.0, ans=0.0 2023-06-21 02:34:49,192 INFO [train.py:996] (2/4) Epoch 5, batch 21200, loss[loss=0.2201, simple_loss=0.287, pruned_loss=0.07656, over 21833.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.2996, pruned_loss=0.08707, over 4254944.74 frames. ], batch size: 318, lr: 6.10e-03, grad_scale: 32.0 2023-06-21 02:35:03,043 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.55 vs. limit=22.5 2023-06-21 02:35:04,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=859134.0, ans=0.0 2023-06-21 02:35:05,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=859134.0, ans=0.0 2023-06-21 02:35:42,264 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.049e+02 2.556e+02 2.983e+02 3.477e+02 7.677e+02, threshold=5.965e+02, percent-clipped=1.0 2023-06-21 02:36:20,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=859314.0, ans=0.025 2023-06-21 02:36:30,232 INFO [train.py:996] (2/4) Epoch 5, batch 21250, loss[loss=0.2581, simple_loss=0.3304, pruned_loss=0.09294, over 21679.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.2985, pruned_loss=0.08721, over 4257988.08 frames. ], batch size: 247, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:37:39,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=859554.0, ans=0.0 2023-06-21 02:37:39,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=859554.0, ans=0.2 2023-06-21 02:38:09,633 INFO [train.py:996] (2/4) Epoch 5, batch 21300, loss[loss=0.2614, simple_loss=0.3319, pruned_loss=0.09548, over 21813.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3058, pruned_loss=0.08919, over 4258305.86 frames. 
], batch size: 298, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:38:29,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=859734.0, ans=0.125 2023-06-21 02:38:34,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=859734.0, ans=0.125 2023-06-21 02:39:02,886 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 2.969e+02 3.329e+02 4.486e+02 8.975e+02, threshold=6.657e+02, percent-clipped=6.0 2023-06-21 02:39:03,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=859794.0, ans=0.2 2023-06-21 02:39:05,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-21 02:39:18,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=859854.0, ans=0.125 2023-06-21 02:39:44,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. limit=12.0 2023-06-21 02:39:50,037 INFO [train.py:996] (2/4) Epoch 5, batch 21350, loss[loss=0.2637, simple_loss=0.3244, pruned_loss=0.1015, over 21685.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.31, pruned_loss=0.08909, over 4262096.12 frames. ], batch size: 263, lr: 6.10e-03, grad_scale: 16.0 2023-06-21 02:40:10,137 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-21 02:40:50,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=860154.0, ans=0.125 2023-06-21 02:41:13,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.59 vs. limit=15.0 2023-06-21 02:41:29,870 INFO [train.py:996] (2/4) Epoch 5, batch 21400, loss[loss=0.2561, simple_loss=0.3299, pruned_loss=0.09111, over 21756.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3114, pruned_loss=0.0877, over 4256935.66 frames. ], batch size: 332, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:41:31,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=860274.0, ans=0.0 2023-06-21 02:41:40,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. 
limit=15.0 2023-06-21 02:42:15,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=860394.0, ans=0.125 2023-06-21 02:42:22,708 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 2.768e+02 3.163e+02 3.686e+02 6.049e+02, threshold=6.326e+02, percent-clipped=0.0 2023-06-21 02:42:29,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=860454.0, ans=0.0 2023-06-21 02:42:30,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=860454.0, ans=0.125 2023-06-21 02:43:08,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=860574.0, ans=0.1 2023-06-21 02:43:09,490 INFO [train.py:996] (2/4) Epoch 5, batch 21450, loss[loss=0.2226, simple_loss=0.2927, pruned_loss=0.07629, over 21363.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3134, pruned_loss=0.08862, over 4261051.68 frames. ], batch size: 176, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:43:46,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=860634.0, ans=0.0 2023-06-21 02:44:08,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=860754.0, ans=0.125 2023-06-21 02:44:48,128 INFO [train.py:996] (2/4) Epoch 5, batch 21500, loss[loss=0.2189, simple_loss=0.2758, pruned_loss=0.08098, over 21576.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3126, pruned_loss=0.09041, over 4261675.65 frames. ], batch size: 247, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:44:55,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=860874.0, ans=10.0 2023-06-21 02:45:07,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.60 vs. limit=10.0 2023-06-21 02:45:22,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=860934.0, ans=0.125 2023-06-21 02:45:27,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=860934.0, ans=0.2 2023-06-21 02:45:31,753 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:45:37,007 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.06 vs. 
limit=10.0 2023-06-21 02:45:40,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.027e+02 3.006e+02 3.483e+02 4.242e+02 6.315e+02, threshold=6.966e+02, percent-clipped=0.0 2023-06-21 02:45:42,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=860994.0, ans=0.0 2023-06-21 02:45:53,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=861054.0, ans=0.125 2023-06-21 02:46:10,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=861114.0, ans=0.125 2023-06-21 02:46:15,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=861114.0, ans=0.0 2023-06-21 02:46:26,543 INFO [train.py:996] (2/4) Epoch 5, batch 21550, loss[loss=0.1955, simple_loss=0.2612, pruned_loss=0.06489, over 21648.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.307, pruned_loss=0.08815, over 4270472.39 frames. ], batch size: 282, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:47:07,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-21 02:47:25,058 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 02:47:56,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=861414.0, ans=0.1 2023-06-21 02:48:07,452 INFO [train.py:996] (2/4) Epoch 5, batch 21600, loss[loss=0.2203, simple_loss=0.2968, pruned_loss=0.07191, over 21494.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3028, pruned_loss=0.08706, over 4269787.57 frames. ], batch size: 230, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:49:05,662 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.000e+02 2.906e+02 3.367e+02 4.103e+02 7.141e+02, threshold=6.734e+02, percent-clipped=1.0 2023-06-21 02:49:33,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=861714.0, ans=0.125 2023-06-21 02:49:46,657 INFO [train.py:996] (2/4) Epoch 5, batch 21650, loss[loss=0.2281, simple_loss=0.3097, pruned_loss=0.07324, over 21434.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3069, pruned_loss=0.08452, over 4265646.91 frames. ], batch size: 194, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:49:55,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=861774.0, ans=0.0 2023-06-21 02:50:03,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=861774.0, ans=0.0 2023-06-21 02:50:11,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=861834.0, ans=0.125 2023-06-21 02:51:02,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=861954.0, ans=0.125 2023-06-21 02:51:13,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=862014.0, ans=0.0 2023-06-21 02:51:26,197 INFO [train.py:996] (2/4) Epoch 5, batch 21700, loss[loss=0.1961, simple_loss=0.2576, pruned_loss=0.06727, over 17064.00 frames. 
], tot_loss[loss=0.2365, simple_loss=0.3073, pruned_loss=0.08281, over 4264629.49 frames. ], batch size: 67, lr: 6.09e-03, grad_scale: 32.0 2023-06-21 02:51:59,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=862134.0, ans=0.1 2023-06-21 02:52:06,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.89 vs. limit=10.0 2023-06-21 02:52:09,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=862194.0, ans=0.1 2023-06-21 02:52:22,498 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.845e+02 2.607e+02 2.964e+02 3.424e+02 5.516e+02, threshold=5.928e+02, percent-clipped=0.0 2023-06-21 02:52:33,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=862254.0, ans=0.125 2023-06-21 02:53:10,468 INFO [train.py:996] (2/4) Epoch 5, batch 21750, loss[loss=0.2353, simple_loss=0.2861, pruned_loss=0.09226, over 21526.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3041, pruned_loss=0.08382, over 4265772.89 frames. ], batch size: 442, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:53:39,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.17 vs. limit=15.0 2023-06-21 02:54:35,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=862614.0, ans=0.0 2023-06-21 02:54:40,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=862614.0, ans=0.125 2023-06-21 02:54:49,591 INFO [train.py:996] (2/4) Epoch 5, batch 21800, loss[loss=0.2718, simple_loss=0.3662, pruned_loss=0.08872, over 21750.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.304, pruned_loss=0.08507, over 4274100.72 frames. ], batch size: 351, lr: 6.09e-03, grad_scale: 16.0 2023-06-21 02:55:21,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=12.0 2023-06-21 02:55:32,533 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-21 02:55:41,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=862794.0, ans=0.1 2023-06-21 02:55:43,908 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 2.757e+02 3.112e+02 3.604e+02 5.308e+02, threshold=6.224e+02, percent-clipped=0.0 2023-06-21 02:56:29,370 INFO [train.py:996] (2/4) Epoch 5, batch 21850, loss[loss=0.246, simple_loss=0.2989, pruned_loss=0.09652, over 16567.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3081, pruned_loss=0.08537, over 4266360.06 frames. ], batch size: 60, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 02:58:08,183 INFO [train.py:996] (2/4) Epoch 5, batch 21900, loss[loss=0.2245, simple_loss=0.2773, pruned_loss=0.08586, over 21641.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3091, pruned_loss=0.08692, over 4267060.43 frames. 
], batch size: 247, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 02:58:25,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=863274.0, ans=0.0 2023-06-21 02:59:06,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 2.823e+02 3.229e+02 3.710e+02 5.018e+02, threshold=6.457e+02, percent-clipped=0.0 2023-06-21 02:59:28,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=863514.0, ans=0.0 2023-06-21 02:59:46,630 INFO [train.py:996] (2/4) Epoch 5, batch 21950, loss[loss=0.2111, simple_loss=0.2803, pruned_loss=0.07099, over 21736.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3042, pruned_loss=0.08582, over 4254596.55 frames. ], batch size: 351, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 02:59:50,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=863574.0, ans=0.125 2023-06-21 03:00:18,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=863634.0, ans=0.125 2023-06-21 03:00:56,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=863754.0, ans=0.125 2023-06-21 03:01:27,641 INFO [train.py:996] (2/4) Epoch 5, batch 22000, loss[loss=0.2124, simple_loss=0.274, pruned_loss=0.07538, over 21609.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.298, pruned_loss=0.08263, over 4256540.02 frames. ], batch size: 247, lr: 6.08e-03, grad_scale: 32.0 2023-06-21 03:01:54,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=863934.0, ans=0.1 2023-06-21 03:02:00,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=863934.0, ans=0.0 2023-06-21 03:02:16,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-21 03:02:29,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.909e+02 2.512e+02 2.928e+02 3.420e+02 5.826e+02, threshold=5.856e+02, percent-clipped=0.0 2023-06-21 03:02:51,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-21 03:03:08,619 INFO [train.py:996] (2/4) Epoch 5, batch 22050, loss[loss=0.2935, simple_loss=0.3684, pruned_loss=0.1092, over 21740.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3034, pruned_loss=0.08335, over 4258231.92 frames. ], batch size: 351, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:03:21,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=864174.0, ans=0.125 2023-06-21 03:03:35,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=864234.0, ans=0.1 2023-06-21 03:04:39,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=864414.0, ans=0.2 2023-06-21 03:04:52,781 INFO [train.py:996] (2/4) Epoch 5, batch 22100, loss[loss=0.262, simple_loss=0.3218, pruned_loss=0.101, over 21710.00 frames. 
], tot_loss[loss=0.2448, simple_loss=0.3131, pruned_loss=0.08823, over 4252148.93 frames. ], batch size: 230, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:04:55,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=22.5 2023-06-21 03:05:48,630 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.231e+02 3.305e+02 3.693e+02 4.258e+02 6.395e+02, threshold=7.386e+02, percent-clipped=3.0 2023-06-21 03:05:53,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=864654.0, ans=0.125 2023-06-21 03:06:00,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=864654.0, ans=0.025 2023-06-21 03:06:31,563 INFO [train.py:996] (2/4) Epoch 5, batch 22150, loss[loss=0.2319, simple_loss=0.299, pruned_loss=0.08239, over 21902.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3181, pruned_loss=0.09043, over 4253635.64 frames. ], batch size: 98, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:07:30,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=864954.0, ans=0.09899494936611666 2023-06-21 03:07:39,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=864954.0, ans=0.0 2023-06-21 03:08:10,598 INFO [train.py:996] (2/4) Epoch 5, batch 22200, loss[loss=0.2584, simple_loss=0.3432, pruned_loss=0.08683, over 21425.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3212, pruned_loss=0.09219, over 4265785.31 frames. ], batch size: 194, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:08:50,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=865134.0, ans=0.0 2023-06-21 03:09:02,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=865194.0, ans=0.0 2023-06-21 03:09:09,744 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 3.019e+02 3.347e+02 3.956e+02 6.093e+02, threshold=6.693e+02, percent-clipped=0.0 2023-06-21 03:09:16,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=865254.0, ans=0.125 2023-06-21 03:09:57,880 INFO [train.py:996] (2/4) Epoch 5, batch 22250, loss[loss=0.2577, simple_loss=0.3144, pruned_loss=0.1005, over 21427.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3267, pruned_loss=0.09424, over 4269891.53 frames. ], batch size: 211, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:10:02,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=865374.0, ans=0.015 2023-06-21 03:10:04,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=865374.0, ans=0.2 2023-06-21 03:10:19,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=865434.0, ans=0.1 2023-06-21 03:10:45,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=865494.0, ans=0.125 2023-06-21 03:10:47,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.03 vs. 
limit=22.5 2023-06-21 03:11:02,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=865554.0, ans=0.0 2023-06-21 03:11:32,239 INFO [train.py:996] (2/4) Epoch 5, batch 22300, loss[loss=0.2585, simple_loss=0.3192, pruned_loss=0.09888, over 21216.00 frames. ], tot_loss[loss=0.2609, simple_loss=0.3293, pruned_loss=0.09624, over 4267461.58 frames. ], batch size: 143, lr: 6.08e-03, grad_scale: 16.0 2023-06-21 03:11:40,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=865674.0, ans=0.1 2023-06-21 03:12:03,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=865734.0, ans=0.015 2023-06-21 03:12:21,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=865794.0, ans=0.0 2023-06-21 03:12:27,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.377e+02 3.190e+02 3.753e+02 5.122e+02 1.002e+03, threshold=7.506e+02, percent-clipped=11.0 2023-06-21 03:12:32,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=865854.0, ans=0.125 2023-06-21 03:13:14,458 INFO [train.py:996] (2/4) Epoch 5, batch 22350, loss[loss=0.2225, simple_loss=0.2853, pruned_loss=0.07979, over 21229.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3267, pruned_loss=0.09645, over 4279501.78 frames. ], batch size: 608, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:13:15,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=865974.0, ans=0.2 2023-06-21 03:13:26,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=865974.0, ans=0.0 2023-06-21 03:13:52,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=866094.0, ans=0.0 2023-06-21 03:13:52,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=866094.0, ans=0.125 2023-06-21 03:14:05,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=866094.0, ans=0.0 2023-06-21 03:14:45,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.06 vs. limit=12.0 2023-06-21 03:14:51,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=866214.0, ans=0.125 2023-06-21 03:14:53,829 INFO [train.py:996] (2/4) Epoch 5, batch 22400, loss[loss=0.2759, simple_loss=0.3293, pruned_loss=0.1113, over 21628.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3236, pruned_loss=0.09314, over 4284424.93 frames. ], batch size: 332, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:15:21,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=866334.0, ans=0.2 2023-06-21 03:15:45,054 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.706e+02 3.089e+02 3.768e+02 7.797e+02, threshold=6.178e+02, percent-clipped=1.0 2023-06-21 03:16:32,972 INFO [train.py:996] (2/4) Epoch 5, batch 22450, loss[loss=0.3199, simple_loss=0.3706, pruned_loss=0.1346, over 20059.00 frames. 
], tot_loss[loss=0.2509, simple_loss=0.3181, pruned_loss=0.09181, over 4267857.46 frames. ], batch size: 703, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:17:49,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=866754.0, ans=0.0 2023-06-21 03:18:14,435 INFO [train.py:996] (2/4) Epoch 5, batch 22500, loss[loss=0.2975, simple_loss=0.3526, pruned_loss=0.1212, over 21364.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3126, pruned_loss=0.09119, over 4264926.89 frames. ], batch size: 507, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:18:24,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=866874.0, ans=0.125 2023-06-21 03:18:51,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=866994.0, ans=0.0 2023-06-21 03:19:05,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.870e+02 3.254e+02 4.012e+02 8.224e+02, threshold=6.508e+02, percent-clipped=4.0 2023-06-21 03:19:16,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.83 vs. limit=22.5 2023-06-21 03:19:53,754 INFO [train.py:996] (2/4) Epoch 5, batch 22550, loss[loss=0.2551, simple_loss=0.315, pruned_loss=0.09757, over 21487.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.315, pruned_loss=0.09133, over 4267940.01 frames. ], batch size: 194, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:19:59,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=867174.0, ans=0.125 2023-06-21 03:20:05,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=867174.0, ans=0.125 2023-06-21 03:20:29,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=867234.0, ans=10.0 2023-06-21 03:20:31,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=867294.0, ans=0.0 2023-06-21 03:20:36,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=867294.0, ans=0.2 2023-06-21 03:20:46,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=867294.0, ans=0.0 2023-06-21 03:21:18,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=867414.0, ans=0.125 2023-06-21 03:21:21,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=867414.0, ans=0.2 2023-06-21 03:21:40,221 INFO [train.py:996] (2/4) Epoch 5, batch 22600, loss[loss=0.2019, simple_loss=0.2713, pruned_loss=0.0662, over 21370.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3168, pruned_loss=0.09136, over 4275717.56 frames. ], batch size: 194, lr: 6.07e-03, grad_scale: 32.0 2023-06-21 03:21:53,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-21 03:21:54,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.85 vs. 
limit=6.0 2023-06-21 03:21:58,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=867534.0, ans=0.04949747468305833 2023-06-21 03:21:59,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=867534.0, ans=0.0 2023-06-21 03:22:40,289 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.232e+02 3.062e+02 3.535e+02 4.633e+02 8.415e+02, threshold=7.070e+02, percent-clipped=6.0 2023-06-21 03:22:44,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=15.0 2023-06-21 03:23:18,532 INFO [train.py:996] (2/4) Epoch 5, batch 22650, loss[loss=0.2864, simple_loss=0.3714, pruned_loss=0.1007, over 19921.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3141, pruned_loss=0.09088, over 4268314.87 frames. ], batch size: 702, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:23:33,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.04 vs. limit=12.0 2023-06-21 03:24:26,271 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:24:26,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=867954.0, ans=0.125 2023-06-21 03:24:57,524 INFO [train.py:996] (2/4) Epoch 5, batch 22700, loss[loss=0.2166, simple_loss=0.2713, pruned_loss=0.08091, over 21733.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3083, pruned_loss=0.09074, over 4255958.49 frames. ], batch size: 124, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:25:37,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=868194.0, ans=0.125 2023-06-21 03:25:56,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=868194.0, ans=0.125 2023-06-21 03:25:57,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-21 03:25:59,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.134e+02 2.710e+02 3.113e+02 3.866e+02 5.786e+02, threshold=6.226e+02, percent-clipped=0.0 2023-06-21 03:26:01,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=868254.0, ans=0.1 2023-06-21 03:26:22,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.68 vs. limit=15.0 2023-06-21 03:26:30,988 INFO [train.py:996] (2/4) Epoch 5, batch 22750, loss[loss=0.2877, simple_loss=0.3486, pruned_loss=0.1135, over 21747.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3094, pruned_loss=0.09249, over 4263253.43 frames. ], batch size: 332, lr: 6.07e-03, grad_scale: 16.0 2023-06-21 03:26:39,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=868374.0, ans=0.2 2023-06-21 03:26:46,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.97 vs. 
limit=15.0 2023-06-21 03:26:54,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=868434.0, ans=0.0 2023-06-21 03:27:39,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=868554.0, ans=0.0 2023-06-21 03:28:15,797 INFO [train.py:996] (2/4) Epoch 5, batch 22800, loss[loss=0.228, simple_loss=0.2921, pruned_loss=0.0819, over 21185.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3153, pruned_loss=0.09563, over 4271763.70 frames. ], batch size: 608, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:28:18,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-21 03:28:19,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=868674.0, ans=0.125 2023-06-21 03:28:32,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.88 vs. limit=10.0 2023-06-21 03:28:35,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=868734.0, ans=0.0 2023-06-21 03:28:54,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=868794.0, ans=0.2 2023-06-21 03:28:59,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=868794.0, ans=0.0 2023-06-21 03:29:17,224 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.397e+02 2.823e+02 3.345e+02 3.974e+02 6.068e+02, threshold=6.691e+02, percent-clipped=0.0 2023-06-21 03:29:28,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=868854.0, ans=0.125 2023-06-21 03:29:49,108 INFO [train.py:996] (2/4) Epoch 5, batch 22850, loss[loss=0.2361, simple_loss=0.2911, pruned_loss=0.09048, over 21813.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3113, pruned_loss=0.09443, over 4275341.20 frames. ], batch size: 118, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:29:49,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=868974.0, ans=0.125 2023-06-21 03:30:04,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=868974.0, ans=0.0 2023-06-21 03:30:45,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=869094.0, ans=0.125 2023-06-21 03:30:45,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-21 03:30:56,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=869154.0, ans=0.125 2023-06-21 03:31:12,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=869214.0, ans=0.1 2023-06-21 03:31:35,942 INFO [train.py:996] (2/4) Epoch 5, batch 22900, loss[loss=0.3125, simple_loss=0.4355, pruned_loss=0.09478, over 19803.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.313, pruned_loss=0.09306, over 4263280.85 frames. 
], batch size: 702, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:31:47,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=869274.0, ans=0.2 2023-06-21 03:32:15,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=869334.0, ans=0.125 2023-06-21 03:32:32,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.80 vs. limit=15.0 2023-06-21 03:32:39,205 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.436e+02 3.293e+02 3.917e+02 5.124e+02 7.831e+02, threshold=7.834e+02, percent-clipped=10.0 2023-06-21 03:32:44,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=869454.0, ans=0.125 2023-06-21 03:33:14,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=869574.0, ans=0.125 2023-06-21 03:33:15,496 INFO [train.py:996] (2/4) Epoch 5, batch 22950, loss[loss=0.264, simple_loss=0.3752, pruned_loss=0.07646, over 21725.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3262, pruned_loss=0.09077, over 4267296.34 frames. ], batch size: 332, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:33:18,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=869574.0, ans=0.0 2023-06-21 03:33:34,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.47 vs. limit=22.5 2023-06-21 03:33:34,876 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:34:29,606 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:34:34,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0 2023-06-21 03:34:52,770 INFO [train.py:996] (2/4) Epoch 5, batch 23000, loss[loss=0.2433, simple_loss=0.3116, pruned_loss=0.08748, over 21867.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3233, pruned_loss=0.08861, over 4265106.42 frames. ], batch size: 118, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:35:36,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=869934.0, ans=0.0 2023-06-21 03:35:53,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=869994.0, ans=0.125 2023-06-21 03:35:53,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=22.5 2023-06-21 03:35:55,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2023-06-21 03:35:56,033 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.793e+02 3.379e+02 3.965e+02 7.564e+02, threshold=6.759e+02, percent-clipped=0.0 2023-06-21 03:36:43,315 INFO [train.py:996] (2/4) Epoch 5, batch 23050, loss[loss=0.2386, simple_loss=0.3106, pruned_loss=0.08329, over 20666.00 frames. 
], tot_loss[loss=0.2549, simple_loss=0.3264, pruned_loss=0.09173, over 4268942.74 frames. ], batch size: 607, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:36:49,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=870174.0, ans=0.0 2023-06-21 03:37:03,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=22.5 2023-06-21 03:37:08,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=870234.0, ans=0.0 2023-06-21 03:38:10,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=870414.0, ans=0.0 2023-06-21 03:38:22,825 INFO [train.py:996] (2/4) Epoch 5, batch 23100, loss[loss=0.1924, simple_loss=0.2523, pruned_loss=0.06628, over 21583.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3211, pruned_loss=0.09163, over 4268770.99 frames. ], batch size: 247, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:38:37,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=870474.0, ans=0.07 2023-06-21 03:38:53,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=870534.0, ans=0.125 2023-06-21 03:39:20,135 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 2.864e+02 3.390e+02 4.261e+02 7.523e+02, threshold=6.780e+02, percent-clipped=3.0 2023-06-21 03:39:42,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=870714.0, ans=0.125 2023-06-21 03:40:00,817 INFO [train.py:996] (2/4) Epoch 5, batch 23150, loss[loss=0.2456, simple_loss=0.3155, pruned_loss=0.0879, over 21848.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.314, pruned_loss=0.09012, over 4271738.63 frames. ], batch size: 118, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:40:32,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=870834.0, ans=0.125 2023-06-21 03:41:28,444 INFO [train.py:996] (2/4) Epoch 5, batch 23200, loss[loss=0.2485, simple_loss=0.316, pruned_loss=0.09049, over 21742.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3136, pruned_loss=0.09128, over 4278906.29 frames. ], batch size: 389, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:41:35,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=871074.0, ans=0.1 2023-06-21 03:42:29,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.864e+02 3.235e+02 3.730e+02 5.431e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-21 03:42:31,055 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-21 03:42:44,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=871254.0, ans=0.125 2023-06-21 03:42:51,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=871314.0, ans=0.125 2023-06-21 03:43:11,498 INFO [train.py:996] (2/4) Epoch 5, batch 23250, loss[loss=0.2585, simple_loss=0.3224, pruned_loss=0.0973, over 21754.00 frames. 
], tot_loss[loss=0.2505, simple_loss=0.3152, pruned_loss=0.0929, over 4281808.52 frames. ], batch size: 112, lr: 6.06e-03, grad_scale: 32.0 2023-06-21 03:44:16,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=871554.0, ans=0.1 2023-06-21 03:44:57,872 INFO [train.py:996] (2/4) Epoch 5, batch 23300, loss[loss=0.313, simple_loss=0.4128, pruned_loss=0.1067, over 21668.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3244, pruned_loss=0.09503, over 4283839.97 frames. ], batch size: 389, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:45:27,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.96 vs. limit=15.0 2023-06-21 03:45:30,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=871794.0, ans=0.0 2023-06-21 03:45:52,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.979e+02 3.447e+02 3.938e+02 6.103e+02, threshold=6.894e+02, percent-clipped=0.0 2023-06-21 03:45:55,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=871854.0, ans=0.1 2023-06-21 03:46:16,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=871914.0, ans=0.1 2023-06-21 03:46:32,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=871974.0, ans=0.1 2023-06-21 03:46:32,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=871974.0, ans=0.2 2023-06-21 03:46:33,354 INFO [train.py:996] (2/4) Epoch 5, batch 23350, loss[loss=0.242, simple_loss=0.3284, pruned_loss=0.07778, over 19933.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3279, pruned_loss=0.09304, over 4274595.42 frames. ], batch size: 702, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:46:33,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=871974.0, ans=0.1 2023-06-21 03:46:44,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=871974.0, ans=0.125 2023-06-21 03:46:48,224 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:46:59,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0 2023-06-21 03:47:00,051 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-21 03:48:05,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=872214.0, ans=0.125 2023-06-21 03:48:11,197 INFO [train.py:996] (2/4) Epoch 5, batch 23400, loss[loss=0.2368, simple_loss=0.3078, pruned_loss=0.08292, over 21827.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3207, pruned_loss=0.08913, over 4276830.83 frames. 
], batch size: 282, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:48:37,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.86 vs. limit=15.0 2023-06-21 03:49:17,353 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.694e+02 2.664e+02 3.181e+02 4.182e+02 6.937e+02, threshold=6.362e+02, percent-clipped=1.0 2023-06-21 03:49:51,806 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 03:49:52,805 INFO [train.py:996] (2/4) Epoch 5, batch 23450, loss[loss=0.2886, simple_loss=0.3443, pruned_loss=0.1165, over 21951.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3223, pruned_loss=0.09224, over 4284544.70 frames. ], batch size: 316, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:50:18,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=872634.0, ans=0.1 2023-06-21 03:50:24,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=872634.0, ans=0.125 2023-06-21 03:51:30,839 INFO [train.py:996] (2/4) Epoch 5, batch 23500, loss[loss=0.2199, simple_loss=0.2801, pruned_loss=0.07983, over 21252.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3227, pruned_loss=0.09368, over 4279049.88 frames. ], batch size: 608, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:51:54,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=872934.0, ans=0.125 2023-06-21 03:52:38,546 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.401e+02 3.046e+02 3.693e+02 4.776e+02 9.117e+02, threshold=7.385e+02, percent-clipped=5.0 2023-06-21 03:52:40,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=873054.0, ans=0.125 2023-06-21 03:53:06,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=22.5 2023-06-21 03:53:08,223 INFO [train.py:996] (2/4) Epoch 5, batch 23550, loss[loss=0.2048, simple_loss=0.2675, pruned_loss=0.07102, over 21687.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.319, pruned_loss=0.09322, over 4273003.03 frames. ], batch size: 333, lr: 6.05e-03, grad_scale: 16.0 2023-06-21 03:53:10,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=873174.0, ans=0.0 2023-06-21 03:53:35,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=873234.0, ans=0.1 2023-06-21 03:54:23,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=873354.0, ans=0.125 2023-06-21 03:54:46,881 INFO [train.py:996] (2/4) Epoch 5, batch 23600, loss[loss=0.2512, simple_loss=0.3233, pruned_loss=0.0895, over 21746.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3186, pruned_loss=0.09295, over 4276497.75 frames. 
], batch size: 332, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:54:47,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=873474.0, ans=0.125 2023-06-21 03:55:08,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=873534.0, ans=0.0 2023-06-21 03:55:08,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=873534.0, ans=0.1 2023-06-21 03:55:52,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=873654.0, ans=0.125 2023-06-21 03:55:53,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.72 vs. limit=22.5 2023-06-21 03:55:55,540 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.963e+02 2.629e+02 3.088e+02 3.713e+02 7.100e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-21 03:56:00,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=873654.0, ans=0.125 2023-06-21 03:56:06,181 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-21 03:56:08,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=873714.0, ans=0.125 2023-06-21 03:56:32,184 INFO [train.py:996] (2/4) Epoch 5, batch 23650, loss[loss=0.2235, simple_loss=0.3147, pruned_loss=0.06618, over 21264.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3188, pruned_loss=0.09108, over 4281168.33 frames. ], batch size: 548, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:57:02,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=873834.0, ans=0.0 2023-06-21 03:57:28,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=22.5 2023-06-21 03:57:29,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=873894.0, ans=0.125 2023-06-21 03:57:31,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. limit=6.0 2023-06-21 03:58:04,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=874014.0, ans=0.125 2023-06-21 03:58:13,472 INFO [train.py:996] (2/4) Epoch 5, batch 23700, loss[loss=0.2171, simple_loss=0.3118, pruned_loss=0.06119, over 20696.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3219, pruned_loss=0.09039, over 4282603.81 frames. ], batch size: 607, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 03:59:12,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.57 vs. 
limit=15.0 2023-06-21 03:59:17,503 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.032e+02 3.536e+02 4.190e+02 7.050e+02, threshold=7.071e+02, percent-clipped=3.0 2023-06-21 03:59:32,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=874254.0, ans=0.1 2023-06-21 03:59:52,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=874374.0, ans=0.2 2023-06-21 03:59:53,570 INFO [train.py:996] (2/4) Epoch 5, batch 23750, loss[loss=0.2175, simple_loss=0.3138, pruned_loss=0.06059, over 21636.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3242, pruned_loss=0.09137, over 4276932.08 frames. ], batch size: 263, lr: 6.05e-03, grad_scale: 32.0 2023-06-21 04:00:12,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=874374.0, ans=0.0 2023-06-21 04:00:23,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=874434.0, ans=0.2 2023-06-21 04:00:24,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=874434.0, ans=0.5 2023-06-21 04:00:40,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=874494.0, ans=0.125 2023-06-21 04:00:52,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.54 vs. limit=22.5 2023-06-21 04:00:55,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=874554.0, ans=0.125 2023-06-21 04:01:03,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=874554.0, ans=0.0 2023-06-21 04:01:06,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=874554.0, ans=0.125 2023-06-21 04:01:36,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=874614.0, ans=0.125 2023-06-21 04:01:38,917 INFO [train.py:996] (2/4) Epoch 5, batch 23800, loss[loss=0.2589, simple_loss=0.327, pruned_loss=0.09537, over 21253.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3221, pruned_loss=0.08924, over 4270220.78 frames. ], batch size: 159, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:01:55,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-21 04:02:43,708 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.837e+02 2.704e+02 3.208e+02 4.045e+02 9.409e+02, threshold=6.416e+02, percent-clipped=3.0 2023-06-21 04:03:29,776 INFO [train.py:996] (2/4) Epoch 5, batch 23850, loss[loss=0.2904, simple_loss=0.3519, pruned_loss=0.1144, over 21361.00 frames. ], tot_loss[loss=0.258, simple_loss=0.3317, pruned_loss=0.09213, over 4275429.19 frames. 
], batch size: 176, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:03:40,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=874974.0, ans=0.125 2023-06-21 04:04:04,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=875034.0, ans=0.04949747468305833 2023-06-21 04:04:21,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=875154.0, ans=0.1 2023-06-21 04:04:25,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=875154.0, ans=0.125 2023-06-21 04:05:04,030 INFO [train.py:996] (2/4) Epoch 5, batch 23900, loss[loss=0.2627, simple_loss=0.3363, pruned_loss=0.09451, over 21986.00 frames. ], tot_loss[loss=0.265, simple_loss=0.3395, pruned_loss=0.09528, over 4278418.70 frames. ], batch size: 103, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:06:02,872 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 3.113e+02 3.559e+02 4.462e+02 8.067e+02, threshold=7.118e+02, percent-clipped=8.0 2023-06-21 04:06:41,998 INFO [train.py:996] (2/4) Epoch 5, batch 23950, loss[loss=0.2576, simple_loss=0.3098, pruned_loss=0.1027, over 21699.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3322, pruned_loss=0.09497, over 4270719.32 frames. ], batch size: 247, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:07:08,491 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:07:09,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=875634.0, ans=0.07 2023-06-21 04:07:27,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=875694.0, ans=0.125 2023-06-21 04:07:37,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-21 04:08:20,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=875874.0, ans=0.0 2023-06-21 04:08:21,285 INFO [train.py:996] (2/4) Epoch 5, batch 24000, loss[loss=0.2986, simple_loss=0.3621, pruned_loss=0.1176, over 21838.00 frames. ], tot_loss[loss=0.2659, simple_loss=0.3346, pruned_loss=0.09856, over 4266578.45 frames. ], batch size: 441, lr: 6.04e-03, grad_scale: 32.0 2023-06-21 04:08:21,286 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 04:08:35,586 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.6751, 2.9091, 2.8744, 1.6153], device='cuda:2') 2023-06-21 04:08:38,096 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2683, simple_loss=0.3693, pruned_loss=0.08367, over 1796401.00 frames. 
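[Editor's note] The "Computing validation loss" record above, the "Epoch 5, validation: loss=0.2683, simple_loss=0.3693, pruned_loss=0.08367, over 1796401.00 frames." summary, and the "Maximum memory allocated so far ..." record that follows are produced by the periodic validation pass interleaved with training. The sketch below is a minimal illustration of such a pass, not the actual train.py code: the batch-to-loss interface, the frame weighting, and the memory report are assumptions chosen only to mirror the fields that appear in these records.

```python
import torch

def compute_validation_loss(model, dev_loader, device="cuda:2"):
    # Hedged sketch only: the real train.py / model interface differs.
    # `model(batch)` is assumed to return per-batch sums of the three
    # losses plus the number of acoustic frames they cover.
    model.eval()
    tot = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}
    tot_frames = 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, simple_loss, pruned_loss, num_frames = model(batch)
            tot["loss"] += float(loss)
            tot["simple_loss"] += float(simple_loss)
            tot["pruned_loss"] += float(pruned_loss)
            tot_frames += float(num_frames)
    model.train()
    # Frame-weighted averages, analogous to "validation: loss=..., over N frames."
    avg = {k: v / tot_frames for k, v in tot.items()}
    # Analogous to the "Maximum memory allocated so far is ...MB" record.
    max_mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    return avg, tot_frames, max_mem_mb
```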
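[Editor's note] The periodic "[optim.py:471] Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=..." records summarize the distribution of recent gradient norms: the five numbers read as min/25%/median/75%/max, and in every such record in this log the reported threshold equals Clipping_scale times the median (for example 2.0 x 3.483e+02 = 6.966e+02, and 2.0 x 3.367e+02 = 6.734e+02). The class below reproduces that bookkeeping under stated assumptions; the window size, the reset behaviour of percent-clipped, and how the threshold is applied to the optimizer update are guesses, not the optim.py implementation.

```python
from collections import deque
import torch

class GradNormClipStats:
    """Sketch of the bookkeeping behind the 'grad-norm quartiles' records.

    Assumption: threshold = clipping_scale * median of recently seen norms.
    """

    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)   # recent per-batch gradient norms
        self.seen = 0
        self.clipped = 0                    # reset policy is an assumption

    def _quartiles(self) -> torch.Tensor:
        x = torch.tensor(list(self.norms))
        return torch.quantile(x, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))

    def update(self, grad_norm: float) -> float:
        """Record one batch's gradient norm and return the clip threshold."""
        self.norms.append(grad_norm)
        threshold = self.clipping_scale * self._quartiles()[2].item()
        self.seen += 1
        if grad_norm > threshold:
            self.clipped += 1
        return threshold

    def log_line(self) -> str:
        q = self._quartiles()
        thr = self.clipping_scale * q[2].item()
        pct = 100.0 * self.clipped / max(1, self.seen)
        quartiles = " ".join(f"{v:.3e}" for v in q.tolist())
        return (f"Clipping_scale={self.clipping_scale}, grad-norm quartiles "
                f"{quartiles}, threshold={thr:.3e}, percent-clipped={pct:.1f}")
```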
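[Editor's note] The many "[scaling.py:182] ScheduledFloat: name=..., batch_count=..., ans=..." records report regularization hyper-parameters (skip rates, dropout probabilities, balancer probabilities, scale minima) whose current value ("ans") depends on how many batches have been trained ("batch_count"). As an illustration only, the toy class below evaluates such a schedule by piecewise-linear interpolation between breakpoints; both the interpolation rule and the example breakpoints are assumptions and do not come from scaling.py.

```python
from bisect import bisect_right

class ScheduledValue:
    """Toy sketch of a float hyper-parameter scheduled on batch_count.

    Assumes piecewise-linear interpolation between (batch_count, value)
    breakpoints and a constant value outside the breakpoint range.
    """

    def __init__(self, *points):
        # e.g. ScheduledValue((0.0, 0.2), (4000.0, 0.0)) decays 0.2 -> 0.0
        self.points = sorted(points)

    def value(self, batch_count: float) -> float:
        xs = [x for x, _ in self.points]
        i = bisect_right(xs, batch_count)
        if i == 0:
            return self.points[0][1]
        if i == len(self.points):
            return self.points[-1][1]
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        t = (batch_count - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)

# Example with invented breakpoints: a skip rate that decays to 0.0 early in
# training reads back as 0.0 late in training, the same style of value the
# 'ans=' field reports at large batch_count.
conv_skip_rate = ScheduledValue((0.0, 0.2), (4000.0, 0.0))
print(conv_skip_rate.value(861114.0))  # -> 0.0
```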
2023-06-21 04:08:38,097 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 04:08:49,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=875874.0, ans=0.125 2023-06-21 04:08:55,310 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:09:03,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=875934.0, ans=0.0 2023-06-21 04:09:43,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.178e+02 3.721e+02 4.593e+02 6.442e+02, threshold=7.441e+02, percent-clipped=0.0 2023-06-21 04:09:54,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=876054.0, ans=0.0 2023-06-21 04:09:54,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=876054.0, ans=0.125 2023-06-21 04:10:17,555 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:10:18,501 INFO [train.py:996] (2/4) Epoch 5, batch 24050, loss[loss=0.2086, simple_loss=0.2975, pruned_loss=0.05983, over 21711.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3359, pruned_loss=0.09857, over 4270451.14 frames. ], batch size: 247, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:10:33,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=876234.0, ans=0.125 2023-06-21 04:10:34,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=876234.0, ans=0.125 2023-06-21 04:10:39,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=876234.0, ans=0.2 2023-06-21 04:11:57,815 INFO [train.py:996] (2/4) Epoch 5, batch 24100, loss[loss=0.2273, simple_loss=0.3003, pruned_loss=0.07717, over 21156.00 frames. ], tot_loss[loss=0.264, simple_loss=0.3353, pruned_loss=0.09638, over 4273601.82 frames. ], batch size: 143, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:12:01,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=876474.0, ans=0.125 2023-06-21 04:12:57,147 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.320e+02 2.915e+02 3.291e+02 4.001e+02 6.877e+02, threshold=6.582e+02, percent-clipped=0.0 2023-06-21 04:13:10,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=876714.0, ans=0.2 2023-06-21 04:13:26,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=876714.0, ans=0.1 2023-06-21 04:13:31,058 INFO [train.py:996] (2/4) Epoch 5, batch 24150, loss[loss=0.2644, simple_loss=0.3254, pruned_loss=0.1017, over 21927.00 frames. ], tot_loss[loss=0.2637, simple_loss=0.3335, pruned_loss=0.09691, over 4273974.89 frames. ], batch size: 316, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:13:38,692 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.64 vs. 
limit=15.0 2023-06-21 04:14:17,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=12.0 2023-06-21 04:14:29,350 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:14:40,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=876954.0, ans=0.025 2023-06-21 04:15:11,012 INFO [train.py:996] (2/4) Epoch 5, batch 24200, loss[loss=0.3782, simple_loss=0.4384, pruned_loss=0.159, over 21507.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.337, pruned_loss=0.09891, over 4276763.54 frames. ], batch size: 508, lr: 6.04e-03, grad_scale: 16.0 2023-06-21 04:15:55,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=877134.0, ans=0.0 2023-06-21 04:16:05,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=877194.0, ans=0.0 2023-06-21 04:16:17,777 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.953e+02 2.992e+02 3.434e+02 4.148e+02 5.774e+02, threshold=6.868e+02, percent-clipped=0.0 2023-06-21 04:16:47,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=877314.0, ans=0.125 2023-06-21 04:16:58,535 INFO [train.py:996] (2/4) Epoch 5, batch 24250, loss[loss=0.2094, simple_loss=0.3128, pruned_loss=0.05299, over 21857.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.334, pruned_loss=0.09174, over 4277386.11 frames. ], batch size: 371, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:17:21,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=877374.0, ans=0.0 2023-06-21 04:17:43,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=877494.0, ans=0.0 2023-06-21 04:18:22,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=877614.0, ans=0.125 2023-06-21 04:18:37,950 INFO [train.py:996] (2/4) Epoch 5, batch 24300, loss[loss=0.2071, simple_loss=0.2725, pruned_loss=0.0709, over 21816.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3255, pruned_loss=0.0852, over 4277393.67 frames. ], batch size: 107, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:18:44,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=877674.0, ans=0.1 2023-06-21 04:18:45,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=15.0 2023-06-21 04:19:02,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=877734.0, ans=0.125 2023-06-21 04:19:07,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=877734.0, ans=0.0 2023-06-21 04:19:42,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. 
limit=12.0 2023-06-21 04:19:42,949 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.429e+02 3.041e+02 4.140e+02 6.830e+02, threshold=6.081e+02, percent-clipped=0.0 2023-06-21 04:19:51,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5 2023-06-21 04:20:20,914 INFO [train.py:996] (2/4) Epoch 5, batch 24350, loss[loss=0.2751, simple_loss=0.3339, pruned_loss=0.1082, over 20941.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3211, pruned_loss=0.08514, over 4284489.21 frames. ], batch size: 607, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:20:27,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_ff2.min_abs, batch_count=877974.0, ans=0.1 2023-06-21 04:20:49,945 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 04:21:48,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=878214.0, ans=0.125 2023-06-21 04:22:04,679 INFO [train.py:996] (2/4) Epoch 5, batch 24400, loss[loss=0.252, simple_loss=0.3238, pruned_loss=0.09013, over 21691.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3252, pruned_loss=0.08889, over 4283421.15 frames. ], batch size: 298, lr: 6.03e-03, grad_scale: 32.0 2023-06-21 04:22:21,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=878334.0, ans=0.2 2023-06-21 04:22:39,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=878334.0, ans=0.0 2023-06-21 04:22:42,457 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-21 04:23:06,242 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.330e+02 3.732e+02 4.584e+02 7.697e+02, threshold=7.464e+02, percent-clipped=2.0 2023-06-21 04:23:18,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=878454.0, ans=0.125 2023-06-21 04:23:44,691 INFO [train.py:996] (2/4) Epoch 5, batch 24450, loss[loss=0.3302, simple_loss=0.4118, pruned_loss=0.1243, over 21458.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3286, pruned_loss=0.0912, over 4281310.15 frames. 
], batch size: 508, lr: 6.03e-03, grad_scale: 32.0 2023-06-21 04:23:45,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=878574.0, ans=0.125 2023-06-21 04:24:02,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=878634.0, ans=0.125 2023-06-21 04:24:13,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=878634.0, ans=0.0 2023-06-21 04:24:28,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=878694.0, ans=0.5 2023-06-21 04:24:29,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=878694.0, ans=0.035 2023-06-21 04:24:35,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=878754.0, ans=0.125 2023-06-21 04:24:45,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=878754.0, ans=0.025 2023-06-21 04:25:14,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=878814.0, ans=0.0 2023-06-21 04:25:20,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=878814.0, ans=0.125 2023-06-21 04:25:22,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=878874.0, ans=0.0 2023-06-21 04:25:23,358 INFO [train.py:996] (2/4) Epoch 5, batch 24500, loss[loss=0.2244, simple_loss=0.2958, pruned_loss=0.07649, over 21219.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3283, pruned_loss=0.091, over 4278107.62 frames. ], batch size: 608, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:25:23,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=878874.0, ans=0.125 2023-06-21 04:25:56,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=878934.0, ans=0.125 2023-06-21 04:26:30,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=879054.0, ans=0.0 2023-06-21 04:26:31,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.806e+02 3.370e+02 4.048e+02 6.223e+02, threshold=6.740e+02, percent-clipped=0.0 2023-06-21 04:26:52,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-21 04:27:02,341 INFO [train.py:996] (2/4) Epoch 5, batch 24550, loss[loss=0.3411, simple_loss=0.3961, pruned_loss=0.143, over 21237.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3334, pruned_loss=0.09533, over 4284234.80 frames. ], batch size: 143, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:28:39,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=879414.0, ans=0.0 2023-06-21 04:28:42,266 INFO [train.py:996] (2/4) Epoch 5, batch 24600, loss[loss=0.2063, simple_loss=0.2618, pruned_loss=0.07538, over 21149.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.329, pruned_loss=0.09613, over 4276963.35 frames. 
], batch size: 143, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:29:13,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=879534.0, ans=0.0 2023-06-21 04:29:31,403 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-21 04:29:52,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=879654.0, ans=0.125 2023-06-21 04:29:53,775 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.443e+02 3.125e+02 3.627e+02 4.480e+02 7.581e+02, threshold=7.254e+02, percent-clipped=2.0 2023-06-21 04:30:21,440 INFO [train.py:996] (2/4) Epoch 5, batch 24650, loss[loss=0.263, simple_loss=0.3929, pruned_loss=0.06659, over 19850.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3214, pruned_loss=0.09397, over 4273677.99 frames. ], batch size: 702, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:30:35,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=879774.0, ans=0.1 2023-06-21 04:31:13,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=879894.0, ans=0.125 2023-06-21 04:31:31,787 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-21 04:32:07,121 INFO [train.py:996] (2/4) Epoch 5, batch 24700, loss[loss=0.2018, simple_loss=0.258, pruned_loss=0.07285, over 20667.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3185, pruned_loss=0.09146, over 4261630.97 frames. ], batch size: 607, lr: 6.03e-03, grad_scale: 16.0 2023-06-21 04:32:28,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=880134.0, ans=0.125 2023-06-21 04:32:38,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=880134.0, ans=0.125 2023-06-21 04:33:13,145 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.127e+02 2.830e+02 3.084e+02 3.762e+02 5.962e+02, threshold=6.167e+02, percent-clipped=0.0 2023-06-21 04:33:31,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=880314.0, ans=0.125 2023-06-21 04:33:39,758 INFO [train.py:996] (2/4) Epoch 5, batch 24750, loss[loss=0.1978, simple_loss=0.2602, pruned_loss=0.06771, over 21207.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3124, pruned_loss=0.08916, over 4262793.29 frames. ], batch size: 159, lr: 6.02e-03, grad_scale: 8.0 2023-06-21 04:33:55,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.29 vs. limit=15.0 2023-06-21 04:34:05,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=15.0 2023-06-21 04:35:18,242 INFO [train.py:996] (2/4) Epoch 5, batch 24800, loss[loss=0.2214, simple_loss=0.2888, pruned_loss=0.07699, over 21391.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3068, pruned_loss=0.08823, over 4271108.91 frames. 
], batch size: 194, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:35:40,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=880734.0, ans=0.5 2023-06-21 04:35:40,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=880734.0, ans=0.0 2023-06-21 04:36:31,840 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.672e+02 2.950e+02 3.460e+02 6.225e+02, threshold=5.900e+02, percent-clipped=1.0 2023-06-21 04:36:32,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=880854.0, ans=0.125 2023-06-21 04:36:57,253 INFO [train.py:996] (2/4) Epoch 5, batch 24850, loss[loss=0.2954, simple_loss=0.367, pruned_loss=0.1119, over 21544.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3069, pruned_loss=0.0898, over 4270314.85 frames. ], batch size: 471, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:37:58,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.66 vs. limit=10.0 2023-06-21 04:38:00,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-21 04:38:17,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=881154.0, ans=0.125 2023-06-21 04:38:24,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=881214.0, ans=0.0 2023-06-21 04:38:36,651 INFO [train.py:996] (2/4) Epoch 5, batch 24900, loss[loss=0.2848, simple_loss=0.3553, pruned_loss=0.1072, over 21484.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3091, pruned_loss=0.08967, over 4275232.76 frames. ], batch size: 131, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:39:51,282 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.047e+02 3.066e+02 3.454e+02 4.012e+02 6.143e+02, threshold=6.909e+02, percent-clipped=1.0 2023-06-21 04:40:22,254 INFO [train.py:996] (2/4) Epoch 5, batch 24950, loss[loss=0.2847, simple_loss=0.3555, pruned_loss=0.107, over 21388.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3185, pruned_loss=0.09467, over 4277307.99 frames. ], batch size: 159, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:41:21,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=881694.0, ans=0.125 2023-06-21 04:41:23,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.42 vs. limit=15.0 2023-06-21 04:41:27,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=881754.0, ans=0.125 2023-06-21 04:41:41,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=881754.0, ans=0.125 2023-06-21 04:41:54,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.91 vs. 
limit=15.0 2023-06-21 04:42:02,089 INFO [train.py:996] (2/4) Epoch 5, batch 25000, loss[loss=0.2772, simple_loss=0.3449, pruned_loss=0.1047, over 20744.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3265, pruned_loss=0.09721, over 4280515.81 frames. ], batch size: 607, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:42:18,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=881874.0, ans=0.0 2023-06-21 04:42:56,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=881994.0, ans=0.0 2023-06-21 04:43:10,523 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.862e+02 3.406e+02 4.060e+02 6.504e+02, threshold=6.812e+02, percent-clipped=0.0 2023-06-21 04:43:12,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=882054.0, ans=0.125 2023-06-21 04:43:40,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=882174.0, ans=0.125 2023-06-21 04:43:46,048 INFO [train.py:996] (2/4) Epoch 5, batch 25050, loss[loss=0.2606, simple_loss=0.3127, pruned_loss=0.1043, over 21865.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3189, pruned_loss=0.09581, over 4278236.25 frames. ], batch size: 373, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:44:45,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=882354.0, ans=0.125 2023-06-21 04:45:20,796 INFO [train.py:996] (2/4) Epoch 5, batch 25100, loss[loss=0.2232, simple_loss=0.2873, pruned_loss=0.07961, over 21688.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3115, pruned_loss=0.09384, over 4280003.08 frames. ], batch size: 417, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:45:27,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=882474.0, ans=0.1 2023-06-21 04:46:12,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=882594.0, ans=0.125 2023-06-21 04:46:29,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.734e+02 3.137e+02 3.918e+02 6.199e+02, threshold=6.274e+02, percent-clipped=0.0 2023-06-21 04:46:59,088 INFO [train.py:996] (2/4) Epoch 5, batch 25150, loss[loss=0.2587, simple_loss=0.3366, pruned_loss=0.09043, over 21916.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3156, pruned_loss=0.09127, over 4261645.18 frames. 
], batch size: 316, lr: 6.02e-03, grad_scale: 16.0 2023-06-21 04:47:18,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=882834.0, ans=0.0 2023-06-21 04:47:21,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=882834.0, ans=0.125 2023-06-21 04:47:36,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=882834.0, ans=0.125 2023-06-21 04:48:11,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=882954.0, ans=0.125 2023-06-21 04:48:15,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=883014.0, ans=0.125 2023-06-21 04:48:37,298 INFO [train.py:996] (2/4) Epoch 5, batch 25200, loss[loss=0.2119, simple_loss=0.2848, pruned_loss=0.06948, over 21363.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3155, pruned_loss=0.08883, over 4268701.70 frames. ], batch size: 131, lr: 6.02e-03, grad_scale: 32.0 2023-06-21 04:48:42,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=883074.0, ans=0.0 2023-06-21 04:48:43,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=883074.0, ans=0.1 2023-06-21 04:49:04,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=883134.0, ans=0.2 2023-06-21 04:49:46,909 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.947e+02 2.669e+02 3.257e+02 4.012e+02 7.318e+02, threshold=6.513e+02, percent-clipped=2.0 2023-06-21 04:50:01,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=883314.0, ans=0.1 2023-06-21 04:50:13,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=883314.0, ans=0.125 2023-06-21 04:50:17,422 INFO [train.py:996] (2/4) Epoch 5, batch 25250, loss[loss=0.2408, simple_loss=0.3017, pruned_loss=0.08997, over 21521.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3132, pruned_loss=0.08707, over 4252323.38 frames. ], batch size: 414, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:50:54,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=883434.0, ans=0.0 2023-06-21 04:51:18,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=883554.0, ans=0.1 2023-06-21 04:51:22,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=883554.0, ans=0.125 2023-06-21 04:51:57,249 INFO [train.py:996] (2/4) Epoch 5, batch 25300, loss[loss=0.2804, simple_loss=0.3403, pruned_loss=0.1102, over 21311.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.311, pruned_loss=0.0859, over 4259568.23 frames. 
], batch size: 176, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:52:02,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=883674.0, ans=0.95 2023-06-21 04:52:15,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=883674.0, ans=0.0 2023-06-21 04:52:19,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=883734.0, ans=0.0 2023-06-21 04:52:36,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=883734.0, ans=0.125 2023-06-21 04:53:02,445 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.159e+02 2.770e+02 3.140e+02 3.813e+02 4.907e+02, threshold=6.281e+02, percent-clipped=0.0 2023-06-21 04:53:33,697 INFO [train.py:996] (2/4) Epoch 5, batch 25350, loss[loss=0.2636, simple_loss=0.3395, pruned_loss=0.09383, over 21465.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3157, pruned_loss=0.08626, over 4260083.27 frames. ], batch size: 471, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:53:42,657 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-06-21 04:54:01,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=884034.0, ans=0.2 2023-06-21 04:54:11,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=884034.0, ans=0.2 2023-06-21 04:54:25,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=884094.0, ans=0.0 2023-06-21 04:54:59,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=884214.0, ans=0.125 2023-06-21 04:55:07,856 INFO [train.py:996] (2/4) Epoch 5, batch 25400, loss[loss=0.2291, simple_loss=0.2989, pruned_loss=0.07966, over 21606.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3109, pruned_loss=0.08555, over 4265891.25 frames. ], batch size: 298, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:55:31,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=884334.0, ans=0.1 2023-06-21 04:56:11,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-21 04:56:15,910 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.763e+02 3.058e+02 3.669e+02 6.374e+02, threshold=6.116e+02, percent-clipped=1.0 2023-06-21 04:56:46,811 INFO [train.py:996] (2/4) Epoch 5, batch 25450, loss[loss=0.2292, simple_loss=0.3094, pruned_loss=0.07451, over 21269.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3124, pruned_loss=0.08728, over 4267018.60 frames. 
], batch size: 176, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 04:57:06,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=884634.0, ans=0.0 2023-06-21 04:57:16,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=884634.0, ans=0.04949747468305833 2023-06-21 04:57:25,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=884634.0, ans=0.125 2023-06-21 04:58:31,851 INFO [train.py:996] (2/4) Epoch 5, batch 25500, loss[loss=0.2157, simple_loss=0.2958, pruned_loss=0.06778, over 21415.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3116, pruned_loss=0.08366, over 4252816.43 frames. ], batch size: 194, lr: 6.01e-03, grad_scale: 16.0 2023-06-21 04:59:28,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=884994.0, ans=0.125 2023-06-21 04:59:44,015 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.921e+02 2.815e+02 3.207e+02 3.771e+02 6.756e+02, threshold=6.413e+02, percent-clipped=1.0 2023-06-21 04:59:56,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=885114.0, ans=0.04949747468305833 2023-06-21 05:00:13,122 INFO [train.py:996] (2/4) Epoch 5, batch 25550, loss[loss=0.2814, simple_loss=0.3736, pruned_loss=0.09458, over 21566.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3201, pruned_loss=0.08447, over 4247320.28 frames. ], batch size: 471, lr: 6.01e-03, grad_scale: 16.0 2023-06-21 05:00:23,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=885174.0, ans=0.125 2023-06-21 05:02:02,411 INFO [train.py:996] (2/4) Epoch 5, batch 25600, loss[loss=0.2555, simple_loss=0.3286, pruned_loss=0.09122, over 21946.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3239, pruned_loss=0.08583, over 4249993.13 frames. ], batch size: 316, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:02:08,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.10 vs. limit=6.0 2023-06-21 05:02:14,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=885474.0, ans=0.125 2023-06-21 05:02:30,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=885534.0, ans=0.2 2023-06-21 05:02:59,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=885654.0, ans=0.125 2023-06-21 05:03:03,965 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.014e+02 2.775e+02 3.286e+02 3.783e+02 5.833e+02, threshold=6.573e+02, percent-clipped=0.0 2023-06-21 05:03:41,883 INFO [train.py:996] (2/4) Epoch 5, batch 25650, loss[loss=0.2497, simple_loss=0.309, pruned_loss=0.09526, over 21800.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3254, pruned_loss=0.0893, over 4251506.20 frames. 
], batch size: 317, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:03:55,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=885774.0, ans=0.0 2023-06-21 05:04:01,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=885834.0, ans=0.0 2023-06-21 05:04:06,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-21 05:04:20,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-21 05:04:56,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=886014.0, ans=0.1 2023-06-21 05:05:21,312 INFO [train.py:996] (2/4) Epoch 5, batch 25700, loss[loss=0.2466, simple_loss=0.3016, pruned_loss=0.09586, over 21494.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3209, pruned_loss=0.09012, over 4251799.71 frames. ], batch size: 212, lr: 6.01e-03, grad_scale: 32.0 2023-06-21 05:05:29,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=886074.0, ans=0.0 2023-06-21 05:06:23,808 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.196e+02 2.865e+02 3.376e+02 4.055e+02 7.604e+02, threshold=6.752e+02, percent-clipped=2.0 2023-06-21 05:06:49,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=886314.0, ans=0.1 2023-06-21 05:06:50,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-21 05:06:58,904 INFO [train.py:996] (2/4) Epoch 5, batch 25750, loss[loss=0.4043, simple_loss=0.4783, pruned_loss=0.1652, over 21469.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3287, pruned_loss=0.09462, over 4258185.56 frames. ], batch size: 471, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:08:45,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=886674.0, ans=0.2 2023-06-21 05:08:45,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=886674.0, ans=0.2 2023-06-21 05:08:46,840 INFO [train.py:996] (2/4) Epoch 5, batch 25800, loss[loss=0.2502, simple_loss=0.3216, pruned_loss=0.08943, over 21625.00 frames. ], tot_loss[loss=0.2702, simple_loss=0.3412, pruned_loss=0.09958, over 4259958.57 frames. ], batch size: 263, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:08:49,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=886674.0, ans=0.2 2023-06-21 05:08:54,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.00 vs. 
limit=6.0 2023-06-21 05:09:24,296 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:09:25,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=886734.0, ans=0.04949747468305833 2023-06-21 05:09:59,475 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.058e+02 2.944e+02 3.581e+02 4.306e+02 8.254e+02, threshold=7.162e+02, percent-clipped=3.0 2023-06-21 05:10:26,655 INFO [train.py:996] (2/4) Epoch 5, batch 25850, loss[loss=0.293, simple_loss=0.3531, pruned_loss=0.1164, over 21783.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.3409, pruned_loss=0.09788, over 4264377.67 frames. ], batch size: 441, lr: 6.00e-03, grad_scale: 16.0 2023-06-21 05:10:48,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=887034.0, ans=0.0 2023-06-21 05:12:07,876 INFO [train.py:996] (2/4) Epoch 5, batch 25900, loss[loss=0.2714, simple_loss=0.3613, pruned_loss=0.09069, over 21691.00 frames. ], tot_loss[loss=0.2691, simple_loss=0.3415, pruned_loss=0.09838, over 4269914.19 frames. ], batch size: 247, lr: 6.00e-03, grad_scale: 16.0 2023-06-21 05:12:27,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=887274.0, ans=0.125 2023-06-21 05:12:32,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=887334.0, ans=0.2 2023-06-21 05:12:56,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-21 05:13:26,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.088e+02 3.549e+02 4.240e+02 5.933e+02, threshold=7.098e+02, percent-clipped=0.0 2023-06-21 05:13:27,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=887454.0, ans=0.0 2023-06-21 05:13:58,657 INFO [train.py:996] (2/4) Epoch 5, batch 25950, loss[loss=0.2435, simple_loss=0.3177, pruned_loss=0.08464, over 21320.00 frames. ], tot_loss[loss=0.2758, simple_loss=0.3477, pruned_loss=0.1019, over 4275546.37 frames. ], batch size: 549, lr: 6.00e-03, grad_scale: 16.0 2023-06-21 05:14:01,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. 
limit=15.0 2023-06-21 05:14:19,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=887634.0, ans=0.1 2023-06-21 05:14:24,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=887634.0, ans=0.0 2023-06-21 05:14:28,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=887634.0, ans=0.0 2023-06-21 05:14:45,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=887694.0, ans=0.0 2023-06-21 05:15:15,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=887814.0, ans=0.125 2023-06-21 05:15:15,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=887814.0, ans=0.1 2023-06-21 05:15:40,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-21 05:15:40,359 INFO [train.py:996] (2/4) Epoch 5, batch 26000, loss[loss=0.2252, simple_loss=0.3095, pruned_loss=0.0705, over 21821.00 frames. ], tot_loss[loss=0.2719, simple_loss=0.3457, pruned_loss=0.09904, over 4271112.87 frames. ], batch size: 282, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:15:51,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=887874.0, ans=0.95 2023-06-21 05:16:30,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=887994.0, ans=0.1 2023-06-21 05:16:35,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=887994.0, ans=0.125 2023-06-21 05:16:48,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=888054.0, ans=0.0 2023-06-21 05:16:49,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=888054.0, ans=0.125 2023-06-21 05:16:51,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=888054.0, ans=0.1 2023-06-21 05:16:52,269 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.179e+02 2.994e+02 3.502e+02 4.127e+02 6.076e+02, threshold=7.004e+02, percent-clipped=0.0 2023-06-21 05:16:57,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=888114.0, ans=0.0 2023-06-21 05:17:19,591 INFO [train.py:996] (2/4) Epoch 5, batch 26050, loss[loss=0.2434, simple_loss=0.3061, pruned_loss=0.09036, over 21421.00 frames. ], tot_loss[loss=0.2737, simple_loss=0.3459, pruned_loss=0.1007, over 4271432.55 frames. 
], batch size: 211, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:17:23,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=888174.0, ans=0.1 2023-06-21 05:17:39,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=888234.0, ans=0.1 2023-06-21 05:17:57,416 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:17:59,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=888234.0, ans=0.0 2023-06-21 05:18:16,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=888354.0, ans=0.1 2023-06-21 05:18:33,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=888414.0, ans=0.0 2023-06-21 05:18:51,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-21 05:18:58,131 INFO [train.py:996] (2/4) Epoch 5, batch 26100, loss[loss=0.2277, simple_loss=0.296, pruned_loss=0.07972, over 21961.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3407, pruned_loss=0.09942, over 4276053.75 frames. ], batch size: 333, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:19:11,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=888474.0, ans=0.125 2023-06-21 05:20:05,881 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.461e+02 2.983e+02 3.615e+02 4.836e+02 1.225e+03, threshold=7.230e+02, percent-clipped=7.0 2023-06-21 05:20:39,133 INFO [train.py:996] (2/4) Epoch 5, batch 26150, loss[loss=0.3061, simple_loss=0.3727, pruned_loss=0.1197, over 21293.00 frames. ], tot_loss[loss=0.269, simple_loss=0.3389, pruned_loss=0.09949, over 4280988.49 frames. ], batch size: 143, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:21:06,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=888834.0, ans=0.125 2023-06-21 05:21:06,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=888834.0, ans=0.2 2023-06-21 05:21:12,523 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:21:24,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=888894.0, ans=0.0 2023-06-21 05:21:37,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=888954.0, ans=0.0 2023-06-21 05:22:20,398 INFO [train.py:996] (2/4) Epoch 5, batch 26200, loss[loss=0.2284, simple_loss=0.3328, pruned_loss=0.06198, over 21290.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3384, pruned_loss=0.09656, over 4281699.76 frames. 
], batch size: 548, lr: 6.00e-03, grad_scale: 32.0 2023-06-21 05:22:24,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=889074.0, ans=0.04949747468305833 2023-06-21 05:23:12,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=889194.0, ans=0.125 2023-06-21 05:23:33,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.145e+02 2.909e+02 3.359e+02 4.257e+02 6.778e+02, threshold=6.718e+02, percent-clipped=0.0 2023-06-21 05:23:38,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=889314.0, ans=0.04949747468305833 2023-06-21 05:23:50,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=889314.0, ans=0.125 2023-06-21 05:24:01,214 INFO [train.py:996] (2/4) Epoch 5, batch 26250, loss[loss=0.2366, simple_loss=0.3212, pruned_loss=0.07599, over 21840.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3418, pruned_loss=0.0957, over 4278338.91 frames. ], batch size: 332, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:24:25,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=889434.0, ans=0.1 2023-06-21 05:24:29,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-21 05:25:39,650 INFO [train.py:996] (2/4) Epoch 5, batch 26300, loss[loss=0.2806, simple_loss=0.3369, pruned_loss=0.1121, over 21752.00 frames. ], tot_loss[loss=0.2671, simple_loss=0.3396, pruned_loss=0.09731, over 4288109.04 frames. ], batch size: 389, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:25:54,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=889674.0, ans=0.1 2023-06-21 05:25:55,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=889674.0, ans=0.2 2023-06-21 05:26:45,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=889854.0, ans=0.0 2023-06-21 05:26:55,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=889854.0, ans=10.0 2023-06-21 05:26:58,704 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.177e+02 2.888e+02 3.226e+02 3.870e+02 6.035e+02, threshold=6.451e+02, percent-clipped=0.0 2023-06-21 05:26:59,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=889854.0, ans=0.1 2023-06-21 05:27:25,020 INFO [train.py:996] (2/4) Epoch 5, batch 26350, loss[loss=0.2907, simple_loss=0.3523, pruned_loss=0.1145, over 21579.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3385, pruned_loss=0.0987, over 4292890.83 frames. 
], batch size: 263, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:27:28,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=889974.0, ans=0.0 2023-06-21 05:27:45,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=890034.0, ans=0.2 2023-06-21 05:27:52,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=890034.0, ans=0.0 2023-06-21 05:28:04,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=890094.0, ans=0.07 2023-06-21 05:28:05,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=890094.0, ans=0.05 2023-06-21 05:28:06,599 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-21 05:28:13,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.10 vs. limit=15.0 2023-06-21 05:28:59,015 INFO [train.py:996] (2/4) Epoch 5, batch 26400, loss[loss=0.2448, simple_loss=0.2946, pruned_loss=0.09746, over 21242.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3317, pruned_loss=0.09837, over 4286471.49 frames. ], batch size: 159, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:29:13,518 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=11.01 vs. limit=15.0 2023-06-21 05:29:16,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=890334.0, ans=0.125 2023-06-21 05:29:41,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=890394.0, ans=0.125 2023-06-21 05:29:42,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0 2023-06-21 05:30:10,885 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 2.951e+02 3.748e+02 4.421e+02 1.228e+03, threshold=7.496e+02, percent-clipped=6.0 2023-06-21 05:30:37,997 INFO [train.py:996] (2/4) Epoch 5, batch 26450, loss[loss=0.3041, simple_loss=0.3896, pruned_loss=0.1092, over 21684.00 frames. ], tot_loss[loss=0.2662, simple_loss=0.3354, pruned_loss=0.09843, over 4285489.55 frames. ], batch size: 298, lr: 5.99e-03, grad_scale: 32.0 2023-06-21 05:31:36,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=890754.0, ans=0.0 2023-06-21 05:31:56,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=890814.0, ans=0.1 2023-06-21 05:32:01,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=890814.0, ans=0.125 2023-06-21 05:32:19,188 INFO [train.py:996] (2/4) Epoch 5, batch 26500, loss[loss=0.2603, simple_loss=0.3517, pruned_loss=0.08444, over 21651.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3368, pruned_loss=0.09697, over 4278178.80 frames. 
], batch size: 441, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:32:30,417 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=12.0 2023-06-21 05:32:41,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=890934.0, ans=0.2 2023-06-21 05:33:42,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.071e+02 3.801e+02 4.543e+02 1.004e+03, threshold=7.603e+02, percent-clipped=5.0 2023-06-21 05:33:49,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=891114.0, ans=0.125 2023-06-21 05:34:00,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-21 05:34:01,429 INFO [train.py:996] (2/4) Epoch 5, batch 26550, loss[loss=0.2143, simple_loss=0.3201, pruned_loss=0.05428, over 21104.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3319, pruned_loss=0.09323, over 4262319.43 frames. ], batch size: 548, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:34:07,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=891174.0, ans=0.0 2023-06-21 05:34:40,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=12.0 2023-06-21 05:34:49,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=891294.0, ans=0.125 2023-06-21 05:35:23,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=891414.0, ans=0.125 2023-06-21 05:35:40,883 INFO [train.py:996] (2/4) Epoch 5, batch 26600, loss[loss=0.254, simple_loss=0.3235, pruned_loss=0.0922, over 21493.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3302, pruned_loss=0.09022, over 4257827.78 frames. ], batch size: 389, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:35:52,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=891474.0, ans=0.125 2023-06-21 05:36:06,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.62 vs. limit=15.0 2023-06-21 05:36:15,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.02 vs. 
limit=15.0 2023-06-21 05:36:18,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=891534.0, ans=0.2 2023-06-21 05:36:37,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=891594.0, ans=0.2 2023-06-21 05:36:59,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.044e+02 3.559e+02 4.505e+02 6.702e+02, threshold=7.118e+02, percent-clipped=0.0 2023-06-21 05:37:05,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=891714.0, ans=0.0 2023-06-21 05:37:08,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=891714.0, ans=0.125 2023-06-21 05:37:15,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=891714.0, ans=0.1 2023-06-21 05:37:23,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-21 05:37:23,637 INFO [train.py:996] (2/4) Epoch 5, batch 26650, loss[loss=0.2094, simple_loss=0.2952, pruned_loss=0.06176, over 21591.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3226, pruned_loss=0.08861, over 4251573.39 frames. ], batch size: 442, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:37:31,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=891774.0, ans=0.1 2023-06-21 05:38:06,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=891894.0, ans=0.0 2023-06-21 05:38:08,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=891894.0, ans=0.1 2023-06-21 05:38:19,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=891954.0, ans=0.125 2023-06-21 05:38:29,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=891954.0, ans=0.1 2023-06-21 05:38:44,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=892014.0, ans=0.2 2023-06-21 05:39:01,926 INFO [train.py:996] (2/4) Epoch 5, batch 26700, loss[loss=0.2476, simple_loss=0.3031, pruned_loss=0.09607, over 20005.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3161, pruned_loss=0.08619, over 4256016.22 frames. ], batch size: 702, lr: 5.99e-03, grad_scale: 16.0 2023-06-21 05:39:25,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.81 vs. 
limit=22.5 2023-06-21 05:40:01,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=892194.0, ans=0.1 2023-06-21 05:40:01,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=892194.0, ans=0.09899494936611666 2023-06-21 05:40:03,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=892194.0, ans=0.125 2023-06-21 05:40:15,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=892254.0, ans=0.0 2023-06-21 05:40:18,710 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.683e+02 2.509e+02 2.895e+02 3.334e+02 4.980e+02, threshold=5.790e+02, percent-clipped=0.0 2023-06-21 05:40:47,751 INFO [train.py:996] (2/4) Epoch 5, batch 26750, loss[loss=0.2514, simple_loss=0.3292, pruned_loss=0.08682, over 21456.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3143, pruned_loss=0.08434, over 4263757.27 frames. ], batch size: 211, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:41:29,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-21 05:42:02,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=892614.0, ans=0.0 2023-06-21 05:42:23,665 INFO [train.py:996] (2/4) Epoch 5, batch 26800, loss[loss=0.2751, simple_loss=0.3398, pruned_loss=0.1052, over 21618.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3212, pruned_loss=0.08839, over 4267969.40 frames. ], batch size: 263, lr: 5.98e-03, grad_scale: 32.0 2023-06-21 05:43:45,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 3.040e+02 3.566e+02 4.548e+02 6.934e+02, threshold=7.132e+02, percent-clipped=4.0 2023-06-21 05:43:45,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=892854.0, ans=0.125 2023-06-21 05:44:03,878 INFO [train.py:996] (2/4) Epoch 5, batch 26850, loss[loss=0.2192, simple_loss=0.2726, pruned_loss=0.08291, over 21264.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3234, pruned_loss=0.09172, over 4262756.61 frames. ], batch size: 176, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:44:35,216 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.64 vs. limit=15.0 2023-06-21 05:44:55,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=893094.0, ans=0.0 2023-06-21 05:44:55,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=893094.0, ans=0.0 2023-06-21 05:45:19,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-06-21 05:45:32,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.47 vs. 
limit=12.0 2023-06-21 05:45:39,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=893214.0, ans=0.125 2023-06-21 05:45:43,444 INFO [train.py:996] (2/4) Epoch 5, batch 26900, loss[loss=0.2169, simple_loss=0.2789, pruned_loss=0.07751, over 21535.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3154, pruned_loss=0.09112, over 4258678.08 frames. ], batch size: 391, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:46:32,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=893394.0, ans=0.05 2023-06-21 05:46:47,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=893454.0, ans=0.2 2023-06-21 05:47:04,565 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.809e+02 3.300e+02 3.699e+02 7.956e+02, threshold=6.601e+02, percent-clipped=1.0 2023-06-21 05:47:10,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=893514.0, ans=0.125 2023-06-21 05:47:10,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.54 vs. limit=22.5 2023-06-21 05:47:22,082 INFO [train.py:996] (2/4) Epoch 5, batch 26950, loss[loss=0.2333, simple_loss=0.318, pruned_loss=0.07432, over 21504.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3149, pruned_loss=0.09084, over 4257415.46 frames. ], batch size: 212, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:47:37,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-21 05:48:19,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=893694.0, ans=0.0 2023-06-21 05:49:02,220 INFO [train.py:996] (2/4) Epoch 5, batch 27000, loss[loss=0.2725, simple_loss=0.3564, pruned_loss=0.09426, over 21572.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3148, pruned_loss=0.08829, over 4263688.46 frames. ], batch size: 442, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:49:02,221 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 05:49:20,490 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2444, simple_loss=0.3449, pruned_loss=0.07195, over 1796401.00 frames. 
2023-06-21 05:49:20,491 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 05:49:26,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=893874.0, ans=0.2 2023-06-21 05:49:54,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=893934.0, ans=0.2 2023-06-21 05:49:57,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=893934.0, ans=0.0 2023-06-21 05:50:05,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=893994.0, ans=0.0 2023-06-21 05:50:38,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.497e+02 2.990e+02 3.496e+02 4.876e+02, threshold=5.980e+02, percent-clipped=0.0 2023-06-21 05:50:41,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-21 05:50:56,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=894114.0, ans=0.0 2023-06-21 05:51:01,169 INFO [train.py:996] (2/4) Epoch 5, batch 27050, loss[loss=0.2793, simple_loss=0.3429, pruned_loss=0.1079, over 21629.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.316, pruned_loss=0.08473, over 4270176.59 frames. ], batch size: 471, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:51:40,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=894234.0, ans=0.125 2023-06-21 05:51:46,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=894294.0, ans=0.125 2023-06-21 05:51:54,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=894294.0, ans=0.2 2023-06-21 05:51:57,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=894294.0, ans=0.125 2023-06-21 05:52:06,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-21 05:52:07,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=894354.0, ans=0.1 2023-06-21 05:52:42,216 INFO [train.py:996] (2/4) Epoch 5, batch 27100, loss[loss=0.2612, simple_loss=0.3296, pruned_loss=0.09637, over 21842.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3193, pruned_loss=0.08602, over 4272782.27 frames. 
], batch size: 124, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:53:17,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=894534.0, ans=0.125 2023-06-21 05:53:26,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=894594.0, ans=0.125 2023-06-21 05:53:35,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=894654.0, ans=0.125 2023-06-21 05:54:00,321 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.220e+02 3.188e+02 3.922e+02 5.852e+02 9.183e+02, threshold=7.845e+02, percent-clipped=23.0 2023-06-21 05:54:18,417 INFO [train.py:996] (2/4) Epoch 5, batch 27150, loss[loss=0.2283, simple_loss=0.3409, pruned_loss=0.05788, over 20092.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3278, pruned_loss=0.0882, over 4274763.39 frames. ], batch size: 703, lr: 5.98e-03, grad_scale: 16.0 2023-06-21 05:54:24,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=894774.0, ans=0.2 2023-06-21 05:54:27,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=894774.0, ans=0.07 2023-06-21 05:54:28,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=894774.0, ans=0.125 2023-06-21 05:54:50,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=894834.0, ans=0.2 2023-06-21 05:55:26,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=894954.0, ans=0.1 2023-06-21 05:55:39,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=894954.0, ans=0.04949747468305833 2023-06-21 05:55:50,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=895014.0, ans=0.1 2023-06-21 05:55:58,211 INFO [train.py:996] (2/4) Epoch 5, batch 27200, loss[loss=0.2473, simple_loss=0.3211, pruned_loss=0.08677, over 21440.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3367, pruned_loss=0.09175, over 4276369.41 frames. ], batch size: 211, lr: 5.98e-03, grad_scale: 32.0 2023-06-21 05:56:19,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=895074.0, ans=0.125 2023-06-21 05:56:45,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=895194.0, ans=0.125 2023-06-21 05:57:11,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=895254.0, ans=0.1 2023-06-21 05:57:23,627 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.367e+02 3.260e+02 3.703e+02 4.553e+02 9.386e+02, threshold=7.407e+02, percent-clipped=2.0 2023-06-21 05:57:52,288 INFO [train.py:996] (2/4) Epoch 5, batch 27250, loss[loss=0.2696, simple_loss=0.3339, pruned_loss=0.1026, over 21357.00 frames. ], tot_loss[loss=0.2674, simple_loss=0.3415, pruned_loss=0.09666, over 4281576.64 frames. 
], batch size: 143, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 05:58:07,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.24 vs. limit=22.5 2023-06-21 05:58:14,027 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 05:58:44,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=895494.0, ans=0.125 2023-06-21 05:59:33,995 INFO [train.py:996] (2/4) Epoch 5, batch 27300, loss[loss=0.2863, simple_loss=0.3592, pruned_loss=0.1067, over 21750.00 frames. ], tot_loss[loss=0.27, simple_loss=0.3441, pruned_loss=0.09793, over 4280488.44 frames. ], batch size: 247, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 05:59:49,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=895674.0, ans=0.125 2023-06-21 06:00:06,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=895734.0, ans=0.125 2023-06-21 06:00:48,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=895854.0, ans=0.1 2023-06-21 06:00:49,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-21 06:00:57,905 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 2.999e+02 3.424e+02 4.068e+02 6.879e+02, threshold=6.848e+02, percent-clipped=0.0 2023-06-21 06:00:59,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=895914.0, ans=0.0 2023-06-21 06:01:04,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=895914.0, ans=0.125 2023-06-21 06:01:15,082 INFO [train.py:996] (2/4) Epoch 5, batch 27350, loss[loss=0.2056, simple_loss=0.2958, pruned_loss=0.05773, over 21400.00 frames. ], tot_loss[loss=0.2725, simple_loss=0.3469, pruned_loss=0.09908, over 4282517.74 frames. ], batch size: 194, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:01:17,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-21 06:01:54,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=896094.0, ans=0.0 2023-06-21 06:02:54,659 INFO [train.py:996] (2/4) Epoch 5, batch 27400, loss[loss=0.2604, simple_loss=0.3181, pruned_loss=0.1014, over 21756.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3407, pruned_loss=0.09822, over 4285869.92 frames. ], batch size: 351, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:03:01,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=896274.0, ans=0.125 2023-06-21 06:03:12,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.29 vs. 
limit=22.5 2023-06-21 06:03:23,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=896334.0, ans=0.035 2023-06-21 06:03:52,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=896454.0, ans=0.125 2023-06-21 06:04:08,150 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.765e+02 3.152e+02 3.980e+02 5.730e+02, threshold=6.304e+02, percent-clipped=0.0 2023-06-21 06:04:23,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=896514.0, ans=0.04949747468305833 2023-06-21 06:04:34,754 INFO [train.py:996] (2/4) Epoch 5, batch 27450, loss[loss=0.257, simple_loss=0.3475, pruned_loss=0.08325, over 21660.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.334, pruned_loss=0.09616, over 4285731.47 frames. ], batch size: 414, lr: 5.97e-03, grad_scale: 16.0 2023-06-21 06:04:52,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=896634.0, ans=0.125 2023-06-21 06:05:40,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=896754.0, ans=0.125 2023-06-21 06:06:14,874 INFO [train.py:996] (2/4) Epoch 5, batch 27500, loss[loss=0.2183, simple_loss=0.2981, pruned_loss=0.06918, over 21620.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.332, pruned_loss=0.09642, over 4288638.43 frames. ], batch size: 231, lr: 5.97e-03, grad_scale: 16.0 2023-06-21 06:07:01,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=896994.0, ans=0.125 2023-06-21 06:07:26,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=897054.0, ans=0.125 2023-06-21 06:07:29,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.898e+02 3.228e+02 3.815e+02 7.854e+02, threshold=6.456e+02, percent-clipped=2.0 2023-06-21 06:07:51,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=897114.0, ans=0.125 2023-06-21 06:07:54,344 INFO [train.py:996] (2/4) Epoch 5, batch 27550, loss[loss=0.2243, simple_loss=0.2846, pruned_loss=0.08202, over 21721.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.326, pruned_loss=0.09263, over 4285778.73 frames. ], batch size: 124, lr: 5.97e-03, grad_scale: 16.0 2023-06-21 06:08:03,789 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.77 vs. limit=15.0 2023-06-21 06:08:58,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-21 06:09:02,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. 
limit=8.0 2023-06-21 06:09:04,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=897354.0, ans=0.125 2023-06-21 06:09:24,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=897414.0, ans=0.1 2023-06-21 06:09:29,816 INFO [train.py:996] (2/4) Epoch 5, batch 27600, loss[loss=0.2365, simple_loss=0.2925, pruned_loss=0.09031, over 21181.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3191, pruned_loss=0.09142, over 4285681.96 frames. ], batch size: 144, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:09:47,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=897534.0, ans=0.0 2023-06-21 06:10:19,714 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:10:28,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=897654.0, ans=0.0 2023-06-21 06:10:43,734 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.396e+02 2.760e+02 3.130e+02 3.904e+02 5.692e+02, threshold=6.260e+02, percent-clipped=0.0 2023-06-21 06:10:55,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=897714.0, ans=0.2 2023-06-21 06:10:55,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=897714.0, ans=0.125 2023-06-21 06:11:08,646 INFO [train.py:996] (2/4) Epoch 5, batch 27650, loss[loss=0.2507, simple_loss=0.3245, pruned_loss=0.08845, over 21637.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3137, pruned_loss=0.09068, over 4283349.56 frames. ], batch size: 389, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:11:10,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=897774.0, ans=0.0 2023-06-21 06:11:34,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=897834.0, ans=0.125 2023-06-21 06:11:39,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=897894.0, ans=0.125 2023-06-21 06:12:10,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=897954.0, ans=0.0 2023-06-21 06:12:40,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=898014.0, ans=0.04949747468305833 2023-06-21 06:12:48,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=898074.0, ans=0.125 2023-06-21 06:12:49,027 INFO [train.py:996] (2/4) Epoch 5, batch 27700, loss[loss=0.2188, simple_loss=0.2919, pruned_loss=0.07285, over 16635.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.314, pruned_loss=0.0892, over 4281802.09 frames. 
], batch size: 62, lr: 5.97e-03, grad_scale: 32.0 2023-06-21 06:13:21,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=898134.0, ans=0.125 2023-06-21 06:13:23,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=898134.0, ans=0.125 2023-06-21 06:13:30,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-21 06:14:07,895 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.243e+02 3.066e+02 3.761e+02 4.326e+02 8.310e+02, threshold=7.523e+02, percent-clipped=4.0 2023-06-21 06:14:28,411 INFO [train.py:996] (2/4) Epoch 5, batch 27750, loss[loss=0.2461, simple_loss=0.324, pruned_loss=0.0841, over 21911.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3173, pruned_loss=0.08852, over 4286896.76 frames. ], batch size: 316, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:14:59,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=898434.0, ans=0.125 2023-06-21 06:15:02,765 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:15:07,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=898494.0, ans=0.125 2023-06-21 06:16:06,811 INFO [train.py:996] (2/4) Epoch 5, batch 27800, loss[loss=0.2733, simple_loss=0.3348, pruned_loss=0.1059, over 21856.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3166, pruned_loss=0.08964, over 4289239.89 frames. ], batch size: 332, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:16:22,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=898734.0, ans=0.125 2023-06-21 06:16:46,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=898794.0, ans=0.1 2023-06-21 06:16:49,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2023-06-21 06:16:54,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=898794.0, ans=0.0 2023-06-21 06:17:02,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=898854.0, ans=0.1 2023-06-21 06:17:19,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=898854.0, ans=0.0 2023-06-21 06:17:26,844 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.753e+02 3.252e+02 3.951e+02 6.290e+02, threshold=6.504e+02, percent-clipped=0.0 2023-06-21 06:17:32,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=898914.0, ans=0.5 2023-06-21 06:17:45,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=898914.0, ans=0.1 2023-06-21 06:17:48,422 INFO [train.py:996] (2/4) Epoch 5, batch 27850, loss[loss=0.2529, simple_loss=0.3349, pruned_loss=0.08543, over 21788.00 frames. 
], tot_loss[loss=0.2495, simple_loss=0.3163, pruned_loss=0.09138, over 4290702.94 frames. ], batch size: 298, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:17:53,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=898974.0, ans=0.125 2023-06-21 06:18:07,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=899034.0, ans=0.125 2023-06-21 06:18:58,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=899154.0, ans=0.125 2023-06-21 06:19:17,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=899214.0, ans=0.015 2023-06-21 06:19:31,891 INFO [train.py:996] (2/4) Epoch 5, batch 27900, loss[loss=0.3098, simple_loss=0.3934, pruned_loss=0.1131, over 21482.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3262, pruned_loss=0.09222, over 4286544.59 frames. ], batch size: 471, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:19:37,479 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.51 vs. limit=15.0 2023-06-21 06:19:52,424 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:20:31,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-21 06:20:36,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=899454.0, ans=0.125 2023-06-21 06:20:48,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=899454.0, ans=0.0 2023-06-21 06:20:57,631 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.188e+02 2.911e+02 3.342e+02 3.967e+02 6.742e+02, threshold=6.683e+02, percent-clipped=1.0 2023-06-21 06:21:19,110 INFO [train.py:996] (2/4) Epoch 5, batch 27950, loss[loss=0.234, simple_loss=0.3154, pruned_loss=0.07626, over 21643.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3252, pruned_loss=0.08821, over 4284200.95 frames. 
], batch size: 263, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:21:21,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=899574.0, ans=0.125 2023-06-21 06:22:13,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=899694.0, ans=0.125 2023-06-21 06:22:24,133 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 06:22:28,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=899754.0, ans=0.125 2023-06-21 06:22:38,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=899814.0, ans=0.125 2023-06-21 06:22:50,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=899814.0, ans=0.1 2023-06-21 06:22:59,492 INFO [train.py:996] (2/4) Epoch 5, batch 28000, loss[loss=0.2275, simple_loss=0.2936, pruned_loss=0.08071, over 21714.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3232, pruned_loss=0.08601, over 4289847.57 frames. ], batch size: 230, lr: 5.96e-03, grad_scale: 32.0 2023-06-21 06:23:00,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=899874.0, ans=0.1 2023-06-21 06:23:11,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=899874.0, ans=0.125 2023-06-21 06:23:18,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=899874.0, ans=0.05 2023-06-21 06:23:59,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-21 06:24:19,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 2.872e+02 3.186e+02 3.800e+02 5.572e+02, threshold=6.373e+02, percent-clipped=0.0 2023-06-21 06:24:40,561 INFO [train.py:996] (2/4) Epoch 5, batch 28050, loss[loss=0.2174, simple_loss=0.2748, pruned_loss=0.07996, over 21770.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3209, pruned_loss=0.08757, over 4292984.21 frames. ], batch size: 118, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:25:43,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=900354.0, ans=0.0 2023-06-21 06:26:05,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=900414.0, ans=0.0 2023-06-21 06:26:19,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=900474.0, ans=0.0 2023-06-21 06:26:20,981 INFO [train.py:996] (2/4) Epoch 5, batch 28100, loss[loss=0.2384, simple_loss=0.2961, pruned_loss=0.09038, over 21643.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3199, pruned_loss=0.08783, over 4293385.32 frames. ], batch size: 282, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:27:47,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. 
limit=15.0 2023-06-21 06:27:47,391 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.042e+02 3.636e+02 4.421e+02 1.163e+03, threshold=7.272e+02, percent-clipped=7.0 2023-06-21 06:28:07,019 INFO [train.py:996] (2/4) Epoch 5, batch 28150, loss[loss=0.2187, simple_loss=0.2755, pruned_loss=0.08097, over 21498.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3156, pruned_loss=0.08844, over 4284374.57 frames. ], batch size: 212, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:28:37,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.71 vs. limit=5.0 2023-06-21 06:29:48,187 INFO [train.py:996] (2/4) Epoch 5, batch 28200, loss[loss=0.2841, simple_loss=0.3365, pruned_loss=0.1158, over 21900.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3148, pruned_loss=0.09016, over 4281588.74 frames. ], batch size: 372, lr: 5.96e-03, grad_scale: 16.0 2023-06-21 06:30:46,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.02 vs. limit=15.0 2023-06-21 06:30:58,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=901254.0, ans=0.0 2023-06-21 06:31:06,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=901314.0, ans=0.2 2023-06-21 06:31:14,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.109e+02 3.691e+02 4.482e+02 7.045e+02, threshold=7.382e+02, percent-clipped=0.0 2023-06-21 06:31:20,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=901314.0, ans=0.1 2023-06-21 06:31:22,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=901314.0, ans=0.125 2023-06-21 06:31:33,987 INFO [train.py:996] (2/4) Epoch 5, batch 28250, loss[loss=0.2186, simple_loss=0.2817, pruned_loss=0.07774, over 21811.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3181, pruned_loss=0.09357, over 4285030.92 frames. ], batch size: 118, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:31:58,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=901434.0, ans=0.1 2023-06-21 06:32:15,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=901494.0, ans=0.125 2023-06-21 06:32:34,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-21 06:33:15,402 INFO [train.py:996] (2/4) Epoch 5, batch 28300, loss[loss=0.1828, simple_loss=0.2613, pruned_loss=0.05215, over 21215.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3158, pruned_loss=0.09028, over 4270726.81 frames. ], batch size: 176, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:33:15,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=901674.0, ans=0.125 2023-06-21 06:33:41,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.20 vs. 
limit=15.0 2023-06-21 06:34:41,916 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.736e+02 2.744e+02 3.366e+02 4.135e+02 8.525e+02, threshold=6.731e+02, percent-clipped=3.0 2023-06-21 06:34:56,284 INFO [train.py:996] (2/4) Epoch 5, batch 28350, loss[loss=0.2093, simple_loss=0.2668, pruned_loss=0.07588, over 21844.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3104, pruned_loss=0.0836, over 4272843.86 frames. ], batch size: 98, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:35:05,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-21 06:35:12,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=901974.0, ans=0.125 2023-06-21 06:35:16,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-21 06:35:43,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=902094.0, ans=0.1 2023-06-21 06:36:40,536 INFO [train.py:996] (2/4) Epoch 5, batch 28400, loss[loss=0.2541, simple_loss=0.2975, pruned_loss=0.1054, over 21830.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3071, pruned_loss=0.08506, over 4267937.64 frames. ], batch size: 98, lr: 5.95e-03, grad_scale: 32.0 2023-06-21 06:36:43,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=902274.0, ans=0.2 2023-06-21 06:36:43,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=902274.0, ans=0.125 2023-06-21 06:36:57,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=902334.0, ans=0.1 2023-06-21 06:36:58,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=902334.0, ans=0.125 2023-06-21 06:37:36,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=902394.0, ans=0.125 2023-06-21 06:37:50,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=902454.0, ans=15.0 2023-06-21 06:37:54,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=902454.0, ans=0.2 2023-06-21 06:38:00,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=902514.0, ans=0.2 2023-06-21 06:38:03,534 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 3.069e+02 3.636e+02 4.494e+02 7.236e+02, threshold=7.272e+02, percent-clipped=3.0 2023-06-21 06:38:19,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=902574.0, ans=0.07 2023-06-21 06:38:20,765 INFO [train.py:996] (2/4) Epoch 5, batch 28450, loss[loss=0.255, simple_loss=0.3174, pruned_loss=0.09634, over 21868.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3128, pruned_loss=0.08865, over 4260326.22 frames. 
], batch size: 351, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:38:22,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=902574.0, ans=0.0 2023-06-21 06:38:23,277 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0 2023-06-21 06:38:59,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=902694.0, ans=0.125 2023-06-21 06:39:17,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=902754.0, ans=0.125 2023-06-21 06:39:59,740 INFO [train.py:996] (2/4) Epoch 5, batch 28500, loss[loss=0.262, simple_loss=0.327, pruned_loss=0.09844, over 21938.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3161, pruned_loss=0.09162, over 4272061.89 frames. ], batch size: 316, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:40:16,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=902934.0, ans=0.0 2023-06-21 06:40:53,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=902994.0, ans=0.125 2023-06-21 06:41:28,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.140e+02 2.871e+02 3.406e+02 3.870e+02 6.038e+02, threshold=6.812e+02, percent-clipped=0.0 2023-06-21 06:41:36,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=903114.0, ans=0.0 2023-06-21 06:41:41,929 INFO [train.py:996] (2/4) Epoch 5, batch 28550, loss[loss=0.2611, simple_loss=0.3508, pruned_loss=0.08572, over 21566.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3252, pruned_loss=0.09458, over 4275179.40 frames. ], batch size: 230, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:41:51,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-21 06:42:15,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=903234.0, ans=0.2 2023-06-21 06:42:30,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-06-21 06:43:24,813 INFO [train.py:996] (2/4) Epoch 5, batch 28600, loss[loss=0.3028, simple_loss=0.3707, pruned_loss=0.1175, over 21368.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3325, pruned_loss=0.0969, over 4274729.09 frames. ], batch size: 549, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:44:40,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=903654.0, ans=0.125 2023-06-21 06:44:54,244 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.310e+02 2.937e+02 3.367e+02 4.039e+02 6.744e+02, threshold=6.734e+02, percent-clipped=0.0 2023-06-21 06:45:11,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=903774.0, ans=0.04949747468305833 2023-06-21 06:45:12,153 INFO [train.py:996] (2/4) Epoch 5, batch 28650, loss[loss=0.238, simple_loss=0.2901, pruned_loss=0.09296, over 21580.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.326, pruned_loss=0.09585, over 4275394.94 frames. 
], batch size: 415, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:45:38,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-21 06:46:56,188 INFO [train.py:996] (2/4) Epoch 5, batch 28700, loss[loss=0.2598, simple_loss=0.3225, pruned_loss=0.09856, over 21658.00 frames. ], tot_loss[loss=0.2581, simple_loss=0.3239, pruned_loss=0.09617, over 4276280.15 frames. ], batch size: 263, lr: 5.95e-03, grad_scale: 16.0 2023-06-21 06:47:15,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=904134.0, ans=0.125 2023-06-21 06:48:08,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.14 vs. limit=15.0 2023-06-21 06:48:13,795 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 2.955e+02 3.205e+02 3.884e+02 6.833e+02, threshold=6.409e+02, percent-clipped=1.0 2023-06-21 06:48:37,266 INFO [train.py:996] (2/4) Epoch 5, batch 28750, loss[loss=0.2372, simple_loss=0.3247, pruned_loss=0.07489, over 21850.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3241, pruned_loss=0.0968, over 4286312.36 frames. ], batch size: 351, lr: 5.94e-03, grad_scale: 16.0 2023-06-21 06:50:17,692 INFO [train.py:996] (2/4) Epoch 5, batch 28800, loss[loss=0.274, simple_loss=0.3429, pruned_loss=0.1025, over 21758.00 frames. ], tot_loss[loss=0.262, simple_loss=0.3287, pruned_loss=0.09767, over 4283989.64 frames. ], batch size: 332, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:50:59,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=904794.0, ans=0.0 2023-06-21 06:51:13,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=904794.0, ans=0.05 2023-06-21 06:51:45,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.317e+02 2.918e+02 3.315e+02 4.122e+02 9.599e+02, threshold=6.630e+02, percent-clipped=10.0 2023-06-21 06:52:08,212 INFO [train.py:996] (2/4) Epoch 5, batch 28850, loss[loss=0.2103, simple_loss=0.2712, pruned_loss=0.0747, over 20213.00 frames. ], tot_loss[loss=0.2638, simple_loss=0.3301, pruned_loss=0.09877, over 4284219.88 frames. ], batch size: 702, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:52:33,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=905034.0, ans=0.0 2023-06-21 06:52:44,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=905094.0, ans=0.0 2023-06-21 06:52:53,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=905094.0, ans=0.0 2023-06-21 06:53:22,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=905214.0, ans=0.125 2023-06-21 06:53:40,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=905214.0, ans=0.2 2023-06-21 06:53:48,806 INFO [train.py:996] (2/4) Epoch 5, batch 28900, loss[loss=0.2483, simple_loss=0.3227, pruned_loss=0.08696, over 21437.00 frames. ], tot_loss[loss=0.2673, simple_loss=0.3331, pruned_loss=0.1007, over 4283245.72 frames. 
], batch size: 131, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:54:43,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=905454.0, ans=0.0 2023-06-21 06:55:17,786 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.237e+02 3.129e+02 3.502e+02 4.010e+02 6.253e+02, threshold=7.003e+02, percent-clipped=0.0 2023-06-21 06:55:28,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=905514.0, ans=0.125 2023-06-21 06:55:31,415 INFO [train.py:996] (2/4) Epoch 5, batch 28950, loss[loss=0.2777, simple_loss=0.3845, pruned_loss=0.08546, over 20762.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.336, pruned_loss=0.1003, over 4274511.39 frames. ], batch size: 607, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:55:41,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=905574.0, ans=0.5 2023-06-21 06:56:24,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=905694.0, ans=0.0 2023-06-21 06:57:06,907 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-21 06:57:12,957 INFO [train.py:996] (2/4) Epoch 5, batch 29000, loss[loss=0.2651, simple_loss=0.3443, pruned_loss=0.09295, over 21389.00 frames. ], tot_loss[loss=0.2688, simple_loss=0.3396, pruned_loss=0.09903, over 4269843.93 frames. ], batch size: 548, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:57:57,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=905994.0, ans=0.0 2023-06-21 06:58:39,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.974e+02 3.473e+02 4.065e+02 5.758e+02, threshold=6.947e+02, percent-clipped=0.0 2023-06-21 06:58:52,027 INFO [train.py:996] (2/4) Epoch 5, batch 29050, loss[loss=0.2583, simple_loss=0.3068, pruned_loss=0.1049, over 20235.00 frames. ], tot_loss[loss=0.2683, simple_loss=0.337, pruned_loss=0.09973, over 4278619.92 frames. ], batch size: 707, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 06:59:03,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=906174.0, ans=0.1 2023-06-21 06:59:41,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=906294.0, ans=0.1 2023-06-21 07:00:15,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=906414.0, ans=0.0 2023-06-21 07:00:19,615 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:00:32,321 INFO [train.py:996] (2/4) Epoch 5, batch 29100, loss[loss=0.2338, simple_loss=0.2921, pruned_loss=0.08772, over 21536.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3276, pruned_loss=0.09648, over 4274257.19 frames. ], batch size: 441, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:01:09,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=16.71 vs. 
limit=15.0 2023-06-21 07:01:12,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=906534.0, ans=0.0 2023-06-21 07:02:00,032 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.771e+02 3.124e+02 3.774e+02 6.095e+02, threshold=6.248e+02, percent-clipped=0.0 2023-06-21 07:02:13,089 INFO [train.py:996] (2/4) Epoch 5, batch 29150, loss[loss=0.2493, simple_loss=0.3253, pruned_loss=0.08663, over 21769.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3264, pruned_loss=0.09515, over 4271745.80 frames. ], batch size: 316, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:02:15,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=906774.0, ans=0.125 2023-06-21 07:02:19,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=906774.0, ans=0.2 2023-06-21 07:02:36,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=906834.0, ans=0.0 2023-06-21 07:03:05,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=906894.0, ans=0.125 2023-06-21 07:03:17,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=906894.0, ans=0.0 2023-06-21 07:03:34,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=906954.0, ans=0.125 2023-06-21 07:03:43,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=907014.0, ans=0.2 2023-06-21 07:03:53,193 INFO [train.py:996] (2/4) Epoch 5, batch 29200, loss[loss=0.2376, simple_loss=0.3001, pruned_loss=0.08753, over 21858.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.322, pruned_loss=0.09442, over 4263019.29 frames. ], batch size: 125, lr: 5.94e-03, grad_scale: 32.0 2023-06-21 07:04:04,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=907074.0, ans=0.125 2023-06-21 07:04:05,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=907074.0, ans=0.0 2023-06-21 07:04:44,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=907194.0, ans=0.025 2023-06-21 07:04:52,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-21 07:05:22,447 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.247e+02 2.832e+02 3.192e+02 3.760e+02 6.246e+02, threshold=6.385e+02, percent-clipped=0.0 2023-06-21 07:05:36,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=907314.0, ans=0.0 2023-06-21 07:05:40,672 INFO [train.py:996] (2/4) Epoch 5, batch 29250, loss[loss=0.2868, simple_loss=0.3606, pruned_loss=0.1065, over 21406.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3203, pruned_loss=0.09187, over 4267849.29 frames. 
], batch size: 471, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:05:59,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=907374.0, ans=0.1 2023-06-21 07:06:29,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.86 vs. limit=6.0 2023-06-21 07:06:31,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=907494.0, ans=0.125 2023-06-21 07:06:46,139 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:06:47,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=907554.0, ans=0.2 2023-06-21 07:06:47,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=907554.0, ans=0.0 2023-06-21 07:07:17,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=907614.0, ans=0.07 2023-06-21 07:07:21,318 INFO [train.py:996] (2/4) Epoch 5, batch 29300, loss[loss=0.2513, simple_loss=0.3175, pruned_loss=0.09253, over 21791.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3219, pruned_loss=0.09103, over 4268422.94 frames. ], batch size: 351, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:07:38,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=907674.0, ans=0.1 2023-06-21 07:07:42,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=12.0 2023-06-21 07:07:51,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=907734.0, ans=0.125 2023-06-21 07:08:28,717 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:08:46,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.162e+02 2.808e+02 3.408e+02 4.024e+02 6.878e+02, threshold=6.816e+02, percent-clipped=1.0 2023-06-21 07:09:05,127 INFO [train.py:996] (2/4) Epoch 5, batch 29350, loss[loss=0.2361, simple_loss=0.3014, pruned_loss=0.08543, over 20045.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.318, pruned_loss=0.08982, over 4268776.23 frames. ], batch size: 702, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:10:43,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=908214.0, ans=0.0 2023-06-21 07:10:53,140 INFO [train.py:996] (2/4) Epoch 5, batch 29400, loss[loss=0.288, simple_loss=0.37, pruned_loss=0.103, over 21679.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3182, pruned_loss=0.08774, over 4266197.94 frames. 
], batch size: 415, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:10:58,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=908274.0, ans=0.0 2023-06-21 07:11:13,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=908334.0, ans=0.0 2023-06-21 07:11:34,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=908394.0, ans=10.0 2023-06-21 07:12:20,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=908514.0, ans=0.125 2023-06-21 07:12:26,263 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.120e+02 2.910e+02 3.307e+02 3.988e+02 6.309e+02, threshold=6.614e+02, percent-clipped=0.0 2023-06-21 07:12:41,840 INFO [train.py:996] (2/4) Epoch 5, batch 29450, loss[loss=0.2401, simple_loss=0.32, pruned_loss=0.08013, over 21480.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3163, pruned_loss=0.08682, over 4264102.33 frames. ], batch size: 131, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:13:04,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=908634.0, ans=0.125 2023-06-21 07:14:07,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=908814.0, ans=0.125 2023-06-21 07:14:21,525 INFO [train.py:996] (2/4) Epoch 5, batch 29500, loss[loss=0.2139, simple_loss=0.2774, pruned_loss=0.07518, over 21586.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3205, pruned_loss=0.09047, over 4272642.18 frames. ], batch size: 212, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:15:00,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=908994.0, ans=0.0 2023-06-21 07:15:36,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=909114.0, ans=0.05 2023-06-21 07:15:45,749 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 2.991e+02 3.477e+02 4.425e+02 6.921e+02, threshold=6.954e+02, percent-clipped=2.0 2023-06-21 07:15:57,591 INFO [train.py:996] (2/4) Epoch 5, batch 29550, loss[loss=0.2437, simple_loss=0.2963, pruned_loss=0.09553, over 21285.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3198, pruned_loss=0.09238, over 4281666.07 frames. 
], batch size: 608, lr: 5.93e-03, grad_scale: 16.0 2023-06-21 07:16:07,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=909174.0, ans=0.125 2023-06-21 07:16:07,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=909174.0, ans=0.1 2023-06-21 07:16:16,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=909174.0, ans=0.015 2023-06-21 07:16:23,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=909234.0, ans=0.125 2023-06-21 07:16:28,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=909234.0, ans=0.125 2023-06-21 07:16:48,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=909294.0, ans=0.5 2023-06-21 07:17:07,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=909354.0, ans=0.0 2023-06-21 07:17:20,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=909354.0, ans=0.125 2023-06-21 07:17:32,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=909414.0, ans=0.1 2023-06-21 07:17:43,955 INFO [train.py:996] (2/4) Epoch 5, batch 29600, loss[loss=0.3647, simple_loss=0.4319, pruned_loss=0.1487, over 21557.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3267, pruned_loss=0.09425, over 4290433.89 frames. ], batch size: 508, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:17:56,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=909474.0, ans=0.5 2023-06-21 07:19:08,638 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.750e+02 3.191e+02 3.922e+02 6.335e+02, threshold=6.382e+02, percent-clipped=0.0 2023-06-21 07:19:28,029 INFO [train.py:996] (2/4) Epoch 5, batch 29650, loss[loss=0.2093, simple_loss=0.2782, pruned_loss=0.07018, over 21797.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3241, pruned_loss=0.09097, over 4283006.46 frames. ], batch size: 247, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:20:14,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=909894.0, ans=0.125 2023-06-21 07:20:18,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0 2023-06-21 07:21:09,372 INFO [train.py:996] (2/4) Epoch 5, batch 29700, loss[loss=0.2591, simple_loss=0.351, pruned_loss=0.08364, over 21435.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3263, pruned_loss=0.0914, over 4284150.86 frames. 
], batch size: 194, lr: 5.93e-03, grad_scale: 32.0 2023-06-21 07:21:14,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=910074.0, ans=0.5 2023-06-21 07:21:22,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=910074.0, ans=0.09899494936611666 2023-06-21 07:21:51,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=910194.0, ans=0.1 2023-06-21 07:21:59,705 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=22.5 2023-06-21 07:22:02,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=910254.0, ans=0.0 2023-06-21 07:22:29,349 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.944e+02 3.350e+02 4.483e+02 9.156e+02, threshold=6.700e+02, percent-clipped=7.0 2023-06-21 07:22:45,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=910314.0, ans=0.125 2023-06-21 07:22:45,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=910314.0, ans=0.0 2023-06-21 07:22:50,092 INFO [train.py:996] (2/4) Epoch 5, batch 29750, loss[loss=0.2391, simple_loss=0.3262, pruned_loss=0.07595, over 21422.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3298, pruned_loss=0.09097, over 4279743.34 frames. ], batch size: 194, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:23:08,282 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:23:18,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=910434.0, ans=0.0 2023-06-21 07:23:28,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=910494.0, ans=0.1 2023-06-21 07:23:54,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.80 vs. limit=15.0 2023-06-21 07:24:14,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=910614.0, ans=0.2 2023-06-21 07:24:31,258 INFO [train.py:996] (2/4) Epoch 5, batch 29800, loss[loss=0.2348, simple_loss=0.3125, pruned_loss=0.07857, over 21654.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3308, pruned_loss=0.09201, over 4285416.38 frames. ], batch size: 230, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:24:33,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=910674.0, ans=0.125 2023-06-21 07:25:41,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-21 07:25:56,580 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.132e+02 2.642e+02 3.044e+02 3.598e+02 6.041e+02, threshold=6.089e+02, percent-clipped=0.0 2023-06-21 07:26:11,952 INFO [train.py:996] (2/4) Epoch 5, batch 29850, loss[loss=0.2396, simple_loss=0.3128, pruned_loss=0.08323, over 21516.00 frames. 
], tot_loss[loss=0.2548, simple_loss=0.3278, pruned_loss=0.09093, over 4290919.27 frames. ], batch size: 131, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:26:12,443 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 07:26:17,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-21 07:26:19,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=910974.0, ans=0.0 2023-06-21 07:26:34,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=911034.0, ans=0.125 2023-06-21 07:26:35,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=12.0 2023-06-21 07:27:00,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=22.5 2023-06-21 07:27:53,671 INFO [train.py:996] (2/4) Epoch 5, batch 29900, loss[loss=0.2633, simple_loss=0.334, pruned_loss=0.09624, over 17254.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3261, pruned_loss=0.09203, over 4290812.80 frames. ], batch size: 60, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:27:59,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-21 07:28:30,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=911394.0, ans=0.0 2023-06-21 07:29:00,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=911454.0, ans=0.2 2023-06-21 07:29:18,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=911514.0, ans=0.1 2023-06-21 07:29:22,828 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.918e+02 3.287e+02 3.972e+02 7.501e+02, threshold=6.574e+02, percent-clipped=2.0 2023-06-21 07:29:34,469 INFO [train.py:996] (2/4) Epoch 5, batch 29950, loss[loss=0.2822, simple_loss=0.356, pruned_loss=0.1042, over 21826.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3291, pruned_loss=0.09553, over 4290623.75 frames. ], batch size: 124, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:30:14,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=911694.0, ans=0.125 2023-06-21 07:31:16,756 INFO [train.py:996] (2/4) Epoch 5, batch 30000, loss[loss=0.1996, simple_loss=0.2552, pruned_loss=0.07205, over 16432.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3308, pruned_loss=0.09564, over 4281679.76 frames. ], batch size: 61, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:31:16,757 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 07:31:33,710 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.7625, 3.0916, 3.1355, 3.0318], device='cuda:2') 2023-06-21 07:31:38,135 INFO [train.py:1028] (2/4) Epoch 5, validation: loss=0.2485, simple_loss=0.3493, pruned_loss=0.0739, over 1796401.00 frames. 
2023-06-21 07:31:38,136 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 07:31:42,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=911874.0, ans=0.125 2023-06-21 07:31:46,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.66 vs. limit=15.0 2023-06-21 07:32:22,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=911934.0, ans=0.0 2023-06-21 07:32:33,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=911994.0, ans=0.0 2023-06-21 07:33:14,711 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 3.033e+02 3.680e+02 4.795e+02 8.556e+02, threshold=7.360e+02, percent-clipped=8.0 2023-06-21 07:33:20,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=912114.0, ans=0.125 2023-06-21 07:33:36,614 INFO [train.py:996] (2/4) Epoch 5, batch 30050, loss[loss=0.2712, simple_loss=0.3598, pruned_loss=0.09132, over 21607.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3345, pruned_loss=0.09209, over 4277895.59 frames. ], batch size: 263, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:34:01,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=912234.0, ans=0.125 2023-06-21 07:34:06,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=912234.0, ans=0.125 2023-06-21 07:34:36,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=912354.0, ans=0.0 2023-06-21 07:34:46,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=912354.0, ans=0.125 2023-06-21 07:35:16,538 INFO [train.py:996] (2/4) Epoch 5, batch 30100, loss[loss=0.2347, simple_loss=0.2898, pruned_loss=0.08983, over 21553.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.3329, pruned_loss=0.09137, over 4270852.51 frames. ], batch size: 247, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:35:57,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=912594.0, ans=0.125 2023-06-21 07:36:25,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=912654.0, ans=0.1 2023-06-21 07:36:40,821 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.174e+02 3.797e+02 4.451e+02 9.370e+02, threshold=7.593e+02, percent-clipped=1.0 2023-06-21 07:37:02,705 INFO [train.py:996] (2/4) Epoch 5, batch 30150, loss[loss=0.2505, simple_loss=0.3255, pruned_loss=0.0877, over 21692.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3278, pruned_loss=0.09238, over 4264114.25 frames. ], batch size: 332, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:38:41,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=913014.0, ans=0.035 2023-06-21 07:38:45,883 INFO [train.py:996] (2/4) Epoch 5, batch 30200, loss[loss=0.2496, simple_loss=0.3336, pruned_loss=0.08282, over 21780.00 frames. 
], tot_loss[loss=0.2567, simple_loss=0.3302, pruned_loss=0.09158, over 4265693.63 frames. ], batch size: 247, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:40:17,010 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.228e+02 2.970e+02 3.553e+02 4.438e+02 6.781e+02, threshold=7.107e+02, percent-clipped=0.0 2023-06-21 07:40:28,581 INFO [train.py:996] (2/4) Epoch 5, batch 30250, loss[loss=0.3001, simple_loss=0.3867, pruned_loss=0.1067, over 21462.00 frames. ], tot_loss[loss=0.2642, simple_loss=0.3394, pruned_loss=0.09448, over 4269728.93 frames. ], batch size: 211, lr: 5.92e-03, grad_scale: 32.0 2023-06-21 07:40:37,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=913374.0, ans=0.0 2023-06-21 07:40:55,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.10 vs. limit=22.5 2023-06-21 07:41:06,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-21 07:41:20,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=913494.0, ans=0.125 2023-06-21 07:42:08,266 INFO [train.py:996] (2/4) Epoch 5, batch 30300, loss[loss=0.2195, simple_loss=0.2804, pruned_loss=0.07933, over 21350.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3359, pruned_loss=0.09372, over 4271525.92 frames. ], batch size: 177, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:42:31,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-21 07:43:07,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=913794.0, ans=0.0 2023-06-21 07:43:34,405 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 3.375e+02 4.066e+02 5.117e+02 7.478e+02, threshold=8.132e+02, percent-clipped=2.0 2023-06-21 07:43:51,194 INFO [train.py:996] (2/4) Epoch 5, batch 30350, loss[loss=0.35, simple_loss=0.4231, pruned_loss=0.1385, over 21544.00 frames. ], tot_loss[loss=0.2633, simple_loss=0.3359, pruned_loss=0.09531, over 4264550.47 frames. ], batch size: 473, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:43:53,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=913974.0, ans=0.125 2023-06-21 07:44:08,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=913974.0, ans=0.04949747468305833 2023-06-21 07:44:12,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=914034.0, ans=0.125 2023-06-21 07:44:13,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-21 07:44:19,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.09 vs. 
limit=15.0 2023-06-21 07:44:23,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=914034.0, ans=0.0 2023-06-21 07:44:26,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=914094.0, ans=0.125 2023-06-21 07:44:30,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=914094.0, ans=0.0 2023-06-21 07:44:30,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=914094.0, ans=0.125 2023-06-21 07:45:20,115 INFO [train.py:996] (2/4) Epoch 5, batch 30400, loss[loss=0.2577, simple_loss=0.3025, pruned_loss=0.1064, over 20136.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.329, pruned_loss=0.09297, over 4254480.03 frames. ], batch size: 702, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:45:26,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=914274.0, ans=0.125 2023-06-21 07:45:34,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=914334.0, ans=0.125 2023-06-21 07:45:51,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=914394.0, ans=0.2 2023-06-21 07:45:51,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=914394.0, ans=0.125 2023-06-21 07:46:17,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.00 vs. limit=15.0 2023-06-21 07:46:20,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=914454.0, ans=12.0 2023-06-21 07:46:35,438 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.883e+02 3.783e+02 4.866e+02 6.156e+02 1.756e+03, threshold=9.731e+02, percent-clipped=9.0 2023-06-21 07:46:43,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=914514.0, ans=0.0 2023-06-21 07:46:46,044 INFO [train.py:996] (2/4) Epoch 5, batch 30450, loss[loss=0.3112, simple_loss=0.4234, pruned_loss=0.09951, over 19883.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3318, pruned_loss=0.09335, over 4196896.00 frames. ], batch size: 702, lr: 5.91e-03, grad_scale: 32.0 2023-06-21 07:46:49,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=914574.0, ans=0.125 2023-06-21 07:46:52,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=914574.0, ans=0.125 2023-06-21 07:47:00,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.39 vs. limit=6.0 2023-06-21 07:47:51,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=914814.0, ans=0.1 2023-06-21 07:49:37,152 INFO [train.py:996] (2/4) Epoch 6, batch 0, loss[loss=0.2703, simple_loss=0.332, pruned_loss=0.1043, over 21735.00 frames. ], tot_loss[loss=0.2703, simple_loss=0.332, pruned_loss=0.1043, over 21735.00 frames. 
], batch size: 124, lr: 5.35e-03, grad_scale: 32.0 2023-06-21 07:49:37,154 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 07:49:52,712 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2457, simple_loss=0.3531, pruned_loss=0.06922, over 1796401.00 frames. 2023-06-21 07:49:52,712 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 07:50:05,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=914838.0, ans=0.125 2023-06-21 07:50:22,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=914898.0, ans=0.1 2023-06-21 07:50:33,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=914898.0, ans=0.09899494936611666 2023-06-21 07:50:55,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=915018.0, ans=0.125 2023-06-21 07:51:16,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=915078.0, ans=0.125 2023-06-21 07:51:27,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.195e+02 3.626e+02 5.784e+02 9.951e+02 2.861e+03, threshold=1.157e+03, percent-clipped=26.0 2023-06-21 07:51:28,601 INFO [train.py:996] (2/4) Epoch 6, batch 50, loss[loss=0.2174, simple_loss=0.3, pruned_loss=0.06736, over 21364.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.338, pruned_loss=0.0953, over 964838.42 frames. ], batch size: 194, lr: 5.35e-03, grad_scale: 32.0 2023-06-21 07:51:48,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-21 07:51:49,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=915198.0, ans=0.125 2023-06-21 07:52:03,034 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=15.0 2023-06-21 07:52:21,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-21 07:52:55,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=915378.0, ans=0.125 2023-06-21 07:53:05,799 INFO [train.py:996] (2/4) Epoch 6, batch 100, loss[loss=0.26, simple_loss=0.3527, pruned_loss=0.08368, over 21748.00 frames. ], tot_loss[loss=0.2722, simple_loss=0.3501, pruned_loss=0.09713, over 1688946.96 frames. ], batch size: 332, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 07:53:06,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=915438.0, ans=0.1 2023-06-21 07:53:55,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.03 vs. limit=10.0 2023-06-21 07:54:18,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. 
limit=15.0 2023-06-21 07:54:41,355 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.061e+02 2.736e+02 3.116e+02 3.564e+02 7.052e+02, threshold=6.231e+02, percent-clipped=0.0 2023-06-21 07:54:42,890 INFO [train.py:996] (2/4) Epoch 6, batch 150, loss[loss=0.2739, simple_loss=0.3655, pruned_loss=0.09118, over 21798.00 frames. ], tot_loss[loss=0.2697, simple_loss=0.3494, pruned_loss=0.09504, over 2256278.63 frames. ], batch size: 332, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 07:55:56,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-21 07:56:09,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-21 07:56:22,169 INFO [train.py:996] (2/4) Epoch 6, batch 200, loss[loss=0.3122, simple_loss=0.3765, pruned_loss=0.1239, over 21798.00 frames. ], tot_loss[loss=0.2698, simple_loss=0.3479, pruned_loss=0.09585, over 2702145.21 frames. ], batch size: 441, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 07:56:38,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=916038.0, ans=0.125 2023-06-21 07:56:39,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=916038.0, ans=0.125 2023-06-21 07:57:52,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=916278.0, ans=0.0 2023-06-21 07:58:01,355 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 3.028e+02 3.538e+02 4.112e+02 1.174e+03, threshold=7.076e+02, percent-clipped=8.0 2023-06-21 07:58:01,385 INFO [train.py:996] (2/4) Epoch 6, batch 250, loss[loss=0.2769, simple_loss=0.3424, pruned_loss=0.1056, over 21515.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3425, pruned_loss=0.09314, over 3046810.57 frames. ], batch size: 131, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 07:58:40,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=916398.0, ans=0.1 2023-06-21 07:59:02,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=916518.0, ans=0.0 2023-06-21 07:59:16,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=916518.0, ans=0.0 2023-06-21 07:59:39,977 INFO [train.py:996] (2/4) Epoch 6, batch 300, loss[loss=0.238, simple_loss=0.3098, pruned_loss=0.08311, over 21600.00 frames. ], tot_loss[loss=0.2606, simple_loss=0.3368, pruned_loss=0.09218, over 3315569.69 frames. 
], batch size: 263, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 07:59:42,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=916638.0, ans=0.125 2023-06-21 07:59:59,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=916698.0, ans=0.1 2023-06-21 08:00:01,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=916698.0, ans=0.025 2023-06-21 08:00:03,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.99 vs. limit=15.0 2023-06-21 08:00:19,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=916698.0, ans=0.0 2023-06-21 08:00:30,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=916758.0, ans=0.2 2023-06-21 08:01:19,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=916938.0, ans=0.125 2023-06-21 08:01:19,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=916938.0, ans=0.0 2023-06-21 08:01:20,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 3.010e+02 3.563e+02 4.495e+02 6.815e+02, threshold=7.126e+02, percent-clipped=0.0 2023-06-21 08:01:20,542 INFO [train.py:996] (2/4) Epoch 6, batch 350, loss[loss=0.2193, simple_loss=0.2807, pruned_loss=0.07893, over 21833.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3309, pruned_loss=0.09206, over 3524855.47 frames. ], batch size: 352, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:01:37,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-21 08:01:39,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=916998.0, ans=0.2 2023-06-21 08:02:07,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=917058.0, ans=0.0 2023-06-21 08:02:54,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=917178.0, ans=0.0 2023-06-21 08:02:58,479 INFO [train.py:996] (2/4) Epoch 6, batch 400, loss[loss=0.2268, simple_loss=0.2866, pruned_loss=0.08346, over 21635.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3245, pruned_loss=0.09058, over 3687412.62 frames. ], batch size: 298, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:03:14,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=917238.0, ans=0.125 2023-06-21 08:04:36,496 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.827e+02 3.421e+02 4.074e+02 6.754e+02, threshold=6.843e+02, percent-clipped=0.0 2023-06-21 08:04:36,524 INFO [train.py:996] (2/4) Epoch 6, batch 450, loss[loss=0.2275, simple_loss=0.2886, pruned_loss=0.08326, over 21673.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3208, pruned_loss=0.08933, over 3823011.46 frames. 
], batch size: 417, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:05:10,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=917598.0, ans=0.125 2023-06-21 08:05:26,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=917658.0, ans=0.0 2023-06-21 08:06:09,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=917778.0, ans=0.125 2023-06-21 08:06:18,062 INFO [train.py:996] (2/4) Epoch 6, batch 500, loss[loss=0.2454, simple_loss=0.3637, pruned_loss=0.06352, over 19816.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3225, pruned_loss=0.08786, over 3929782.34 frames. ], batch size: 703, lr: 5.34e-03, grad_scale: 32.0 2023-06-21 08:06:38,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.44 vs. limit=10.0 2023-06-21 08:07:26,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=918018.0, ans=0.0 2023-06-21 08:07:42,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=918078.0, ans=0.125 2023-06-21 08:07:51,199 INFO [train.py:996] (2/4) Epoch 6, batch 550, loss[loss=0.3413, simple_loss=0.4308, pruned_loss=0.1259, over 21542.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3285, pruned_loss=0.08893, over 4005236.73 frames. ], batch size: 471, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:07:57,434 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.995e+02 3.563e+02 4.699e+02 8.861e+02, threshold=7.125e+02, percent-clipped=10.0 2023-06-21 08:08:35,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=918258.0, ans=0.5 2023-06-21 08:08:42,326 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-21 08:09:31,312 INFO [train.py:996] (2/4) Epoch 6, batch 600, loss[loss=0.2472, simple_loss=0.2944, pruned_loss=0.1, over 21997.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3301, pruned_loss=0.08911, over 4070114.11 frames. ], batch size: 103, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:09:52,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=918438.0, ans=0.125 2023-06-21 08:09:59,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=918498.0, ans=0.2 2023-06-21 08:10:00,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=918498.0, ans=0.125 2023-06-21 08:10:06,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.20 vs. limit=15.0 2023-06-21 08:10:36,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. 
limit=15.0 2023-06-21 08:10:45,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=918618.0, ans=0.1 2023-06-21 08:10:50,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=918678.0, ans=0.125 2023-06-21 08:11:07,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=918678.0, ans=0.0 2023-06-21 08:11:09,683 INFO [train.py:996] (2/4) Epoch 6, batch 650, loss[loss=0.2564, simple_loss=0.3077, pruned_loss=0.1026, over 21854.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3313, pruned_loss=0.09011, over 4124592.28 frames. ], batch size: 107, lr: 5.34e-03, grad_scale: 16.0 2023-06-21 08:11:11,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.996e+02 2.881e+02 3.396e+02 3.907e+02 7.469e+02, threshold=6.792e+02, percent-clipped=1.0 2023-06-21 08:11:32,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=918738.0, ans=0.125 2023-06-21 08:11:40,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=918798.0, ans=0.1 2023-06-21 08:12:23,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=918918.0, ans=0.2 2023-06-21 08:12:32,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-21 08:12:42,323 INFO [train.py:996] (2/4) Epoch 6, batch 700, loss[loss=0.2533, simple_loss=0.3294, pruned_loss=0.08863, over 21797.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3302, pruned_loss=0.09016, over 4162035.06 frames. ], batch size: 107, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:13:23,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=15.0 2023-06-21 08:13:24,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=919098.0, ans=0.1 2023-06-21 08:13:27,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=919158.0, ans=0.2 2023-06-21 08:13:36,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=919158.0, ans=0.2 2023-06-21 08:13:46,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=919158.0, ans=0.125 2023-06-21 08:14:14,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=919278.0, ans=0.0 2023-06-21 08:14:20,559 INFO [train.py:996] (2/4) Epoch 6, batch 750, loss[loss=0.2316, simple_loss=0.2913, pruned_loss=0.08592, over 21985.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3326, pruned_loss=0.09105, over 4191219.85 frames. 
], batch size: 103, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:14:26,740 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 3.276e+02 4.088e+02 4.962e+02 1.159e+03, threshold=8.176e+02, percent-clipped=5.0 2023-06-21 08:15:17,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=919458.0, ans=0.1 2023-06-21 08:15:58,065 INFO [train.py:996] (2/4) Epoch 6, batch 800, loss[loss=0.2673, simple_loss=0.3239, pruned_loss=0.1054, over 21682.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.328, pruned_loss=0.09161, over 4210684.00 frames. ], batch size: 263, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:16:11,514 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-21 08:16:31,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=919698.0, ans=0.0 2023-06-21 08:16:58,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=919818.0, ans=0.125 2023-06-21 08:17:19,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=919818.0, ans=0.2 2023-06-21 08:17:37,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=919938.0, ans=0.125 2023-06-21 08:17:38,701 INFO [train.py:996] (2/4) Epoch 6, batch 850, loss[loss=0.2702, simple_loss=0.3941, pruned_loss=0.07316, over 19723.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3257, pruned_loss=0.09174, over 4230507.18 frames. ], batch size: 703, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:17:40,201 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 2.947e+02 3.491e+02 3.933e+02 7.622e+02, threshold=6.983e+02, percent-clipped=0.0 2023-06-21 08:18:02,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=919938.0, ans=0.2 2023-06-21 08:18:57,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-21 08:18:59,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.16 vs. limit=10.0 2023-06-21 08:19:21,808 INFO [train.py:996] (2/4) Epoch 6, batch 900, loss[loss=0.2413, simple_loss=0.2994, pruned_loss=0.09158, over 21471.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3227, pruned_loss=0.09091, over 4248176.28 frames. ], batch size: 194, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:20:23,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=920418.0, ans=0.0 2023-06-21 08:20:28,612 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-21 08:21:05,304 INFO [train.py:996] (2/4) Epoch 6, batch 950, loss[loss=0.2673, simple_loss=0.3336, pruned_loss=0.1005, over 21877.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3195, pruned_loss=0.09032, over 4260853.64 frames. 
], batch size: 107, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:21:06,940 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.955e+02 2.884e+02 3.289e+02 4.152e+02 6.570e+02, threshold=6.579e+02, percent-clipped=0.0 2023-06-21 08:21:34,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=920598.0, ans=0.125 2023-06-21 08:22:13,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=920778.0, ans=0.125 2023-06-21 08:22:39,436 INFO [train.py:996] (2/4) Epoch 6, batch 1000, loss[loss=0.267, simple_loss=0.3576, pruned_loss=0.08816, over 21628.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3179, pruned_loss=0.09002, over 4270078.92 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:22:56,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=15.0 2023-06-21 08:23:04,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=920898.0, ans=0.015 2023-06-21 08:23:31,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-21 08:23:43,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=921018.0, ans=0.04949747468305833 2023-06-21 08:24:08,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-21 08:24:13,755 INFO [train.py:996] (2/4) Epoch 6, batch 1050, loss[loss=0.2145, simple_loss=0.3005, pruned_loss=0.06426, over 21742.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3213, pruned_loss=0.09143, over 4279929.74 frames. ], batch size: 247, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:24:15,065 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-21 08:24:15,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.022e+02 3.396e+02 3.710e+02 5.985e+02, threshold=6.792e+02, percent-clipped=0.0 2023-06-21 08:25:13,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=921318.0, ans=0.125 2023-06-21 08:25:13,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=921318.0, ans=0.125 2023-06-21 08:25:38,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=921378.0, ans=0.0 2023-06-21 08:25:48,846 INFO [train.py:996] (2/4) Epoch 6, batch 1100, loss[loss=0.2378, simple_loss=0.3229, pruned_loss=0.07634, over 21531.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3226, pruned_loss=0.09081, over 4280617.68 frames. ], batch size: 471, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:27:18,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=921678.0, ans=0.125 2023-06-21 08:27:25,430 INFO [train.py:996] (2/4) Epoch 6, batch 1150, loss[loss=0.2609, simple_loss=0.322, pruned_loss=0.09988, over 21301.00 frames. 
], tot_loss[loss=0.2495, simple_loss=0.3208, pruned_loss=0.08906, over 4285331.17 frames. ], batch size: 143, lr: 5.33e-03, grad_scale: 16.0 2023-06-21 08:27:28,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.136e+02 3.809e+02 5.209e+02 8.344e+02, threshold=7.619e+02, percent-clipped=5.0 2023-06-21 08:27:45,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=921798.0, ans=0.1 2023-06-21 08:28:57,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-21 08:29:05,586 INFO [train.py:996] (2/4) Epoch 6, batch 1200, loss[loss=0.2652, simple_loss=0.3616, pruned_loss=0.08443, over 21649.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.321, pruned_loss=0.08895, over 4280157.64 frames. ], batch size: 414, lr: 5.33e-03, grad_scale: 32.0 2023-06-21 08:29:11,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=922038.0, ans=0.125 2023-06-21 08:29:11,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=922038.0, ans=0.0 2023-06-21 08:29:36,812 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:30:43,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=922338.0, ans=0.1 2023-06-21 08:30:44,983 INFO [train.py:996] (2/4) Epoch 6, batch 1250, loss[loss=0.261, simple_loss=0.3434, pruned_loss=0.0893, over 21677.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3233, pruned_loss=0.09094, over 4287135.53 frames. ], batch size: 389, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:30:47,958 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.807e+02 3.082e+02 3.703e+02 6.160e+02, threshold=6.164e+02, percent-clipped=0.0 2023-06-21 08:31:01,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=922338.0, ans=0.125 2023-06-21 08:31:20,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-21 08:32:04,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.43 vs. limit=15.0 2023-06-21 08:32:21,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=922578.0, ans=0.125 2023-06-21 08:32:25,638 INFO [train.py:996] (2/4) Epoch 6, batch 1300, loss[loss=0.2652, simple_loss=0.3265, pruned_loss=0.1019, over 21359.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3247, pruned_loss=0.09179, over 4293375.22 frames. 
], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:32:50,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=922698.0, ans=0.125 2023-06-21 08:33:25,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=922758.0, ans=0.2 2023-06-21 08:33:57,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=922878.0, ans=0.125 2023-06-21 08:34:12,142 INFO [train.py:996] (2/4) Epoch 6, batch 1350, loss[loss=0.2649, simple_loss=0.3131, pruned_loss=0.1084, over 21327.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.325, pruned_loss=0.09223, over 4291585.93 frames. ], batch size: 159, lr: 5.32e-03, grad_scale: 32.0 2023-06-21 08:34:15,482 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 2.950e+02 3.402e+02 4.327e+02 7.422e+02, threshold=6.804e+02, percent-clipped=3.0 2023-06-21 08:34:33,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=922998.0, ans=0.125 2023-06-21 08:34:56,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.13 vs. limit=6.0 2023-06-21 08:35:10,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=923058.0, ans=0.125 2023-06-21 08:35:13,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-21 08:35:50,727 INFO [train.py:996] (2/4) Epoch 6, batch 1400, loss[loss=0.2778, simple_loss=0.3351, pruned_loss=0.1102, over 21315.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3239, pruned_loss=0.09207, over 4293678.24 frames. ], batch size: 159, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:35:57,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=923238.0, ans=0.125 2023-06-21 08:37:02,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-21 08:37:31,197 INFO [train.py:996] (2/4) Epoch 6, batch 1450, loss[loss=0.2491, simple_loss=0.3222, pruned_loss=0.088, over 21384.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3251, pruned_loss=0.09271, over 4290911.07 frames. ], batch size: 549, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:37:37,342 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.855e+02 3.384e+02 3.937e+02 6.877e+02, threshold=6.768e+02, percent-clipped=1.0 2023-06-21 08:37:37,981 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:38:09,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=923658.0, ans=0.125 2023-06-21 08:39:11,670 INFO [train.py:996] (2/4) Epoch 6, batch 1500, loss[loss=0.2816, simple_loss=0.3375, pruned_loss=0.1129, over 21615.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3281, pruned_loss=0.09483, over 4295979.45 frames. 
], batch size: 548, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:39:15,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=923838.0, ans=0.125 2023-06-21 08:39:32,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=923898.0, ans=0.95 2023-06-21 08:39:43,486 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:39:54,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=923958.0, ans=0.125 2023-06-21 08:40:38,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=924078.0, ans=0.0 2023-06-21 08:40:53,834 INFO [train.py:996] (2/4) Epoch 6, batch 1550, loss[loss=0.2921, simple_loss=0.3423, pruned_loss=0.121, over 21866.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3239, pruned_loss=0.09194, over 4301294.13 frames. ], batch size: 351, lr: 5.32e-03, grad_scale: 8.0 2023-06-21 08:41:00,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.183e+02 2.808e+02 3.171e+02 3.740e+02 6.860e+02, threshold=6.342e+02, percent-clipped=1.0 2023-06-21 08:41:01,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=924138.0, ans=0.05 2023-06-21 08:41:05,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=924138.0, ans=0.125 2023-06-21 08:42:36,168 INFO [train.py:996] (2/4) Epoch 6, batch 1600, loss[loss=0.2572, simple_loss=0.3325, pruned_loss=0.09096, over 20039.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3211, pruned_loss=0.09098, over 4299322.92 frames. ], batch size: 702, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:42:58,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=924498.0, ans=0.125 2023-06-21 08:43:33,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=924558.0, ans=0.025 2023-06-21 08:43:54,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=924618.0, ans=0.125 2023-06-21 08:44:01,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=924678.0, ans=0.125 2023-06-21 08:44:25,206 INFO [train.py:996] (2/4) Epoch 6, batch 1650, loss[loss=0.2839, simple_loss=0.3416, pruned_loss=0.1131, over 21607.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3212, pruned_loss=0.09133, over 4287659.94 frames. 
], batch size: 471, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:44:31,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 3.183e+02 3.962e+02 4.475e+02 7.912e+02, threshold=7.925e+02, percent-clipped=6.0 2023-06-21 08:45:40,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=924918.0, ans=0.2 2023-06-21 08:45:42,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=924918.0, ans=0.125 2023-06-21 08:45:44,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=924978.0, ans=0.0 2023-06-21 08:45:51,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=924978.0, ans=0.125 2023-06-21 08:45:51,364 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 08:45:59,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=924978.0, ans=0.125 2023-06-21 08:46:07,227 INFO [train.py:996] (2/4) Epoch 6, batch 1700, loss[loss=0.2211, simple_loss=0.3085, pruned_loss=0.06688, over 21647.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3229, pruned_loss=0.09184, over 4286813.99 frames. ], batch size: 263, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:46:35,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=925098.0, ans=0.125 2023-06-21 08:46:43,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=925098.0, ans=0.2 2023-06-21 08:47:02,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.20 vs. limit=12.0 2023-06-21 08:47:12,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-21 08:47:21,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=925218.0, ans=0.1 2023-06-21 08:47:40,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=925278.0, ans=0.125 2023-06-21 08:47:44,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-21 08:47:54,715 INFO [train.py:996] (2/4) Epoch 6, batch 1750, loss[loss=0.1615, simple_loss=0.2301, pruned_loss=0.04642, over 21377.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3232, pruned_loss=0.08963, over 4288823.31 frames. 
], batch size: 131, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:47:55,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=925338.0, ans=0.125 2023-06-21 08:48:05,876 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.100e+02 3.133e+02 3.705e+02 4.363e+02 7.096e+02, threshold=7.410e+02, percent-clipped=0.0 2023-06-21 08:48:06,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=925338.0, ans=0.125 2023-06-21 08:48:11,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=925338.0, ans=0.0 2023-06-21 08:48:38,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=925458.0, ans=0.2 2023-06-21 08:48:40,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=925458.0, ans=0.125 2023-06-21 08:48:50,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=925458.0, ans=0.125 2023-06-21 08:49:05,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=925518.0, ans=0.0 2023-06-21 08:49:05,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=925518.0, ans=0.125 2023-06-21 08:49:07,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=925518.0, ans=0.125 2023-06-21 08:49:43,009 INFO [train.py:996] (2/4) Epoch 6, batch 1800, loss[loss=0.2356, simple_loss=0.3131, pruned_loss=0.07899, over 21737.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3211, pruned_loss=0.08661, over 4292322.62 frames. ], batch size: 298, lr: 5.32e-03, grad_scale: 16.0 2023-06-21 08:49:45,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=925638.0, ans=0.125 2023-06-21 08:50:05,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-21 08:50:17,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=925698.0, ans=0.125 2023-06-21 08:50:19,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=925758.0, ans=0.125 2023-06-21 08:50:21,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=925758.0, ans=0.125 2023-06-21 08:50:24,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=925758.0, ans=0.2 2023-06-21 08:50:48,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.16 vs. limit=12.0 2023-06-21 08:51:13,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-21 08:51:23,707 INFO [train.py:996] (2/4) Epoch 6, batch 1850, loss[loss=0.2309, simple_loss=0.3244, pruned_loss=0.06875, over 21753.00 frames. 
], tot_loss[loss=0.2474, simple_loss=0.3235, pruned_loss=0.0856, over 4293696.66 frames. ], batch size: 351, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:51:30,094 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.862e+02 3.405e+02 4.274e+02 8.543e+02, threshold=6.809e+02, percent-clipped=2.0 2023-06-21 08:51:59,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-21 08:52:00,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=926058.0, ans=0.0 2023-06-21 08:52:54,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=926178.0, ans=0.09899494936611666 2023-06-21 08:53:01,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=926178.0, ans=0.2 2023-06-21 08:53:04,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=926238.0, ans=0.125 2023-06-21 08:53:05,037 INFO [train.py:996] (2/4) Epoch 6, batch 1900, loss[loss=0.2566, simple_loss=0.328, pruned_loss=0.0926, over 21193.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3216, pruned_loss=0.08539, over 4293664.90 frames. ], batch size: 548, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:53:10,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=926238.0, ans=0.125 2023-06-21 08:53:56,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=926358.0, ans=0.125 2023-06-21 08:54:23,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.11 vs. limit=12.0 2023-06-21 08:54:48,428 INFO [train.py:996] (2/4) Epoch 6, batch 1950, loss[loss=0.2299, simple_loss=0.3296, pruned_loss=0.06515, over 21698.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3187, pruned_loss=0.08437, over 4291527.35 frames. ], batch size: 298, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 08:54:54,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-21 08:54:55,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.168e+02 2.902e+02 3.428e+02 4.161e+02 7.529e+02, threshold=6.855e+02, percent-clipped=4.0 2023-06-21 08:55:08,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=926598.0, ans=0.1 2023-06-21 08:55:44,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=926658.0, ans=0.2 2023-06-21 08:56:18,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=926778.0, ans=0.0 2023-06-21 08:56:27,464 INFO [train.py:996] (2/4) Epoch 6, batch 2000, loss[loss=0.2034, simple_loss=0.2761, pruned_loss=0.06537, over 21294.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.315, pruned_loss=0.08239, over 4293619.36 frames. 
], batch size: 159, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 08:57:36,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=927018.0, ans=0.125 2023-06-21 08:58:08,468 INFO [train.py:996] (2/4) Epoch 6, batch 2050, loss[loss=0.1963, simple_loss=0.266, pruned_loss=0.06328, over 21530.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3146, pruned_loss=0.0824, over 4292253.80 frames. ], batch size: 195, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 08:58:19,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 3.151e+02 3.657e+02 4.300e+02 8.922e+02, threshold=7.314e+02, percent-clipped=4.0 2023-06-21 08:58:21,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=15.0 2023-06-21 08:59:35,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=927378.0, ans=0.125 2023-06-21 08:59:37,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=927378.0, ans=0.125 2023-06-21 08:59:49,733 INFO [train.py:996] (2/4) Epoch 6, batch 2100, loss[loss=0.2275, simple_loss=0.3063, pruned_loss=0.07431, over 21657.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3173, pruned_loss=0.08487, over 4289683.09 frames. ], batch size: 263, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:00:18,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=927498.0, ans=0.0 2023-06-21 09:00:44,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=927558.0, ans=0.125 2023-06-21 09:01:09,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=927618.0, ans=0.125 2023-06-21 09:01:15,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=927678.0, ans=0.125 2023-06-21 09:01:17,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=927678.0, ans=0.1 2023-06-21 09:01:30,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=927738.0, ans=0.0 2023-06-21 09:01:31,398 INFO [train.py:996] (2/4) Epoch 6, batch 2150, loss[loss=0.2549, simple_loss=0.3326, pruned_loss=0.08865, over 21374.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3185, pruned_loss=0.08675, over 4287645.47 frames. ], batch size: 211, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:01:43,417 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.848e+02 3.301e+02 4.038e+02 6.672e+02, threshold=6.603e+02, percent-clipped=0.0 2023-06-21 09:02:23,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=927858.0, ans=0.1 2023-06-21 09:02:26,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=927858.0, ans=0.0 2023-06-21 09:03:13,679 INFO [train.py:996] (2/4) Epoch 6, batch 2200, loss[loss=0.2525, simple_loss=0.3059, pruned_loss=0.09956, over 21291.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3199, pruned_loss=0.08755, over 4281163.79 frames. 
], batch size: 608, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:03:15,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=928038.0, ans=0.125 2023-06-21 09:03:28,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=928038.0, ans=0.125 2023-06-21 09:04:26,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=928218.0, ans=0.0 2023-06-21 09:04:44,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-21 09:04:53,504 INFO [train.py:996] (2/4) Epoch 6, batch 2250, loss[loss=0.2145, simple_loss=0.2702, pruned_loss=0.07938, over 21764.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3175, pruned_loss=0.08604, over 4277261.15 frames. ], batch size: 112, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:05:04,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=928338.0, ans=0.125 2023-06-21 09:05:06,972 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.750e+02 3.169e+02 3.694e+02 5.600e+02, threshold=6.338e+02, percent-clipped=0.0 2023-06-21 09:05:14,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=928398.0, ans=0.0 2023-06-21 09:05:22,287 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:05:49,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.51 vs. limit=6.0 2023-06-21 09:06:12,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-21 09:06:35,971 INFO [train.py:996] (2/4) Epoch 6, batch 2300, loss[loss=0.2551, simple_loss=0.3133, pruned_loss=0.09845, over 22019.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3146, pruned_loss=0.08609, over 4279861.38 frames. ], batch size: 103, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:08:17,139 INFO [train.py:996] (2/4) Epoch 6, batch 2350, loss[loss=0.2266, simple_loss=0.311, pruned_loss=0.07104, over 20692.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3127, pruned_loss=0.08662, over 4264802.64 frames. ], batch size: 607, lr: 5.31e-03, grad_scale: 16.0 2023-06-21 09:08:24,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=928938.0, ans=0.0 2023-06-21 09:08:25,730 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.345e+02 4.237e+02 6.014e+02 1.096e+03, threshold=8.474e+02, percent-clipped=18.0 2023-06-21 09:08:26,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.89 vs. limit=15.0 2023-06-21 09:08:30,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=928938.0, ans=0.125 2023-06-21 09:09:55,375 INFO [train.py:996] (2/4) Epoch 6, batch 2400, loss[loss=0.2265, simple_loss=0.2811, pruned_loss=0.08591, over 21559.00 frames. 
], tot_loss[loss=0.2473, simple_loss=0.3168, pruned_loss=0.08887, over 4260615.02 frames. ], batch size: 263, lr: 5.31e-03, grad_scale: 32.0 2023-06-21 09:10:59,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=929418.0, ans=0.125 2023-06-21 09:11:33,355 INFO [train.py:996] (2/4) Epoch 6, batch 2450, loss[loss=0.2724, simple_loss=0.3446, pruned_loss=0.1001, over 21765.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3216, pruned_loss=0.09125, over 4260509.67 frames. ], batch size: 113, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:11:41,394 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 3.127e+02 3.688e+02 4.498e+02 8.076e+02, threshold=7.375e+02, percent-clipped=0.0 2023-06-21 09:12:25,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=929658.0, ans=0.015 2023-06-21 09:12:37,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=929718.0, ans=0.0 2023-06-21 09:12:39,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=929718.0, ans=0.2 2023-06-21 09:12:54,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=929778.0, ans=0.015 2023-06-21 09:13:09,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=929778.0, ans=0.125 2023-06-21 09:13:13,626 INFO [train.py:996] (2/4) Epoch 6, batch 2500, loss[loss=0.2438, simple_loss=0.3184, pruned_loss=0.0846, over 21550.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3223, pruned_loss=0.09257, over 4266757.97 frames. ], batch size: 414, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:13:52,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=929958.0, ans=0.125 2023-06-21 09:14:49,973 INFO [train.py:996] (2/4) Epoch 6, batch 2550, loss[loss=0.2086, simple_loss=0.2901, pruned_loss=0.06357, over 21393.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3192, pruned_loss=0.09083, over 4271642.92 frames. ], batch size: 131, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:14:58,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.038e+02 2.790e+02 3.218e+02 3.631e+02 5.360e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 09:16:27,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. limit=10.0 2023-06-21 09:16:31,245 INFO [train.py:996] (2/4) Epoch 6, batch 2600, loss[loss=0.2049, simple_loss=0.2748, pruned_loss=0.06755, over 21590.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3187, pruned_loss=0.0903, over 4263372.78 frames. 
], batch size: 247, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:16:52,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=930498.0, ans=0.0 2023-06-21 09:16:53,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=930498.0, ans=0.2 2023-06-21 09:17:00,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=930498.0, ans=0.125 2023-06-21 09:17:40,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=930618.0, ans=0.0 2023-06-21 09:18:09,181 INFO [train.py:996] (2/4) Epoch 6, batch 2650, loss[loss=0.2534, simple_loss=0.3284, pruned_loss=0.0892, over 21473.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.32, pruned_loss=0.09043, over 4264170.87 frames. ], batch size: 131, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:18:16,907 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.039e+02 3.537e+02 4.396e+02 7.352e+02, threshold=7.074e+02, percent-clipped=6.0 2023-06-21 09:19:15,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=930918.0, ans=0.0 2023-06-21 09:19:22,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-21 09:19:50,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=930978.0, ans=0.125 2023-06-21 09:19:52,602 INFO [train.py:996] (2/4) Epoch 6, batch 2700, loss[loss=0.3017, simple_loss=0.3661, pruned_loss=0.1186, over 21555.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3193, pruned_loss=0.09032, over 4255886.27 frames. ], batch size: 471, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:21:27,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.47 vs. limit=10.0 2023-06-21 09:21:34,632 INFO [train.py:996] (2/4) Epoch 6, batch 2750, loss[loss=0.2814, simple_loss=0.356, pruned_loss=0.1034, over 21693.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3182, pruned_loss=0.09014, over 4253674.59 frames. ], batch size: 441, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:21:36,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=931338.0, ans=0.0 2023-06-21 09:21:42,925 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.421e+02 2.939e+02 3.495e+02 4.251e+02 6.748e+02, threshold=6.989e+02, percent-clipped=0.0 2023-06-21 09:22:21,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=931458.0, ans=0.0 2023-06-21 09:23:20,466 INFO [train.py:996] (2/4) Epoch 6, batch 2800, loss[loss=0.2699, simple_loss=0.3538, pruned_loss=0.09297, over 21768.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3226, pruned_loss=0.09099, over 4258570.18 frames. 
], batch size: 298, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:23:50,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=931698.0, ans=0.0 2023-06-21 09:24:07,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=931758.0, ans=0.2 2023-06-21 09:24:13,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=931758.0, ans=0.1 2023-06-21 09:25:00,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=931878.0, ans=0.0 2023-06-21 09:25:03,047 INFO [train.py:996] (2/4) Epoch 6, batch 2850, loss[loss=0.2441, simple_loss=0.2978, pruned_loss=0.09519, over 21273.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3258, pruned_loss=0.09223, over 4257014.24 frames. ], batch size: 549, lr: 5.30e-03, grad_scale: 32.0 2023-06-21 09:25:23,454 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.345e+02 3.230e+02 3.892e+02 4.894e+02 8.283e+02, threshold=7.785e+02, percent-clipped=6.0 2023-06-21 09:26:36,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=932178.0, ans=0.125 2023-06-21 09:26:45,931 INFO [train.py:996] (2/4) Epoch 6, batch 2900, loss[loss=0.2693, simple_loss=0.3296, pruned_loss=0.1045, over 22056.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3224, pruned_loss=0.09159, over 4266738.11 frames. ], batch size: 119, lr: 5.30e-03, grad_scale: 16.0 2023-06-21 09:26:46,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=932238.0, ans=0.125 2023-06-21 09:27:20,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.60 vs. limit=22.5 2023-06-21 09:28:16,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.73 vs. limit=22.5 2023-06-21 09:28:17,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=932478.0, ans=0.0 2023-06-21 09:28:28,480 INFO [train.py:996] (2/4) Epoch 6, batch 2950, loss[loss=0.2837, simple_loss=0.3649, pruned_loss=0.1013, over 21656.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3227, pruned_loss=0.09166, over 4271521.27 frames. ], batch size: 263, lr: 5.30e-03, grad_scale: 16.0 2023-06-21 09:28:42,916 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.364e+02 2.961e+02 3.319e+02 4.000e+02 7.696e+02, threshold=6.638e+02, percent-clipped=0.0 2023-06-21 09:29:14,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=932658.0, ans=0.125 2023-06-21 09:29:42,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=932718.0, ans=0.125 2023-06-21 09:30:14,728 INFO [train.py:996] (2/4) Epoch 6, batch 3000, loss[loss=0.2958, simple_loss=0.3727, pruned_loss=0.1095, over 21468.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.327, pruned_loss=0.09215, over 4275700.05 frames. 
], batch size: 131, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:30:14,728 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 09:30:34,699 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.255, simple_loss=0.3481, pruned_loss=0.08099, over 1796401.00 frames. 2023-06-21 09:30:34,699 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 09:30:42,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=932838.0, ans=0.1 2023-06-21 09:30:44,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.62 vs. limit=15.0 2023-06-21 09:30:53,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=932898.0, ans=0.125 2023-06-21 09:30:58,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=932898.0, ans=0.125 2023-06-21 09:31:00,179 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:32:16,776 INFO [train.py:996] (2/4) Epoch 6, batch 3050, loss[loss=0.1881, simple_loss=0.2611, pruned_loss=0.05757, over 21455.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3276, pruned_loss=0.0912, over 4280367.64 frames. ], batch size: 194, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:32:26,225 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 2.903e+02 3.413e+02 4.363e+02 7.333e+02, threshold=6.826e+02, percent-clipped=2.0 2023-06-21 09:32:58,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=933258.0, ans=0.04949747468305833 2023-06-21 09:33:51,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=933378.0, ans=0.125 2023-06-21 09:33:59,324 INFO [train.py:996] (2/4) Epoch 6, batch 3100, loss[loss=0.2176, simple_loss=0.311, pruned_loss=0.06214, over 21604.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.328, pruned_loss=0.09043, over 4284701.29 frames. ], batch size: 230, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:34:12,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=933438.0, ans=0.125 2023-06-21 09:34:36,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=933558.0, ans=0.0 2023-06-21 09:34:44,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=933558.0, ans=0.125 2023-06-21 09:35:10,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=933618.0, ans=0.0 2023-06-21 09:35:40,900 INFO [train.py:996] (2/4) Epoch 6, batch 3150, loss[loss=0.2476, simple_loss=0.3194, pruned_loss=0.08793, over 21594.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3285, pruned_loss=0.09013, over 4279294.21 frames. 
], batch size: 230, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:35:55,936 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 3.015e+02 3.533e+02 4.107e+02 6.510e+02, threshold=7.067e+02, percent-clipped=0.0 2023-06-21 09:36:41,887 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-21 09:36:58,885 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:37:22,854 INFO [train.py:996] (2/4) Epoch 6, batch 3200, loss[loss=0.2682, simple_loss=0.3524, pruned_loss=0.09202, over 21756.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3305, pruned_loss=0.09096, over 4274732.39 frames. ], batch size: 441, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:38:03,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=934158.0, ans=0.0 2023-06-21 09:38:15,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-21 09:38:18,967 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:38:38,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=934218.0, ans=0.1 2023-06-21 09:38:39,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=934218.0, ans=0.04949747468305833 2023-06-21 09:38:40,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-21 09:39:07,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=934338.0, ans=0.125 2023-06-21 09:39:08,409 INFO [train.py:996] (2/4) Epoch 6, batch 3250, loss[loss=0.2552, simple_loss=0.3088, pruned_loss=0.1008, over 21744.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.333, pruned_loss=0.09323, over 4280239.74 frames. ], batch size: 282, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:39:18,216 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 2.861e+02 3.274e+02 3.932e+02 7.956e+02, threshold=6.547e+02, percent-clipped=1.0 2023-06-21 09:39:45,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=934398.0, ans=0.0 2023-06-21 09:40:12,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=934518.0, ans=0.125 2023-06-21 09:40:49,808 INFO [train.py:996] (2/4) Epoch 6, batch 3300, loss[loss=0.2257, simple_loss=0.3045, pruned_loss=0.07342, over 21388.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3262, pruned_loss=0.09313, over 4279840.97 frames. 
], batch size: 194, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:41:00,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=934638.0, ans=0.0 2023-06-21 09:41:03,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=934638.0, ans=0.0 2023-06-21 09:41:21,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=934698.0, ans=0.0 2023-06-21 09:41:21,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=934698.0, ans=0.125 2023-06-21 09:41:46,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=934758.0, ans=0.125 2023-06-21 09:42:15,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=934878.0, ans=0.2 2023-06-21 09:42:15,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.44 vs. limit=10.0 2023-06-21 09:42:30,703 INFO [train.py:996] (2/4) Epoch 6, batch 3350, loss[loss=0.2283, simple_loss=0.2973, pruned_loss=0.07969, over 21836.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3283, pruned_loss=0.09279, over 4277330.69 frames. ], batch size: 247, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:42:45,023 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.921e+02 3.413e+02 3.921e+02 6.338e+02, threshold=6.826e+02, percent-clipped=0.0 2023-06-21 09:42:53,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.47 vs. limit=22.5 2023-06-21 09:42:57,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.89 vs. limit=15.0 2023-06-21 09:43:02,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=934998.0, ans=0.125 2023-06-21 09:43:27,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-21 09:43:43,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=935118.0, ans=0.2 2023-06-21 09:44:01,607 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:44:11,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=935238.0, ans=0.125 2023-06-21 09:44:11,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=935238.0, ans=0.125 2023-06-21 09:44:17,682 INFO [train.py:996] (2/4) Epoch 6, batch 3400, loss[loss=0.2181, simple_loss=0.2838, pruned_loss=0.07618, over 21244.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3275, pruned_loss=0.09368, over 4275477.99 frames. 
], batch size: 176, lr: 5.29e-03, grad_scale: 32.0 2023-06-21 09:44:27,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935238.0, ans=0.1 2023-06-21 09:44:31,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=935238.0, ans=0.125 2023-06-21 09:44:32,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-21 09:45:01,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=935358.0, ans=0.125 2023-06-21 09:45:05,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=935358.0, ans=0.015 2023-06-21 09:46:04,895 INFO [train.py:996] (2/4) Epoch 6, batch 3450, loss[loss=0.2444, simple_loss=0.3029, pruned_loss=0.09299, over 21941.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3236, pruned_loss=0.09241, over 4275076.45 frames. ], batch size: 113, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:46:16,697 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.316e+02 2.951e+02 3.358e+02 4.026e+02 6.824e+02, threshold=6.715e+02, percent-clipped=0.0 2023-06-21 09:46:46,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.81 vs. limit=15.0 2023-06-21 09:47:07,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=935718.0, ans=15.0 2023-06-21 09:47:42,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=935778.0, ans=0.125 2023-06-21 09:47:44,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=935778.0, ans=0.0 2023-06-21 09:47:47,079 INFO [train.py:996] (2/4) Epoch 6, batch 3500, loss[loss=0.2812, simple_loss=0.347, pruned_loss=0.1077, over 21277.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3296, pruned_loss=0.09567, over 4257123.75 frames. 
], batch size: 143, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:47:47,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=935838.0, ans=0.125 2023-06-21 09:47:52,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=935838.0, ans=0.2 2023-06-21 09:47:52,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=935838.0, ans=0.125 2023-06-21 09:48:03,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=935838.0, ans=0.125 2023-06-21 09:48:17,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=935898.0, ans=0.125 2023-06-21 09:48:22,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=935958.0, ans=0.1 2023-06-21 09:48:46,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=936018.0, ans=0.1 2023-06-21 09:48:48,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=936018.0, ans=0.0 2023-06-21 09:48:51,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=936018.0, ans=0.125 2023-06-21 09:49:28,515 INFO [train.py:996] (2/4) Epoch 6, batch 3550, loss[loss=0.2295, simple_loss=0.2964, pruned_loss=0.0813, over 21726.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3324, pruned_loss=0.09725, over 4259500.85 frames. ], batch size: 351, lr: 5.29e-03, grad_scale: 16.0 2023-06-21 09:49:44,159 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.567e+02 3.119e+02 3.460e+02 4.086e+02 7.821e+02, threshold=6.921e+02, percent-clipped=5.0 2023-06-21 09:49:58,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=936198.0, ans=0.125 2023-06-21 09:51:05,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=15.0 2023-06-21 09:51:13,832 INFO [train.py:996] (2/4) Epoch 6, batch 3600, loss[loss=0.2835, simple_loss=0.3384, pruned_loss=0.1143, over 21676.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3282, pruned_loss=0.09636, over 4262537.68 frames. ], batch size: 351, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:51:18,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=936438.0, ans=0.1 2023-06-21 09:51:28,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=936438.0, ans=0.0 2023-06-21 09:51:40,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.81 vs. 
limit=10.0 2023-06-21 09:51:49,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=936558.0, ans=0.0 2023-06-21 09:52:42,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=936678.0, ans=0.125 2023-06-21 09:52:56,110 INFO [train.py:996] (2/4) Epoch 6, batch 3650, loss[loss=0.2845, simple_loss=0.3592, pruned_loss=0.1049, over 21657.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3296, pruned_loss=0.09661, over 4270857.24 frames. ], batch size: 389, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:53:01,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=936738.0, ans=0.125 2023-06-21 09:53:08,889 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 3.039e+02 3.609e+02 4.641e+02 6.973e+02, threshold=7.218e+02, percent-clipped=1.0 2023-06-21 09:53:34,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.50 vs. limit=15.0 2023-06-21 09:54:09,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=936978.0, ans=10.0 2023-06-21 09:54:21,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=936978.0, ans=0.0 2023-06-21 09:54:30,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.48 vs. limit=10.0 2023-06-21 09:54:32,633 INFO [train.py:996] (2/4) Epoch 6, batch 3700, loss[loss=0.2832, simple_loss=0.3709, pruned_loss=0.09777, over 21361.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3294, pruned_loss=0.09572, over 4272275.10 frames. ], batch size: 549, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:55:03,399 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:56:03,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=937278.0, ans=0.04949747468305833 2023-06-21 09:56:18,876 INFO [train.py:996] (2/4) Epoch 6, batch 3750, loss[loss=0.2303, simple_loss=0.308, pruned_loss=0.07635, over 21849.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3279, pruned_loss=0.09522, over 4280114.28 frames. ], batch size: 351, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:56:31,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.291e+02 2.989e+02 3.545e+02 4.107e+02 7.890e+02, threshold=7.090e+02, percent-clipped=2.0 2023-06-21 09:56:47,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=937398.0, ans=0.2 2023-06-21 09:58:01,416 INFO [train.py:996] (2/4) Epoch 6, batch 3800, loss[loss=0.244, simple_loss=0.3108, pruned_loss=0.08857, over 21785.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3251, pruned_loss=0.09308, over 4279334.17 frames. 
], batch size: 247, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:58:26,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=937698.0, ans=0.125 2023-06-21 09:58:39,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=937698.0, ans=0.95 2023-06-21 09:58:39,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=937698.0, ans=0.125 2023-06-21 09:59:05,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=937818.0, ans=0.0 2023-06-21 09:59:31,725 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 09:59:42,418 INFO [train.py:996] (2/4) Epoch 6, batch 3850, loss[loss=0.2982, simple_loss=0.4131, pruned_loss=0.09167, over 19980.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3243, pruned_loss=0.09355, over 4281624.53 frames. ], batch size: 702, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 09:59:50,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=937938.0, ans=0.125 2023-06-21 09:59:55,382 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.997e+02 3.412e+02 4.254e+02 5.791e+02 1.316e+03, threshold=8.507e+02, percent-clipped=12.0 2023-06-21 10:00:26,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=938058.0, ans=0.125 2023-06-21 10:00:32,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=938058.0, ans=0.0 2023-06-21 10:00:55,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=938118.0, ans=0.0 2023-06-21 10:01:23,217 INFO [train.py:996] (2/4) Epoch 6, batch 3900, loss[loss=0.2504, simple_loss=0.3184, pruned_loss=0.09121, over 21827.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3211, pruned_loss=0.09334, over 4276466.08 frames. ], batch size: 107, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 10:01:23,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=938238.0, ans=0.2 2023-06-21 10:02:23,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=938358.0, ans=0.125 2023-06-21 10:02:24,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=938358.0, ans=0.1 2023-06-21 10:02:28,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=938418.0, ans=0.2 2023-06-21 10:02:55,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=938478.0, ans=0.125 2023-06-21 10:03:04,373 INFO [train.py:996] (2/4) Epoch 6, batch 3950, loss[loss=0.2348, simple_loss=0.279, pruned_loss=0.09535, over 20169.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3217, pruned_loss=0.09246, over 4277115.97 frames. 
], batch size: 703, lr: 5.28e-03, grad_scale: 16.0 2023-06-21 10:03:17,124 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.976e+02 2.886e+02 3.404e+02 4.103e+02 5.613e+02, threshold=6.809e+02, percent-clipped=0.0 2023-06-21 10:03:22,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=938598.0, ans=0.0 2023-06-21 10:04:14,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-21 10:04:31,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=938778.0, ans=0.015 2023-06-21 10:04:33,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=938778.0, ans=0.125 2023-06-21 10:04:37,645 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2023-06-21 10:04:45,832 INFO [train.py:996] (2/4) Epoch 6, batch 4000, loss[loss=0.2657, simple_loss=0.3298, pruned_loss=0.1008, over 20613.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3138, pruned_loss=0.08809, over 4276450.54 frames. ], batch size: 607, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:04:54,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=938838.0, ans=0.125 2023-06-21 10:04:57,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=938838.0, ans=0.1 2023-06-21 10:05:38,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=938958.0, ans=0.1 2023-06-21 10:05:48,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=939018.0, ans=0.0 2023-06-21 10:05:59,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=939018.0, ans=0.0 2023-06-21 10:06:26,132 INFO [train.py:996] (2/4) Epoch 6, batch 4050, loss[loss=0.2644, simple_loss=0.325, pruned_loss=0.1019, over 21549.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3136, pruned_loss=0.08685, over 4277835.21 frames. ], batch size: 441, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:06:43,683 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.907e+02 3.499e+02 4.123e+02 8.601e+02, threshold=6.998e+02, percent-clipped=5.0 2023-06-21 10:07:41,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=939318.0, ans=0.2 2023-06-21 10:07:56,912 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-21 10:08:13,490 INFO [train.py:996] (2/4) Epoch 6, batch 4100, loss[loss=0.2201, simple_loss=0.3051, pruned_loss=0.06751, over 21607.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3154, pruned_loss=0.08617, over 4275130.96 frames. ], batch size: 230, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:08:16,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.40 vs. 
limit=22.5 2023-06-21 10:08:56,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=939558.0, ans=0.125 2023-06-21 10:09:24,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=939618.0, ans=0.125 2023-06-21 10:09:34,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.40 vs. limit=22.5 2023-06-21 10:09:54,975 INFO [train.py:996] (2/4) Epoch 6, batch 4150, loss[loss=0.1965, simple_loss=0.2877, pruned_loss=0.05271, over 21542.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3137, pruned_loss=0.08296, over 4280337.06 frames. ], batch size: 212, lr: 5.28e-03, grad_scale: 32.0 2023-06-21 10:10:17,697 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.779e+02 3.037e+02 3.666e+02 4.331e+02 9.059e+02, threshold=7.332e+02, percent-clipped=3.0 2023-06-21 10:10:21,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=939798.0, ans=0.0 2023-06-21 10:10:56,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=939858.0, ans=0.125 2023-06-21 10:10:56,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-21 10:11:05,934 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:11:21,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=939978.0, ans=0.1 2023-06-21 10:11:37,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=940038.0, ans=0.0 2023-06-21 10:11:44,080 INFO [train.py:996] (2/4) Epoch 6, batch 4200, loss[loss=0.2247, simple_loss=0.2965, pruned_loss=0.07647, over 21559.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3137, pruned_loss=0.08267, over 4275458.06 frames. ], batch size: 195, lr: 5.27e-03, grad_scale: 32.0 2023-06-21 10:12:01,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=940038.0, ans=0.0 2023-06-21 10:13:27,912 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-21 10:13:33,268 INFO [train.py:996] (2/4) Epoch 6, batch 4250, loss[loss=0.282, simple_loss=0.3562, pruned_loss=0.1039, over 21718.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3229, pruned_loss=0.08616, over 4266317.45 frames. ], batch size: 351, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:13:35,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=940338.0, ans=0.125 2023-06-21 10:13:35,657 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. 
limit=22.5 2023-06-21 10:13:52,369 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.192e+02 3.226e+02 3.853e+02 4.783e+02 9.792e+02, threshold=7.707e+02, percent-clipped=2.0 2023-06-21 10:13:57,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=940398.0, ans=0.035 2023-06-21 10:14:09,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=940458.0, ans=0.0 2023-06-21 10:14:10,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.13 vs. limit=22.5 2023-06-21 10:14:57,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=940578.0, ans=0.0 2023-06-21 10:14:57,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=940578.0, ans=0.125 2023-06-21 10:15:15,866 INFO [train.py:996] (2/4) Epoch 6, batch 4300, loss[loss=0.2561, simple_loss=0.3578, pruned_loss=0.0772, over 21637.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3296, pruned_loss=0.08816, over 4273642.90 frames. ], batch size: 414, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:15:30,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=940638.0, ans=0.125 2023-06-21 10:16:40,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=940878.0, ans=0.125 2023-06-21 10:16:51,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=940878.0, ans=0.0 2023-06-21 10:16:55,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-21 10:17:01,954 INFO [train.py:996] (2/4) Epoch 6, batch 4350, loss[loss=0.2471, simple_loss=0.302, pruned_loss=0.0961, over 21885.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3269, pruned_loss=0.08705, over 4274436.60 frames. ], batch size: 107, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:17:16,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.286e+02 2.990e+02 3.501e+02 4.556e+02 7.699e+02, threshold=7.002e+02, percent-clipped=0.0 2023-06-21 10:17:38,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.52 vs. limit=22.5 2023-06-21 10:17:57,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=941118.0, ans=0.1 2023-06-21 10:18:03,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=941118.0, ans=0.125 2023-06-21 10:18:14,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.00 vs. limit=15.0 2023-06-21 10:18:18,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.77 vs. limit=12.0 2023-06-21 10:18:43,535 INFO [train.py:996] (2/4) Epoch 6, batch 4400, loss[loss=0.2252, simple_loss=0.292, pruned_loss=0.07914, over 21200.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3221, pruned_loss=0.08655, over 4275327.50 frames. 
], batch size: 548, lr: 5.27e-03, grad_scale: 32.0 2023-06-21 10:18:46,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=12.0 2023-06-21 10:19:09,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=941298.0, ans=0.125 2023-06-21 10:20:13,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=941478.0, ans=0.5 2023-06-21 10:20:17,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.11 vs. limit=12.0 2023-06-21 10:20:26,309 INFO [train.py:996] (2/4) Epoch 6, batch 4450, loss[loss=0.3739, simple_loss=0.4577, pruned_loss=0.1451, over 21523.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3307, pruned_loss=0.08869, over 4271875.86 frames. ], batch size: 471, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:20:47,416 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.365e+02 2.781e+02 3.285e+02 3.917e+02 7.316e+02, threshold=6.570e+02, percent-clipped=2.0 2023-06-21 10:20:54,410 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=2.535e-03 2023-06-21 10:21:33,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=22.5 2023-06-21 10:21:39,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=941718.0, ans=0.125 2023-06-21 10:21:52,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-21 10:21:58,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=941778.0, ans=0.125 2023-06-21 10:22:06,951 INFO [train.py:996] (2/4) Epoch 6, batch 4500, loss[loss=0.2472, simple_loss=0.3311, pruned_loss=0.08168, over 21682.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3316, pruned_loss=0.09076, over 4281118.55 frames. ], batch size: 263, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:22:31,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=941898.0, ans=0.125 2023-06-21 10:22:33,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=941898.0, ans=0.1 2023-06-21 10:23:54,930 INFO [train.py:996] (2/4) Epoch 6, batch 4550, loss[loss=0.3347, simple_loss=0.3886, pruned_loss=0.1404, over 21321.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3349, pruned_loss=0.09139, over 4279542.64 frames. ], batch size: 507, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:24:16,148 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 2.720e+02 3.042e+02 3.501e+02 7.303e+02, threshold=6.084e+02, percent-clipped=1.0 2023-06-21 10:24:32,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.74 vs. 
limit=15.0 2023-06-21 10:24:50,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=942258.0, ans=0.0 2023-06-21 10:25:22,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=942378.0, ans=0.125 2023-06-21 10:25:37,669 INFO [train.py:996] (2/4) Epoch 6, batch 4600, loss[loss=0.2992, simple_loss=0.3751, pruned_loss=0.1117, over 21611.00 frames. ], tot_loss[loss=0.2631, simple_loss=0.3384, pruned_loss=0.09388, over 4282358.45 frames. ], batch size: 414, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:25:43,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-21 10:26:10,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.96 vs. limit=6.0 2023-06-21 10:26:38,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=942558.0, ans=0.09899494936611666 2023-06-21 10:27:12,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=942678.0, ans=0.125 2023-06-21 10:27:12,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.70 vs. limit=15.0 2023-06-21 10:27:18,167 INFO [train.py:996] (2/4) Epoch 6, batch 4650, loss[loss=0.1768, simple_loss=0.2546, pruned_loss=0.0495, over 21753.00 frames. ], tot_loss[loss=0.2585, simple_loss=0.3324, pruned_loss=0.09226, over 4287682.12 frames. ], batch size: 298, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:27:31,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=942738.0, ans=0.0 2023-06-21 10:27:44,642 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.173e+02 2.693e+02 3.104e+02 3.574e+02 6.080e+02, threshold=6.208e+02, percent-clipped=0.0 2023-06-21 10:27:51,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=942798.0, ans=0.125 2023-06-21 10:28:00,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=942858.0, ans=0.0 2023-06-21 10:28:03,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-21 10:28:57,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.93 vs. limit=15.0 2023-06-21 10:28:59,780 INFO [train.py:996] (2/4) Epoch 6, batch 4700, loss[loss=0.2256, simple_loss=0.2767, pruned_loss=0.08722, over 21265.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3214, pruned_loss=0.08858, over 4283115.26 frames. ], batch size: 144, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:29:25,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-21 10:30:05,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.89 vs. 
limit=15.0 2023-06-21 10:30:13,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-21 10:30:39,995 INFO [train.py:996] (2/4) Epoch 6, batch 4750, loss[loss=0.227, simple_loss=0.2965, pruned_loss=0.0788, over 21859.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3155, pruned_loss=0.08822, over 4286411.85 frames. ], batch size: 351, lr: 5.27e-03, grad_scale: 16.0 2023-06-21 10:31:00,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.759e+02 3.391e+02 4.138e+02 8.179e+02, threshold=6.782e+02, percent-clipped=2.0 2023-06-21 10:31:08,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=943398.0, ans=0.125 2023-06-21 10:31:11,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.68 vs. limit=10.0 2023-06-21 10:31:20,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=943458.0, ans=0.125 2023-06-21 10:31:28,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=943458.0, ans=0.125 2023-06-21 10:31:31,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=22.5 2023-06-21 10:31:50,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=22.5 2023-06-21 10:31:54,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=943518.0, ans=0.0 2023-06-21 10:32:04,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=943578.0, ans=0.04949747468305833 2023-06-21 10:32:16,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=943578.0, ans=0.04949747468305833 2023-06-21 10:32:20,464 INFO [train.py:996] (2/4) Epoch 6, batch 4800, loss[loss=0.2598, simple_loss=0.3634, pruned_loss=0.07814, over 21682.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3166, pruned_loss=0.08938, over 4293797.51 frames. ], batch size: 414, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:32:31,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-21 10:32:46,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=943698.0, ans=0.125 2023-06-21 10:32:53,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=943698.0, ans=0.125 2023-06-21 10:33:26,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=943818.0, ans=0.025 2023-06-21 10:34:00,641 INFO [train.py:996] (2/4) Epoch 6, batch 4850, loss[loss=0.3076, simple_loss=0.3637, pruned_loss=0.1257, over 21667.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3159, pruned_loss=0.08838, over 4294377.84 frames. 
], batch size: 441, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:34:04,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=943938.0, ans=0.125 2023-06-21 10:34:28,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.349e+02 3.010e+02 3.635e+02 4.678e+02 6.819e+02, threshold=7.270e+02, percent-clipped=1.0 2023-06-21 10:35:04,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=944118.0, ans=0.125 2023-06-21 10:35:16,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-21 10:35:23,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=944178.0, ans=0.125 2023-06-21 10:35:42,675 INFO [train.py:996] (2/4) Epoch 6, batch 4900, loss[loss=0.267, simple_loss=0.333, pruned_loss=0.1005, over 21305.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3165, pruned_loss=0.08915, over 4281495.84 frames. ], batch size: 159, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:37:05,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-21 10:37:23,904 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:37:30,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=944478.0, ans=0.02 2023-06-21 10:37:36,982 INFO [train.py:996] (2/4) Epoch 6, batch 4950, loss[loss=0.2268, simple_loss=0.321, pruned_loss=0.06629, over 21611.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.322, pruned_loss=0.08833, over 4275634.90 frames. ], batch size: 389, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:37:53,943 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.748e+02 3.392e+02 4.071e+02 6.752e+02, threshold=6.784e+02, percent-clipped=0.0 2023-06-21 10:38:23,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=944658.0, ans=0.0 2023-06-21 10:38:36,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=944718.0, ans=0.125 2023-06-21 10:38:37,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.31 vs. limit=22.5 2023-06-21 10:39:09,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=944838.0, ans=0.125 2023-06-21 10:39:10,147 INFO [train.py:996] (2/4) Epoch 6, batch 5000, loss[loss=0.245, simple_loss=0.3164, pruned_loss=0.08682, over 21469.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3212, pruned_loss=0.08456, over 4270519.34 frames. ], batch size: 211, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:40:23,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-21 10:40:43,893 INFO [train.py:996] (2/4) Epoch 6, batch 5050, loss[loss=0.2511, simple_loss=0.315, pruned_loss=0.09363, over 21559.00 frames. 
], tot_loss[loss=0.2464, simple_loss=0.3205, pruned_loss=0.08612, over 4276475.53 frames. ], batch size: 212, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:41:00,729 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.999e+02 2.826e+02 3.138e+02 3.685e+02 6.329e+02, threshold=6.276e+02, percent-clipped=0.0 2023-06-21 10:41:04,470 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 10:41:10,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=945198.0, ans=0.125 2023-06-21 10:41:33,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. limit=6.0 2023-06-21 10:42:01,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=945378.0, ans=0.0 2023-06-21 10:42:16,836 INFO [train.py:996] (2/4) Epoch 6, batch 5100, loss[loss=0.2632, simple_loss=0.3761, pruned_loss=0.07516, over 19796.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3199, pruned_loss=0.08607, over 4278022.66 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:42:27,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=945438.0, ans=0.125 2023-06-21 10:42:29,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=945438.0, ans=0.0 2023-06-21 10:42:45,596 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=12.0 2023-06-21 10:43:20,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=945618.0, ans=0.125 2023-06-21 10:43:37,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=945678.0, ans=0.2 2023-06-21 10:43:44,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=945678.0, ans=0.07 2023-06-21 10:43:56,820 INFO [train.py:996] (2/4) Epoch 6, batch 5150, loss[loss=0.2377, simple_loss=0.32, pruned_loss=0.07767, over 21838.00 frames. ], tot_loss[loss=0.246, simple_loss=0.318, pruned_loss=0.08695, over 4286780.05 frames. ], batch size: 332, lr: 5.26e-03, grad_scale: 16.0 2023-06-21 10:44:08,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-21 10:44:16,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=945798.0, ans=0.2 2023-06-21 10:44:19,132 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.377e+02 2.990e+02 3.465e+02 4.317e+02 6.616e+02, threshold=6.931e+02, percent-clipped=1.0 2023-06-21 10:44:20,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.36 vs. 
limit=22.5 2023-06-21 10:44:29,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=945798.0, ans=0.125 2023-06-21 10:44:32,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=945858.0, ans=0.1 2023-06-21 10:45:16,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=945978.0, ans=0.0 2023-06-21 10:45:26,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=945978.0, ans=0.1 2023-06-21 10:45:42,243 INFO [train.py:996] (2/4) Epoch 6, batch 5200, loss[loss=0.24, simple_loss=0.3258, pruned_loss=0.07709, over 21410.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3223, pruned_loss=0.08865, over 4287085.41 frames. ], batch size: 194, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:46:14,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=946098.0, ans=0.0 2023-06-21 10:47:21,864 INFO [train.py:996] (2/4) Epoch 6, batch 5250, loss[loss=0.2548, simple_loss=0.3421, pruned_loss=0.08369, over 21740.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3254, pruned_loss=0.08675, over 4272470.02 frames. ], batch size: 414, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:47:23,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=946338.0, ans=0.125 2023-06-21 10:47:39,387 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.150e+02 2.923e+02 3.658e+02 4.333e+02 7.638e+02, threshold=7.316e+02, percent-clipped=1.0 2023-06-21 10:48:15,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=946458.0, ans=0.125 2023-06-21 10:48:45,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=946578.0, ans=0.2 2023-06-21 10:48:59,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=946638.0, ans=0.025 2023-06-21 10:49:00,869 INFO [train.py:996] (2/4) Epoch 6, batch 5300, loss[loss=0.2064, simple_loss=0.3162, pruned_loss=0.04824, over 19726.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3238, pruned_loss=0.08615, over 4266400.64 frames. ], batch size: 703, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:49:05,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=946638.0, ans=0.2 2023-06-21 10:49:06,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=946638.0, ans=0.125 2023-06-21 10:49:14,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-21 10:49:33,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2023-06-21 10:49:58,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.42 vs. 
limit=12.0 2023-06-21 10:50:18,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.15 vs. limit=6.0 2023-06-21 10:50:19,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-06-21 10:50:39,183 INFO [train.py:996] (2/4) Epoch 6, batch 5350, loss[loss=0.2348, simple_loss=0.3041, pruned_loss=0.08276, over 21552.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3211, pruned_loss=0.08713, over 4279172.22 frames. ], batch size: 131, lr: 5.26e-03, grad_scale: 32.0 2023-06-21 10:50:49,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=946938.0, ans=0.0 2023-06-21 10:50:50,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=946938.0, ans=0.2 2023-06-21 10:50:58,327 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.908e+02 3.182e+02 3.564e+02 5.714e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-21 10:51:11,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=946998.0, ans=0.125 2023-06-21 10:51:33,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=947058.0, ans=0.2 2023-06-21 10:51:36,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=947118.0, ans=0.125 2023-06-21 10:51:38,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=947118.0, ans=0.0 2023-06-21 10:52:02,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=947178.0, ans=0.0 2023-06-21 10:52:18,104 INFO [train.py:996] (2/4) Epoch 6, batch 5400, loss[loss=0.2403, simple_loss=0.3189, pruned_loss=0.08082, over 21734.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3198, pruned_loss=0.08828, over 4285895.57 frames. ], batch size: 441, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:52:19,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.18 vs. limit=15.0 2023-06-21 10:52:33,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=947298.0, ans=0.125 2023-06-21 10:52:49,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.64 vs. limit=15.0 2023-06-21 10:53:08,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=947358.0, ans=0.125 2023-06-21 10:53:16,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=947418.0, ans=0.125 2023-06-21 10:53:36,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=947478.0, ans=0.125 2023-06-21 10:53:53,504 INFO [train.py:996] (2/4) Epoch 6, batch 5450, loss[loss=0.2324, simple_loss=0.309, pruned_loss=0.07787, over 21177.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3193, pruned_loss=0.08643, over 4283614.62 frames. 
], batch size: 143, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:54:17,122 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.387e+02 2.977e+02 3.575e+02 4.401e+02 6.671e+02, threshold=7.149e+02, percent-clipped=1.0 2023-06-21 10:54:22,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.09 vs. limit=15.0 2023-06-21 10:55:34,847 INFO [train.py:996] (2/4) Epoch 6, batch 5500, loss[loss=0.2293, simple_loss=0.3252, pruned_loss=0.06663, over 21706.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3243, pruned_loss=0.08395, over 4287807.81 frames. ], batch size: 298, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:55:58,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. limit=10.0 2023-06-21 10:56:19,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=947898.0, ans=0.0 2023-06-21 10:56:36,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=947958.0, ans=0.1 2023-06-21 10:56:44,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5 2023-06-21 10:57:20,580 INFO [train.py:996] (2/4) Epoch 6, batch 5550, loss[loss=0.2751, simple_loss=0.3686, pruned_loss=0.09086, over 21434.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3227, pruned_loss=0.08113, over 4277136.42 frames. ], batch size: 471, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 10:57:37,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=948138.0, ans=0.125 2023-06-21 10:57:45,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.966e+02 2.700e+02 3.215e+02 3.984e+02 5.956e+02, threshold=6.431e+02, percent-clipped=0.0 2023-06-21 10:57:48,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.49 vs. limit=10.0 2023-06-21 10:57:54,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=948198.0, ans=0.125 2023-06-21 10:58:03,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=948198.0, ans=0.125 2023-06-21 10:59:07,574 INFO [train.py:996] (2/4) Epoch 6, batch 5600, loss[loss=0.3296, simple_loss=0.4217, pruned_loss=0.1187, over 21523.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3219, pruned_loss=0.07885, over 4280969.24 frames. ], batch size: 471, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 10:59:26,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=948438.0, ans=0.125 2023-06-21 11:00:21,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.54 vs. 
limit=15.0 2023-06-21 11:00:28,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=948678.0, ans=15.0 2023-06-21 11:00:43,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=948678.0, ans=0.125 2023-06-21 11:00:45,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=948738.0, ans=0.125 2023-06-21 11:00:46,536 INFO [train.py:996] (2/4) Epoch 6, batch 5650, loss[loss=0.2258, simple_loss=0.3018, pruned_loss=0.07489, over 21872.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3256, pruned_loss=0.08155, over 4280444.41 frames. ], batch size: 298, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:01:06,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=948798.0, ans=0.0 2023-06-21 11:01:10,943 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.904e+02 3.593e+02 4.833e+02 7.419e+02, threshold=7.185e+02, percent-clipped=8.0 2023-06-21 11:01:18,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-21 11:01:19,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=948798.0, ans=0.1 2023-06-21 11:01:41,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=948858.0, ans=0.0 2023-06-21 11:01:47,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=948918.0, ans=0.2 2023-06-21 11:02:28,411 INFO [train.py:996] (2/4) Epoch 6, batch 5700, loss[loss=0.2907, simple_loss=0.3719, pruned_loss=0.1047, over 21562.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3272, pruned_loss=0.08352, over 4282998.36 frames. ], batch size: 441, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:02:50,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=949098.0, ans=0.125 2023-06-21 11:04:12,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-06-21 11:04:14,747 INFO [train.py:996] (2/4) Epoch 6, batch 5750, loss[loss=0.1607, simple_loss=0.2367, pruned_loss=0.04238, over 21359.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3237, pruned_loss=0.08088, over 4280678.55 frames. ], batch size: 176, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:04:27,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=949338.0, ans=10.0 2023-06-21 11:04:34,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.088e+02 2.640e+02 3.214e+02 3.808e+02 7.764e+02, threshold=6.428e+02, percent-clipped=2.0 2023-06-21 11:05:56,174 INFO [train.py:996] (2/4) Epoch 6, batch 5800, loss[loss=0.2473, simple_loss=0.3485, pruned_loss=0.07304, over 21815.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3213, pruned_loss=0.07909, over 4274806.68 frames. 
], batch size: 371, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:06:07,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=949638.0, ans=0.125 2023-06-21 11:07:21,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=949878.0, ans=10.0 2023-06-21 11:07:31,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=949878.0, ans=0.025 2023-06-21 11:07:38,528 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=15.0 2023-06-21 11:07:39,076 INFO [train.py:996] (2/4) Epoch 6, batch 5850, loss[loss=0.1899, simple_loss=0.2862, pruned_loss=0.04679, over 21421.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3181, pruned_loss=0.07484, over 4276204.55 frames. ], batch size: 211, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:07:42,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=949938.0, ans=0.1 2023-06-21 11:07:54,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=949938.0, ans=0.0 2023-06-21 11:07:59,599 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-21 11:08:03,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.609e+02 2.386e+02 2.763e+02 3.432e+02 5.220e+02, threshold=5.525e+02, percent-clipped=0.0 2023-06-21 11:09:01,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=950178.0, ans=0.2 2023-06-21 11:09:03,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=950178.0, ans=0.125 2023-06-21 11:09:05,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=950178.0, ans=0.125 2023-06-21 11:09:12,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=950178.0, ans=0.125 2023-06-21 11:09:18,229 INFO [train.py:996] (2/4) Epoch 6, batch 5900, loss[loss=0.2231, simple_loss=0.3009, pruned_loss=0.07264, over 21892.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.3099, pruned_loss=0.06938, over 4277207.52 frames. ], batch size: 371, lr: 5.25e-03, grad_scale: 32.0 2023-06-21 11:09:19,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.13 vs. 
limit=12.0 2023-06-21 11:09:20,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=950238.0, ans=0.125 2023-06-21 11:09:53,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=950298.0, ans=0.0 2023-06-21 11:10:14,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=950358.0, ans=0.2 2023-06-21 11:10:33,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950418.0, ans=0.1 2023-06-21 11:10:54,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=950478.0, ans=0.1 2023-06-21 11:10:56,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=950538.0, ans=10.0 2023-06-21 11:10:57,248 INFO [train.py:996] (2/4) Epoch 6, batch 5950, loss[loss=0.2262, simple_loss=0.289, pruned_loss=0.08175, over 21408.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3083, pruned_loss=0.07266, over 4278837.14 frames. ], batch size: 194, lr: 5.25e-03, grad_scale: 16.0 2023-06-21 11:11:22,250 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.416e+02 2.586e+02 3.198e+02 3.905e+02 6.345e+02, threshold=6.395e+02, percent-clipped=3.0 2023-06-21 11:11:38,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=950658.0, ans=0.05 2023-06-21 11:11:43,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=950658.0, ans=0.125 2023-06-21 11:11:53,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=950658.0, ans=0.0 2023-06-21 11:12:11,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=950718.0, ans=0.0 2023-06-21 11:12:23,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=950778.0, ans=0.125 2023-06-21 11:12:39,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.91 vs. limit=15.0 2023-06-21 11:12:40,621 INFO [train.py:996] (2/4) Epoch 6, batch 6000, loss[loss=0.2131, simple_loss=0.2714, pruned_loss=0.07734, over 21270.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.305, pruned_loss=0.07692, over 4265754.51 frames. ], batch size: 176, lr: 5.24e-03, grad_scale: 32.0 2023-06-21 11:12:40,621 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 11:12:57,298 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2656, simple_loss=0.3626, pruned_loss=0.08426, over 1796401.00 frames. 
2023-06-21 11:12:57,299 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 11:12:57,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=950838.0, ans=0.1 2023-06-21 11:13:23,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=950898.0, ans=0.0 2023-06-21 11:14:28,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=951078.0, ans=0.0 2023-06-21 11:14:43,512 INFO [train.py:996] (2/4) Epoch 6, batch 6050, loss[loss=0.2073, simple_loss=0.2681, pruned_loss=0.07327, over 21616.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3002, pruned_loss=0.07758, over 4262869.82 frames. ], batch size: 298, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:14:56,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=951138.0, ans=0.125 2023-06-21 11:15:16,953 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.041e+02 2.804e+02 3.230e+02 3.761e+02 6.873e+02, threshold=6.459e+02, percent-clipped=1.0 2023-06-21 11:15:29,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=951258.0, ans=0.0 2023-06-21 11:16:06,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-21 11:16:16,220 INFO [train.py:996] (2/4) Epoch 6, batch 6100, loss[loss=0.2408, simple_loss=0.3179, pruned_loss=0.08184, over 21712.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2991, pruned_loss=0.07658, over 4264412.34 frames. ], batch size: 389, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:17:28,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=12.0 2023-06-21 11:17:42,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-21 11:17:55,298 INFO [train.py:996] (2/4) Epoch 6, batch 6150, loss[loss=0.2442, simple_loss=0.3103, pruned_loss=0.08904, over 21750.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3032, pruned_loss=0.08003, over 4276599.56 frames. ], batch size: 124, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:18:16,609 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:18:33,351 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.810e+02 2.602e+02 3.011e+02 3.655e+02 5.167e+02, threshold=6.022e+02, percent-clipped=0.0 2023-06-21 11:19:22,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=951978.0, ans=0.1 2023-06-21 11:19:39,851 INFO [train.py:996] (2/4) Epoch 6, batch 6200, loss[loss=0.2648, simple_loss=0.3494, pruned_loss=0.09004, over 21494.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3072, pruned_loss=0.08029, over 4277210.61 frames. 
], batch size: 548, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:20:06,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=952098.0, ans=0.2 2023-06-21 11:20:36,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=952158.0, ans=0.1 2023-06-21 11:20:48,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=952218.0, ans=0.035 2023-06-21 11:20:53,503 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:21:14,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-21 11:21:20,441 INFO [train.py:996] (2/4) Epoch 6, batch 6250, loss[loss=0.2631, simple_loss=0.365, pruned_loss=0.08054, over 21769.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3139, pruned_loss=0.08023, over 4282814.25 frames. ], batch size: 332, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:21:51,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=952398.0, ans=0.125 2023-06-21 11:21:53,728 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.972e+02 3.114e+02 4.042e+02 5.400e+02 9.374e+02, threshold=8.084e+02, percent-clipped=17.0 2023-06-21 11:22:07,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=952458.0, ans=0.125 2023-06-21 11:22:18,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=952458.0, ans=0.125 2023-06-21 11:22:22,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=952518.0, ans=0.0 2023-06-21 11:22:29,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=952518.0, ans=0.125 2023-06-21 11:22:39,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=952578.0, ans=0.5 2023-06-21 11:22:52,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=952578.0, ans=0.2 2023-06-21 11:22:58,354 INFO [train.py:996] (2/4) Epoch 6, batch 6300, loss[loss=0.2564, simple_loss=0.3207, pruned_loss=0.09602, over 21932.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.316, pruned_loss=0.07914, over 4284841.95 frames. 
], batch size: 113, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:23:17,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=952638.0, ans=0.07 2023-06-21 11:23:48,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=952758.0, ans=0.125 2023-06-21 11:23:51,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=952758.0, ans=0.125 2023-06-21 11:24:01,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=952818.0, ans=0.015 2023-06-21 11:24:17,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-21 11:24:27,917 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:24:34,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=952878.0, ans=0.2 2023-06-21 11:24:48,256 INFO [train.py:996] (2/4) Epoch 6, batch 6350, loss[loss=0.2584, simple_loss=0.3231, pruned_loss=0.09686, over 21622.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3208, pruned_loss=0.08395, over 4291227.45 frames. ], batch size: 263, lr: 5.24e-03, grad_scale: 8.0 2023-06-21 11:24:57,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=952938.0, ans=0.2 2023-06-21 11:25:06,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=952998.0, ans=0.0 2023-06-21 11:25:17,181 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.015e+02 3.069e+02 3.635e+02 4.276e+02 7.885e+02, threshold=7.269e+02, percent-clipped=0.0 2023-06-21 11:26:00,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.66 vs. limit=8.0 2023-06-21 11:26:28,477 INFO [train.py:996] (2/4) Epoch 6, batch 6400, loss[loss=0.2881, simple_loss=0.3492, pruned_loss=0.1135, over 21361.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3263, pruned_loss=0.08864, over 4294084.91 frames. ], batch size: 548, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:26:29,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953238.0, ans=0.1 2023-06-21 11:26:59,327 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:27:10,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953358.0, ans=0.1 2023-06-21 11:27:41,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=953418.0, ans=0.125 2023-06-21 11:28:12,574 INFO [train.py:996] (2/4) Epoch 6, batch 6450, loss[loss=0.2142, simple_loss=0.2793, pruned_loss=0.07459, over 21829.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3286, pruned_loss=0.08781, over 4295780.67 frames. ], batch size: 107, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:28:24,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.47 vs. 
limit=15.0 2023-06-21 11:28:36,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.907e+02 2.859e+02 3.374e+02 4.192e+02 6.332e+02, threshold=6.748e+02, percent-clipped=0.0 2023-06-21 11:28:58,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.37 vs. limit=10.0 2023-06-21 11:29:54,962 INFO [train.py:996] (2/4) Epoch 6, batch 6500, loss[loss=0.323, simple_loss=0.4194, pruned_loss=0.1133, over 19747.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3232, pruned_loss=0.08571, over 4281172.50 frames. ], batch size: 703, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:30:01,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=953838.0, ans=0.0 2023-06-21 11:30:23,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=953898.0, ans=0.1 2023-06-21 11:30:23,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-21 11:31:10,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.04 vs. limit=6.0 2023-06-21 11:31:35,563 INFO [train.py:996] (2/4) Epoch 6, batch 6550, loss[loss=0.2537, simple_loss=0.3234, pruned_loss=0.092, over 21707.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3212, pruned_loss=0.08473, over 4267928.03 frames. ], batch size: 389, lr: 5.24e-03, grad_scale: 16.0 2023-06-21 11:31:50,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=954198.0, ans=0.035 2023-06-21 11:31:59,242 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.779e+02 3.082e+02 3.818e+02 7.032e+02, threshold=6.164e+02, percent-clipped=1.0 2023-06-21 11:32:19,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-21 11:32:26,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=954258.0, ans=0.0 2023-06-21 11:33:14,285 INFO [train.py:996] (2/4) Epoch 6, batch 6600, loss[loss=0.2242, simple_loss=0.2843, pruned_loss=0.08199, over 21563.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3152, pruned_loss=0.0842, over 4273299.91 frames. ], batch size: 391, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:33:25,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=954438.0, ans=0.0 2023-06-21 11:34:11,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=954618.0, ans=0.125 2023-06-21 11:34:25,075 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.35 vs. 
limit=15.0 2023-06-21 11:34:42,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=954678.0, ans=0.2 2023-06-21 11:34:43,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=954678.0, ans=0.125 2023-06-21 11:34:52,656 INFO [train.py:996] (2/4) Epoch 6, batch 6650, loss[loss=0.2122, simple_loss=0.2761, pruned_loss=0.07419, over 21596.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3062, pruned_loss=0.08131, over 4273869.94 frames. ], batch size: 247, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:35:21,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.023e+02 2.585e+02 3.021e+02 3.677e+02 6.066e+02, threshold=6.041e+02, percent-clipped=0.0 2023-06-21 11:36:07,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.33 vs. limit=15.0 2023-06-21 11:36:17,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=954978.0, ans=0.0 2023-06-21 11:36:30,492 INFO [train.py:996] (2/4) Epoch 6, batch 6700, loss[loss=0.2114, simple_loss=0.2854, pruned_loss=0.06866, over 21551.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3005, pruned_loss=0.08149, over 4272105.07 frames. ], batch size: 230, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:36:47,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=955098.0, ans=0.0 2023-06-21 11:36:50,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=22.5 2023-06-21 11:37:01,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=955098.0, ans=0.0 2023-06-21 11:38:08,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.12 vs. limit=12.0 2023-06-21 11:38:09,253 INFO [train.py:996] (2/4) Epoch 6, batch 6750, loss[loss=0.2494, simple_loss=0.3079, pruned_loss=0.09548, over 21782.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.2994, pruned_loss=0.08241, over 4278402.00 frames. ], batch size: 112, lr: 5.23e-03, grad_scale: 16.0 2023-06-21 11:38:29,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=955398.0, ans=0.125 2023-06-21 11:38:32,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.018e+02 2.820e+02 3.249e+02 3.969e+02 6.943e+02, threshold=6.498e+02, percent-clipped=2.0 2023-06-21 11:38:34,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.29 vs. 
limit=10.0 2023-06-21 11:38:49,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=955458.0, ans=0.07 2023-06-21 11:39:29,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=955578.0, ans=0.09899494936611666 2023-06-21 11:39:41,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=955578.0, ans=0.07 2023-06-21 11:39:47,099 INFO [train.py:996] (2/4) Epoch 6, batch 6800, loss[loss=0.2638, simple_loss=0.3152, pruned_loss=0.1062, over 21743.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.304, pruned_loss=0.08628, over 4275996.11 frames. ], batch size: 112, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:39:53,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=955638.0, ans=0.0 2023-06-21 11:40:01,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=955698.0, ans=0.125 2023-06-21 11:41:24,281 INFO [train.py:996] (2/4) Epoch 6, batch 6850, loss[loss=0.2223, simple_loss=0.291, pruned_loss=0.0768, over 21895.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3026, pruned_loss=0.08669, over 4278071.01 frames. ], batch size: 316, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:41:48,462 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.884e+02 3.443e+02 4.128e+02 6.086e+02, threshold=6.887e+02, percent-clipped=0.0 2023-06-21 11:41:54,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=955998.0, ans=0.0 2023-06-21 11:42:10,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=956058.0, ans=0.2 2023-06-21 11:42:33,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.96 vs. limit=15.0 2023-06-21 11:43:04,885 INFO [train.py:996] (2/4) Epoch 6, batch 6900, loss[loss=0.209, simple_loss=0.2884, pruned_loss=0.06483, over 21234.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3033, pruned_loss=0.08621, over 4279977.20 frames. ], batch size: 159, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:43:13,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=956238.0, ans=0.2 2023-06-21 11:43:27,122 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:44:20,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=956418.0, ans=0.1 2023-06-21 11:44:45,083 INFO [train.py:996] (2/4) Epoch 6, batch 6950, loss[loss=0.2327, simple_loss=0.2914, pruned_loss=0.08696, over 21272.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3071, pruned_loss=0.08284, over 4282046.08 frames. ], batch size: 143, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:44:49,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.54 vs. 
limit=15.0 2023-06-21 11:45:13,818 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.672e+02 2.660e+02 3.171e+02 3.598e+02 5.873e+02, threshold=6.343e+02, percent-clipped=0.0 2023-06-21 11:45:16,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=956598.0, ans=0.07 2023-06-21 11:45:40,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=956658.0, ans=0.125 2023-06-21 11:46:13,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-21 11:46:24,079 INFO [train.py:996] (2/4) Epoch 6, batch 7000, loss[loss=0.2479, simple_loss=0.3121, pruned_loss=0.09184, over 15400.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3112, pruned_loss=0.08647, over 4264843.30 frames. ], batch size: 61, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:46:34,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=956838.0, ans=0.125 2023-06-21 11:47:46,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=957018.0, ans=0.0 2023-06-21 11:48:03,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=957138.0, ans=0.125 2023-06-21 11:48:04,835 INFO [train.py:996] (2/4) Epoch 6, batch 7050, loss[loss=0.197, simple_loss=0.2929, pruned_loss=0.05056, over 21833.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3088, pruned_loss=0.08512, over 4263443.51 frames. ], batch size: 371, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:48:11,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=957138.0, ans=0.2 2023-06-21 11:48:15,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=957138.0, ans=0.2 2023-06-21 11:48:23,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=957138.0, ans=0.0 2023-06-21 11:48:38,899 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.010e+02 3.429e+02 4.410e+02 6.547e+02, threshold=6.858e+02, percent-clipped=1.0 2023-06-21 11:48:49,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=957198.0, ans=0.2 2023-06-21 11:49:18,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=957318.0, ans=0.125 2023-06-21 11:49:24,824 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:49:45,427 INFO [train.py:996] (2/4) Epoch 6, batch 7100, loss[loss=0.3211, simple_loss=0.3776, pruned_loss=0.1323, over 21457.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3144, pruned_loss=0.08829, over 4258140.06 frames. ], batch size: 509, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:51:05,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=957618.0, ans=0.125 2023-06-21 11:51:25,109 INFO [train.py:996] (2/4) Epoch 6, batch 7150, loss[loss=0.2554, simple_loss=0.3409, pruned_loss=0.08496, over 16663.00 frames. 
], tot_loss[loss=0.239, simple_loss=0.3096, pruned_loss=0.08418, over 4249969.12 frames. ], batch size: 60, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:51:51,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=957738.0, ans=0.2 2023-06-21 11:52:08,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.713e+02 3.100e+02 3.583e+02 6.411e+02, threshold=6.200e+02, percent-clipped=0.0 2023-06-21 11:52:14,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=15.0 2023-06-21 11:52:17,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=957858.0, ans=0.2 2023-06-21 11:52:36,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=957918.0, ans=0.1 2023-06-21 11:52:52,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=957978.0, ans=0.125 2023-06-21 11:53:14,740 INFO [train.py:996] (2/4) Epoch 6, batch 7200, loss[loss=0.2404, simple_loss=0.301, pruned_loss=0.08985, over 21416.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3137, pruned_loss=0.08768, over 4261993.31 frames. ], batch size: 131, lr: 5.23e-03, grad_scale: 32.0 2023-06-21 11:53:15,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=958038.0, ans=0.125 2023-06-21 11:53:47,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.88 vs. limit=15.0 2023-06-21 11:53:53,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=958098.0, ans=22.5 2023-06-21 11:53:59,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=958158.0, ans=0.125 2023-06-21 11:54:02,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=958158.0, ans=0.125 2023-06-21 11:54:09,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=12.0 2023-06-21 11:54:54,083 INFO [train.py:996] (2/4) Epoch 6, batch 7250, loss[loss=0.2576, simple_loss=0.3164, pruned_loss=0.09936, over 21798.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3095, pruned_loss=0.08734, over 4267092.42 frames. ], batch size: 107, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 11:55:28,083 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 2.745e+02 3.061e+02 4.034e+02 7.842e+02, threshold=6.122e+02, percent-clipped=5.0 2023-06-21 11:55:30,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=958398.0, ans=0.1 2023-06-21 11:55:46,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.29 vs. 
limit=22.5 2023-06-21 11:56:03,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=958518.0, ans=0.1 2023-06-21 11:56:05,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=958518.0, ans=0.0 2023-06-21 11:56:33,721 INFO [train.py:996] (2/4) Epoch 6, batch 7300, loss[loss=0.2309, simple_loss=0.288, pruned_loss=0.08691, over 21575.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3031, pruned_loss=0.08579, over 4269992.64 frames. ], batch size: 247, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 11:56:43,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=958638.0, ans=0.04949747468305833 2023-06-21 11:57:16,652 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 11:57:47,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=958878.0, ans=0.125 2023-06-21 11:58:21,347 INFO [train.py:996] (2/4) Epoch 6, batch 7350, loss[loss=0.2801, simple_loss=0.3563, pruned_loss=0.1019, over 21860.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3019, pruned_loss=0.08591, over 4261629.26 frames. ], batch size: 124, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 11:58:41,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=958998.0, ans=0.0 2023-06-21 11:58:44,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=958998.0, ans=0.0 2023-06-21 11:58:46,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-06-21 11:58:52,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.271e+02 2.849e+02 3.277e+02 4.064e+02 7.126e+02, threshold=6.555e+02, percent-clipped=2.0 2023-06-21 12:00:04,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=959178.0, ans=0.125 2023-06-21 12:00:07,535 INFO [train.py:996] (2/4) Epoch 6, batch 7400, loss[loss=0.2501, simple_loss=0.3391, pruned_loss=0.08056, over 21827.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.308, pruned_loss=0.08866, over 4269952.93 frames. ], batch size: 372, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:00:44,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=959358.0, ans=0.125 2023-06-21 12:00:52,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=959358.0, ans=0.04949747468305833 2023-06-21 12:01:25,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=959418.0, ans=0.125 2023-06-21 12:01:28,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=959478.0, ans=0.125 2023-06-21 12:01:48,242 INFO [train.py:996] (2/4) Epoch 6, batch 7450, loss[loss=0.2152, simple_loss=0.2843, pruned_loss=0.07303, over 21624.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3067, pruned_loss=0.08728, over 4270921.84 frames. 
], batch size: 298, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:02:13,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 2.784e+02 3.192e+02 3.792e+02 7.564e+02, threshold=6.383e+02, percent-clipped=1.0 2023-06-21 12:03:16,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=959778.0, ans=0.125 2023-06-21 12:03:31,323 INFO [train.py:996] (2/4) Epoch 6, batch 7500, loss[loss=0.26, simple_loss=0.3511, pruned_loss=0.08446, over 21443.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3114, pruned_loss=0.08914, over 4270623.51 frames. ], batch size: 211, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:03:50,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=959898.0, ans=0.0 2023-06-21 12:04:11,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=959958.0, ans=0.1 2023-06-21 12:04:19,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-21 12:04:39,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=960018.0, ans=0.5 2023-06-21 12:05:16,823 INFO [train.py:996] (2/4) Epoch 6, batch 7550, loss[loss=0.1728, simple_loss=0.2427, pruned_loss=0.05145, over 15842.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3167, pruned_loss=0.08742, over 4266324.97 frames. ], batch size: 62, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:05:19,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=960138.0, ans=0.0 2023-06-21 12:05:41,926 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 3.161e+02 3.642e+02 4.665e+02 7.611e+02, threshold=7.284e+02, percent-clipped=6.0 2023-06-21 12:06:53,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=960378.0, ans=0.125 2023-06-21 12:06:56,349 INFO [train.py:996] (2/4) Epoch 6, batch 7600, loss[loss=0.239, simple_loss=0.3024, pruned_loss=0.08781, over 21584.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3176, pruned_loss=0.08715, over 4271234.11 frames. ], batch size: 548, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:07:03,034 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:07:07,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=960438.0, ans=0.2 2023-06-21 12:08:20,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=960678.0, ans=0.0 2023-06-21 12:08:34,753 INFO [train.py:996] (2/4) Epoch 6, batch 7650, loss[loss=0.2655, simple_loss=0.3642, pruned_loss=0.0834, over 20087.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3161, pruned_loss=0.08776, over 4276350.81 frames. ], batch size: 703, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:08:49,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. 
limit=15.0 2023-06-21 12:08:52,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=960798.0, ans=0.07 2023-06-21 12:09:01,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 3.034e+02 3.412e+02 4.046e+02 6.566e+02, threshold=6.823e+02, percent-clipped=0.0 2023-06-21 12:09:59,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=960978.0, ans=0.04949747468305833 2023-06-21 12:10:17,246 INFO [train.py:996] (2/4) Epoch 6, batch 7700, loss[loss=0.3227, simple_loss=0.3805, pruned_loss=0.1325, over 21622.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3205, pruned_loss=0.0915, over 4279192.25 frames. ], batch size: 389, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:10:34,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=961098.0, ans=0.125 2023-06-21 12:11:32,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=961218.0, ans=0.0 2023-06-21 12:11:43,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=961278.0, ans=0.125 2023-06-21 12:11:59,181 INFO [train.py:996] (2/4) Epoch 6, batch 7750, loss[loss=0.2865, simple_loss=0.3765, pruned_loss=0.09826, over 21762.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3272, pruned_loss=0.09175, over 4277776.60 frames. ], batch size: 282, lr: 5.22e-03, grad_scale: 32.0 2023-06-21 12:12:34,948 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.093e+02 3.114e+02 3.576e+02 4.204e+02 7.368e+02, threshold=7.152e+02, percent-clipped=1.0 2023-06-21 12:13:22,929 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:13:29,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=961578.0, ans=0.125 2023-06-21 12:13:39,892 INFO [train.py:996] (2/4) Epoch 6, batch 7800, loss[loss=0.2636, simple_loss=0.3315, pruned_loss=0.09785, over 21761.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3314, pruned_loss=0.09306, over 4275136.43 frames. ], batch size: 282, lr: 5.22e-03, grad_scale: 16.0 2023-06-21 12:14:18,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.25 vs. limit=15.0 2023-06-21 12:14:27,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=961698.0, ans=0.1 2023-06-21 12:14:40,481 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=15.0 2023-06-21 12:15:09,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=961878.0, ans=0.2 2023-06-21 12:15:18,199 INFO [train.py:996] (2/4) Epoch 6, batch 7850, loss[loss=0.2322, simple_loss=0.2901, pruned_loss=0.08721, over 21616.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3224, pruned_loss=0.09166, over 4263205.78 frames. 
], batch size: 298, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:15:18,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=961938.0, ans=0.0 2023-06-21 12:15:37,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=961998.0, ans=0.125 2023-06-21 12:16:02,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.425e+02 2.926e+02 3.453e+02 4.214e+02 9.317e+02, threshold=6.905e+02, percent-clipped=1.0 2023-06-21 12:16:27,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=962118.0, ans=0.0 2023-06-21 12:16:51,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=962178.0, ans=0.125 2023-06-21 12:17:01,343 INFO [train.py:996] (2/4) Epoch 6, batch 7900, loss[loss=0.2049, simple_loss=0.2698, pruned_loss=0.07002, over 21143.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3163, pruned_loss=0.08975, over 4254980.05 frames. ], batch size: 143, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:18:06,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=962358.0, ans=0.125 2023-06-21 12:18:15,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=962418.0, ans=0.125 2023-06-21 12:18:31,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=962478.0, ans=0.2 2023-06-21 12:18:51,638 INFO [train.py:996] (2/4) Epoch 6, batch 7950, loss[loss=0.3119, simple_loss=0.3807, pruned_loss=0.1216, over 21750.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3199, pruned_loss=0.08859, over 4256890.40 frames. ], batch size: 441, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:18:52,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=962538.0, ans=0.07 2023-06-21 12:19:15,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=962538.0, ans=0.0 2023-06-21 12:19:29,549 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.499e+02 3.498e+02 4.372e+02 5.089e+02 1.068e+03, threshold=8.743e+02, percent-clipped=8.0 2023-06-21 12:20:44,515 INFO [train.py:996] (2/4) Epoch 6, batch 8000, loss[loss=0.2719, simple_loss=0.3457, pruned_loss=0.09903, over 21445.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3243, pruned_loss=0.09038, over 4250965.10 frames. ], batch size: 194, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:20:50,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=962838.0, ans=0.07 2023-06-21 12:20:55,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=12.0 2023-06-21 12:21:02,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=962898.0, ans=0.0 2023-06-21 12:21:07,465 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:21:21,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.85 vs. 
limit=15.0 2023-06-21 12:21:36,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=962958.0, ans=0.0 2023-06-21 12:22:10,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-21 12:22:28,321 INFO [train.py:996] (2/4) Epoch 6, batch 8050, loss[loss=0.2719, simple_loss=0.3618, pruned_loss=0.091, over 21692.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3304, pruned_loss=0.09117, over 4258715.89 frames. ], batch size: 389, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:22:43,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=963198.0, ans=0.0 2023-06-21 12:22:55,405 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.280e+02 2.992e+02 3.416e+02 4.104e+02 8.130e+02, threshold=6.832e+02, percent-clipped=0.0 2023-06-21 12:23:47,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=963318.0, ans=0.125 2023-06-21 12:23:48,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=963318.0, ans=0.125 2023-06-21 12:23:58,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-21 12:24:07,624 INFO [train.py:996] (2/4) Epoch 6, batch 8100, loss[loss=0.2913, simple_loss=0.3495, pruned_loss=0.1166, over 21538.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3281, pruned_loss=0.09174, over 4264046.73 frames. ], batch size: 548, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:24:33,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-06-21 12:25:00,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.44 vs. limit=12.0 2023-06-21 12:25:29,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=963618.0, ans=0.07 2023-06-21 12:25:35,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=963678.0, ans=15.0 2023-06-21 12:25:50,214 INFO [train.py:996] (2/4) Epoch 6, batch 8150, loss[loss=0.2368, simple_loss=0.3274, pruned_loss=0.07316, over 21597.00 frames. ], tot_loss[loss=0.2627, simple_loss=0.3363, pruned_loss=0.09457, over 4258187.03 frames. ], batch size: 263, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:26:36,236 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 3.091e+02 3.490e+02 4.370e+02 7.436e+02, threshold=6.980e+02, percent-clipped=1.0 2023-06-21 12:26:44,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=963858.0, ans=0.0 2023-06-21 12:26:48,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.67 vs. 
limit=15.0 2023-06-21 12:26:49,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=963858.0, ans=0.125 2023-06-21 12:27:27,555 INFO [train.py:996] (2/4) Epoch 6, batch 8200, loss[loss=0.2044, simple_loss=0.2696, pruned_loss=0.06955, over 21135.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3273, pruned_loss=0.09101, over 4260605.17 frames. ], batch size: 159, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:27:55,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-06-21 12:28:48,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=964218.0, ans=0.035 2023-06-21 12:29:05,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.57 vs. limit=15.0 2023-06-21 12:29:07,641 INFO [train.py:996] (2/4) Epoch 6, batch 8250, loss[loss=0.2018, simple_loss=0.307, pruned_loss=0.04836, over 19924.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3264, pruned_loss=0.0913, over 4268444.63 frames. ], batch size: 703, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:29:56,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.891e+02 3.432e+02 4.145e+02 7.025e+02, threshold=6.865e+02, percent-clipped=1.0 2023-06-21 12:30:12,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-21 12:30:20,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=964518.0, ans=0.0 2023-06-21 12:30:46,924 INFO [train.py:996] (2/4) Epoch 6, batch 8300, loss[loss=0.261, simple_loss=0.3441, pruned_loss=0.08889, over 21639.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3252, pruned_loss=0.08834, over 4265998.47 frames. ], batch size: 389, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:30:56,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=964638.0, ans=0.2 2023-06-21 12:31:53,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=964758.0, ans=0.2 2023-06-21 12:32:08,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=964818.0, ans=0.1 2023-06-21 12:32:33,475 INFO [train.py:996] (2/4) Epoch 6, batch 8350, loss[loss=0.2277, simple_loss=0.3046, pruned_loss=0.07537, over 21540.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3244, pruned_loss=0.08663, over 4272805.94 frames. ], batch size: 212, lr: 5.21e-03, grad_scale: 16.0 2023-06-21 12:32:45,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=964938.0, ans=0.0 2023-06-21 12:33:12,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=964998.0, ans=0.2 2023-06-21 12:33:16,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. 
limit=15.0 2023-06-21 12:33:17,180 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.153e+02 2.748e+02 3.092e+02 3.699e+02 5.409e+02, threshold=6.184e+02, percent-clipped=0.0 2023-06-21 12:33:34,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=965058.0, ans=0.95 2023-06-21 12:33:39,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=965118.0, ans=0.125 2023-06-21 12:33:40,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=965118.0, ans=0.125 2023-06-21 12:33:45,907 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:34:02,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=965178.0, ans=0.2 2023-06-21 12:34:14,495 INFO [train.py:996] (2/4) Epoch 6, batch 8400, loss[loss=0.1915, simple_loss=0.2658, pruned_loss=0.0586, over 21199.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3225, pruned_loss=0.08419, over 4273076.23 frames. ], batch size: 143, lr: 5.21e-03, grad_scale: 32.0 2023-06-21 12:34:35,818 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:34:58,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=965358.0, ans=0.05 2023-06-21 12:35:09,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=965418.0, ans=0.1 2023-06-21 12:35:18,078 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 12:35:42,889 INFO [train.py:996] (2/4) Epoch 6, batch 8450, loss[loss=0.2639, simple_loss=0.3251, pruned_loss=0.1013, over 21866.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3196, pruned_loss=0.08343, over 4277288.60 frames. ], batch size: 118, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:36:23,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=965598.0, ans=0.2 2023-06-21 12:36:28,781 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.529e+02 3.064e+02 3.775e+02 6.261e+02, threshold=6.127e+02, percent-clipped=1.0 2023-06-21 12:36:34,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-21 12:36:40,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-21 12:37:17,959 INFO [train.py:996] (2/4) Epoch 6, batch 8500, loss[loss=0.2045, simple_loss=0.2625, pruned_loss=0.07331, over 21265.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3154, pruned_loss=0.08477, over 4280791.40 frames. 
], batch size: 548, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:37:35,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=965838.0, ans=0.025 2023-06-21 12:37:43,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=965838.0, ans=22.5 2023-06-21 12:38:05,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=965958.0, ans=0.2 2023-06-21 12:38:09,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-21 12:38:34,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=966018.0, ans=0.1 2023-06-21 12:38:58,632 INFO [train.py:996] (2/4) Epoch 6, batch 8550, loss[loss=0.2897, simple_loss=0.3758, pruned_loss=0.1019, over 21766.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3191, pruned_loss=0.0872, over 4273045.98 frames. ], batch size: 351, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:39:06,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=966138.0, ans=0.125 2023-06-21 12:39:47,346 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.008e+02 3.313e+02 4.045e+02 7.159e+02, threshold=6.625e+02, percent-clipped=3.0 2023-06-21 12:41:11,032 INFO [train.py:996] (2/4) Epoch 6, batch 8600, loss[loss=0.2867, simple_loss=0.3531, pruned_loss=0.1101, over 21601.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3275, pruned_loss=0.08985, over 4273102.95 frames. ], batch size: 263, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:41:27,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.18 vs. limit=15.0 2023-06-21 12:41:44,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=966498.0, ans=0.125 2023-06-21 12:42:52,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=966678.0, ans=0.2 2023-06-21 12:42:55,498 INFO [train.py:996] (2/4) Epoch 6, batch 8650, loss[loss=0.2492, simple_loss=0.3459, pruned_loss=0.07628, over 21537.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3342, pruned_loss=0.09025, over 4278993.85 frames. ], batch size: 471, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:43:16,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=966798.0, ans=0.0 2023-06-21 12:43:23,699 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.785e+02 2.940e+02 3.541e+02 4.015e+02 7.663e+02, threshold=7.081e+02, percent-clipped=3.0 2023-06-21 12:43:47,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=966918.0, ans=0.07 2023-06-21 12:44:22,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=966978.0, ans=0.125 2023-06-21 12:44:29,615 INFO [train.py:996] (2/4) Epoch 6, batch 8700, loss[loss=0.2262, simple_loss=0.2849, pruned_loss=0.08375, over 21647.00 frames. 
], tot_loss[loss=0.2489, simple_loss=0.3251, pruned_loss=0.08637, over 4277923.65 frames. ], batch size: 282, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:45:08,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=967158.0, ans=0.2 2023-06-21 12:45:21,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.22 vs. limit=22.5 2023-06-21 12:46:04,008 INFO [train.py:996] (2/4) Epoch 6, batch 8750, loss[loss=0.2777, simple_loss=0.3846, pruned_loss=0.0854, over 20829.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3217, pruned_loss=0.08736, over 4279757.19 frames. ], batch size: 608, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:46:10,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=967338.0, ans=0.125 2023-06-21 12:46:34,042 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 3.061e+02 3.811e+02 4.792e+02 9.884e+02, threshold=7.621e+02, percent-clipped=4.0 2023-06-21 12:47:42,810 INFO [train.py:996] (2/4) Epoch 6, batch 8800, loss[loss=0.319, simple_loss=0.408, pruned_loss=0.115, over 19945.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3302, pruned_loss=0.09031, over 4285215.40 frames. ], batch size: 702, lr: 5.20e-03, grad_scale: 32.0 2023-06-21 12:47:43,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=967638.0, ans=0.1 2023-06-21 12:48:08,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=967698.0, ans=0.1 2023-06-21 12:48:11,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=967698.0, ans=0.125 2023-06-21 12:48:15,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=967758.0, ans=0.125 2023-06-21 12:49:03,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=967878.0, ans=0.125 2023-06-21 12:49:12,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=967878.0, ans=0.0 2023-06-21 12:49:18,419 INFO [train.py:996] (2/4) Epoch 6, batch 8850, loss[loss=0.2454, simple_loss=0.3404, pruned_loss=0.07513, over 21630.00 frames. ], tot_loss[loss=0.26, simple_loss=0.3364, pruned_loss=0.09177, over 4286350.14 frames. ], batch size: 389, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:49:19,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-06-21 12:49:29,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=22.5 2023-06-21 12:49:43,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=967998.0, ans=0.125 2023-06-21 12:49:48,829 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.170e+02 2.854e+02 3.378e+02 4.143e+02 7.151e+02, threshold=6.757e+02, percent-clipped=0.0 2023-06-21 12:50:07,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.27 vs. 
limit=15.0 2023-06-21 12:50:24,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=968118.0, ans=0.0 2023-06-21 12:50:42,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=968178.0, ans=0.125 2023-06-21 12:50:54,516 INFO [train.py:996] (2/4) Epoch 6, batch 8900, loss[loss=0.2227, simple_loss=0.283, pruned_loss=0.08119, over 21308.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3306, pruned_loss=0.09077, over 4288922.86 frames. ], batch size: 177, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:51:14,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=968298.0, ans=0.2 2023-06-21 12:52:20,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=968478.0, ans=0.1 2023-06-21 12:52:21,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.10 vs. limit=6.0 2023-06-21 12:52:27,952 INFO [train.py:996] (2/4) Epoch 6, batch 8950, loss[loss=0.2198, simple_loss=0.2866, pruned_loss=0.07646, over 21395.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3286, pruned_loss=0.09064, over 4277306.14 frames. ], batch size: 194, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:52:35,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=968538.0, ans=0.1 2023-06-21 12:53:06,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=968598.0, ans=0.1 2023-06-21 12:53:12,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.104e+02 3.637e+02 4.159e+02 7.258e+02, threshold=7.275e+02, percent-clipped=2.0 2023-06-21 12:53:54,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=968778.0, ans=0.125 2023-06-21 12:54:02,958 INFO [train.py:996] (2/4) Epoch 6, batch 9000, loss[loss=0.2321, simple_loss=0.2851, pruned_loss=0.08956, over 21620.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3234, pruned_loss=0.09114, over 4274833.65 frames. ], batch size: 282, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:54:02,958 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 12:54:25,122 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2624, simple_loss=0.3599, pruned_loss=0.08239, over 1796401.00 frames. 2023-06-21 12:54:25,122 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 12:54:26,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=968838.0, ans=0.125 2023-06-21 12:54:31,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.40 vs. limit=12.0 2023-06-21 12:54:45,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=968898.0, ans=0.2 2023-06-21 12:55:15,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=968958.0, ans=0.1 2023-06-21 12:56:01,358 INFO [train.py:996] (2/4) Epoch 6, batch 9050, loss[loss=0.2365, simple_loss=0.3198, pruned_loss=0.07665, over 21283.00 frames. 
], tot_loss[loss=0.2475, simple_loss=0.3194, pruned_loss=0.08777, over 4278224.11 frames. ], batch size: 549, lr: 5.20e-03, grad_scale: 16.0 2023-06-21 12:56:23,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=969138.0, ans=0.125 2023-06-21 12:56:41,593 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.080e+02 2.936e+02 3.440e+02 3.853e+02 8.730e+02, threshold=6.881e+02, percent-clipped=1.0 2023-06-21 12:57:09,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=969318.0, ans=0.2 2023-06-21 12:57:43,019 INFO [train.py:996] (2/4) Epoch 6, batch 9100, loss[loss=0.2498, simple_loss=0.3476, pruned_loss=0.07602, over 21639.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3266, pruned_loss=0.08994, over 4275665.54 frames. ], batch size: 441, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 12:58:03,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=969498.0, ans=0.125 2023-06-21 12:58:07,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-21 12:58:17,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=969558.0, ans=0.025 2023-06-21 12:58:36,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.59 vs. limit=15.0 2023-06-21 12:58:36,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=969618.0, ans=15.0 2023-06-21 12:59:10,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=969678.0, ans=0.035 2023-06-21 12:59:27,364 INFO [train.py:996] (2/4) Epoch 6, batch 9150, loss[loss=0.2407, simple_loss=0.3135, pruned_loss=0.08397, over 21501.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3305, pruned_loss=0.08819, over 4267804.10 frames. ], batch size: 131, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 12:59:57,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.039e+02 2.788e+02 3.229e+02 4.446e+02 7.555e+02, threshold=6.457e+02, percent-clipped=3.0 2023-06-21 13:00:05,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=969858.0, ans=0.2 2023-06-21 13:00:55,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.78 vs. limit=12.0 2023-06-21 13:01:00,378 INFO [train.py:996] (2/4) Epoch 6, batch 9200, loss[loss=0.29, simple_loss=0.3613, pruned_loss=0.1094, over 21324.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3303, pruned_loss=0.08634, over 4261536.03 frames. ], batch size: 548, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:01:05,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=970038.0, ans=0.04949747468305833 2023-06-21 13:01:16,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=970098.0, ans=0.035 2023-06-21 13:02:25,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.66 vs. 
limit=22.5 2023-06-21 13:02:35,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=970338.0, ans=0.0 2023-06-21 13:02:36,672 INFO [train.py:996] (2/4) Epoch 6, batch 9250, loss[loss=0.2579, simple_loss=0.3168, pruned_loss=0.09954, over 21448.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3331, pruned_loss=0.09034, over 4268854.13 frames. ], batch size: 131, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:02:58,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.02 vs. limit=12.0 2023-06-21 13:03:05,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. limit=15.0 2023-06-21 13:03:07,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.236e+02 3.070e+02 3.502e+02 4.094e+02 6.605e+02, threshold=7.004e+02, percent-clipped=1.0 2023-06-21 13:04:03,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=970578.0, ans=0.125 2023-06-21 13:04:13,738 INFO [train.py:996] (2/4) Epoch 6, batch 9300, loss[loss=0.3058, simple_loss=0.3647, pruned_loss=0.1234, over 21281.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3259, pruned_loss=0.09013, over 4276833.24 frames. ], batch size: 471, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:04:45,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=970698.0, ans=0.0 2023-06-21 13:05:49,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=970938.0, ans=0.2 2023-06-21 13:05:50,437 INFO [train.py:996] (2/4) Epoch 6, batch 9350, loss[loss=0.2523, simple_loss=0.321, pruned_loss=0.09185, over 21900.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3323, pruned_loss=0.09073, over 4280631.17 frames. ], batch size: 98, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:05:50,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=970938.0, ans=0.125 2023-06-21 13:06:31,420 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.002e+02 3.519e+02 4.065e+02 7.578e+02, threshold=7.038e+02, percent-clipped=1.0 2023-06-21 13:06:49,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=971058.0, ans=0.2 2023-06-21 13:07:07,068 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:07:08,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.19 vs. limit=10.0 2023-06-21 13:07:26,212 INFO [train.py:996] (2/4) Epoch 6, batch 9400, loss[loss=0.2214, simple_loss=0.2902, pruned_loss=0.07634, over 21664.00 frames. ], tot_loss[loss=0.2584, simple_loss=0.334, pruned_loss=0.0914, over 4282819.78 frames. 
], batch size: 282, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:07:52,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=971298.0, ans=0.2 2023-06-21 13:08:03,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=971298.0, ans=0.1 2023-06-21 13:08:16,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.33 vs. limit=22.5 2023-06-21 13:08:56,681 INFO [train.py:996] (2/4) Epoch 6, batch 9450, loss[loss=0.2376, simple_loss=0.2946, pruned_loss=0.09026, over 21601.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3258, pruned_loss=0.09047, over 4277004.96 frames. ], batch size: 298, lr: 5.19e-03, grad_scale: 32.0 2023-06-21 13:09:18,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=971598.0, ans=0.125 2023-06-21 13:09:22,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-06-21 13:09:41,567 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.152e+02 3.709e+02 4.839e+02 7.749e+02, threshold=7.417e+02, percent-clipped=1.0 2023-06-21 13:10:14,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=971718.0, ans=0.2 2023-06-21 13:10:25,103 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:10:31,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=971838.0, ans=0.0 2023-06-21 13:10:32,104 INFO [train.py:996] (2/4) Epoch 6, batch 9500, loss[loss=0.2382, simple_loss=0.2951, pruned_loss=0.0906, over 21537.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3172, pruned_loss=0.0884, over 4276975.73 frames. ], batch size: 414, lr: 5.19e-03, grad_scale: 8.0 2023-06-21 13:10:55,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=971898.0, ans=0.125 2023-06-21 13:11:03,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=971898.0, ans=0.0 2023-06-21 13:11:39,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=972018.0, ans=0.125 2023-06-21 13:11:55,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=972078.0, ans=0.125 2023-06-21 13:12:03,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=972078.0, ans=0.2 2023-06-21 13:12:07,947 INFO [train.py:996] (2/4) Epoch 6, batch 9550, loss[loss=0.286, simple_loss=0.3785, pruned_loss=0.09676, over 21638.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3211, pruned_loss=0.08988, over 4279302.29 frames. 
], batch size: 414, lr: 5.19e-03, grad_scale: 8.0 2023-06-21 13:12:34,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=972198.0, ans=0.0 2023-06-21 13:13:00,557 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.012e+02 2.847e+02 3.325e+02 4.189e+02 8.114e+02, threshold=6.651e+02, percent-clipped=1.0 2023-06-21 13:13:01,258 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:13:33,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972378.0, ans=0.1 2023-06-21 13:13:37,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=972378.0, ans=0.125 2023-06-21 13:13:43,376 INFO [train.py:996] (2/4) Epoch 6, batch 9600, loss[loss=0.2148, simple_loss=0.2893, pruned_loss=0.07019, over 21775.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3246, pruned_loss=0.09141, over 4286356.21 frames. ], batch size: 112, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 13:14:42,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=972558.0, ans=0.125 2023-06-21 13:14:48,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=972618.0, ans=0.0 2023-06-21 13:15:01,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=972618.0, ans=0.125 2023-06-21 13:15:07,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=972678.0, ans=0.125 2023-06-21 13:15:09,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=972678.0, ans=0.0 2023-06-21 13:15:25,387 INFO [train.py:996] (2/4) Epoch 6, batch 9650, loss[loss=0.2276, simple_loss=0.2968, pruned_loss=0.07919, over 21491.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3267, pruned_loss=0.09117, over 4287235.78 frames. ], batch size: 211, lr: 5.19e-03, grad_scale: 16.0 2023-06-21 13:15:40,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=972738.0, ans=0.05 2023-06-21 13:15:43,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=972798.0, ans=0.125 2023-06-21 13:16:07,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.44 vs. 
limit=15.0 2023-06-21 13:16:12,605 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.288e+02 2.974e+02 3.476e+02 4.202e+02 8.291e+02, threshold=6.952e+02, percent-clipped=2.0 2023-06-21 13:16:25,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=972918.0, ans=0.1 2023-06-21 13:16:43,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=972978.0, ans=0.1 2023-06-21 13:16:53,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=972978.0, ans=0.0 2023-06-21 13:17:06,067 INFO [train.py:996] (2/4) Epoch 6, batch 9700, loss[loss=0.2459, simple_loss=0.321, pruned_loss=0.08547, over 21911.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3293, pruned_loss=0.09146, over 4287076.66 frames. ], batch size: 316, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:17:12,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=973038.0, ans=0.0 2023-06-21 13:17:37,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973098.0, ans=0.1 2023-06-21 13:17:51,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=973158.0, ans=0.2 2023-06-21 13:17:53,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=973158.0, ans=0.125 2023-06-21 13:18:05,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=973218.0, ans=0.1 2023-06-21 13:18:33,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. limit=6.0 2023-06-21 13:18:35,242 INFO [train.py:996] (2/4) Epoch 6, batch 9750, loss[loss=0.2916, simple_loss=0.3707, pruned_loss=0.1062, over 21877.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3226, pruned_loss=0.09053, over 4286691.19 frames. ], batch size: 118, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:19:13,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=973398.0, ans=0.125 2023-06-21 13:19:23,217 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 2.880e+02 3.330e+02 4.100e+02 8.108e+02, threshold=6.660e+02, percent-clipped=1.0 2023-06-21 13:19:50,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=973578.0, ans=0.125 2023-06-21 13:19:56,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=973578.0, ans=0.0 2023-06-21 13:20:01,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=973578.0, ans=0.0 2023-06-21 13:20:09,972 INFO [train.py:996] (2/4) Epoch 6, batch 9800, loss[loss=0.2438, simple_loss=0.3087, pruned_loss=0.08947, over 21786.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3237, pruned_loss=0.09027, over 4287619.45 frames. 
], batch size: 441, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:20:31,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=973698.0, ans=0.0 2023-06-21 13:20:51,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.09 vs. limit=12.0 2023-06-21 13:20:58,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973758.0, ans=0.1 2023-06-21 13:21:33,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=973878.0, ans=0.125 2023-06-21 13:21:40,039 INFO [train.py:996] (2/4) Epoch 6, batch 9850, loss[loss=0.2254, simple_loss=0.2865, pruned_loss=0.08209, over 21442.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3193, pruned_loss=0.08958, over 4292847.95 frames. ], batch size: 131, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:22:23,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=973998.0, ans=0.1 2023-06-21 13:22:32,261 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.980e+02 2.858e+02 3.118e+02 3.826e+02 5.863e+02, threshold=6.237e+02, percent-clipped=0.0 2023-06-21 13:23:03,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=974178.0, ans=0.125 2023-06-21 13:23:06,168 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:23:07,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=974178.0, ans=0.5 2023-06-21 13:23:15,702 INFO [train.py:996] (2/4) Epoch 6, batch 9900, loss[loss=0.3192, simple_loss=0.3762, pruned_loss=0.1311, over 21381.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3144, pruned_loss=0.08856, over 4282809.23 frames. ], batch size: 471, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:23:16,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=974238.0, ans=0.125 2023-06-21 13:23:24,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=974238.0, ans=0.1 2023-06-21 13:24:56,240 INFO [train.py:996] (2/4) Epoch 6, batch 9950, loss[loss=0.283, simple_loss=0.3398, pruned_loss=0.1131, over 21424.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3162, pruned_loss=0.09132, over 4259726.64 frames. 
], batch size: 131, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:25:01,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=974538.0, ans=0.0 2023-06-21 13:25:16,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=974598.0, ans=0.0 2023-06-21 13:25:29,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=974598.0, ans=0.1 2023-06-21 13:25:39,901 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 2.967e+02 3.408e+02 4.209e+02 6.972e+02, threshold=6.817e+02, percent-clipped=1.0 2023-06-21 13:26:32,399 INFO [train.py:996] (2/4) Epoch 6, batch 10000, loss[loss=0.2499, simple_loss=0.3226, pruned_loss=0.08861, over 21762.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3125, pruned_loss=0.09045, over 4262430.51 frames. ], batch size: 352, lr: 5.18e-03, grad_scale: 32.0 2023-06-21 13:26:54,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=974838.0, ans=0.1 2023-06-21 13:27:00,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=974898.0, ans=0.125 2023-06-21 13:27:37,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=975018.0, ans=0.2 2023-06-21 13:28:07,389 INFO [train.py:996] (2/4) Epoch 6, batch 10050, loss[loss=0.2128, simple_loss=0.2813, pruned_loss=0.07208, over 21717.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3145, pruned_loss=0.09057, over 4264590.86 frames. ], batch size: 282, lr: 5.18e-03, grad_scale: 32.0 2023-06-21 13:28:39,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=975198.0, ans=0.125 2023-06-21 13:28:50,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=975258.0, ans=0.125 2023-06-21 13:28:51,173 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.983e+02 2.710e+02 3.231e+02 4.212e+02 7.416e+02, threshold=6.463e+02, percent-clipped=2.0 2023-06-21 13:29:44,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=975378.0, ans=0.1 2023-06-21 13:29:53,556 INFO [train.py:996] (2/4) Epoch 6, batch 10100, loss[loss=0.263, simple_loss=0.3262, pruned_loss=0.09996, over 21609.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.312, pruned_loss=0.08829, over 4258583.48 frames. ], batch size: 230, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:29:53,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=975438.0, ans=0.0 2023-06-21 13:29:58,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=975438.0, ans=0.1 2023-06-21 13:30:11,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.59 vs. limit=12.0 2023-06-21 13:30:34,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.72 vs. 
limit=10.0 2023-06-21 13:30:34,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.20 vs. limit=15.0 2023-06-21 13:31:06,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=975678.0, ans=0.1 2023-06-21 13:31:29,494 INFO [train.py:996] (2/4) Epoch 6, batch 10150, loss[loss=0.2484, simple_loss=0.316, pruned_loss=0.09046, over 21655.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3198, pruned_loss=0.09146, over 4263474.51 frames. ], batch size: 332, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:31:52,738 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:32:04,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 3.136e+02 3.616e+02 4.302e+02 7.230e+02, threshold=7.231e+02, percent-clipped=1.0 2023-06-21 13:32:38,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=975978.0, ans=0.125 2023-06-21 13:33:04,745 INFO [train.py:996] (2/4) Epoch 6, batch 10200, loss[loss=0.2543, simple_loss=0.331, pruned_loss=0.08884, over 21001.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3177, pruned_loss=0.08851, over 4266831.00 frames. ], batch size: 607, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:33:14,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=976038.0, ans=0.125 2023-06-21 13:34:33,459 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:34:40,838 INFO [train.py:996] (2/4) Epoch 6, batch 10250, loss[loss=0.199, simple_loss=0.2893, pruned_loss=0.05433, over 21378.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3122, pruned_loss=0.08233, over 4257419.94 frames. ], batch size: 211, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:35:03,241 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5 2023-06-21 13:35:21,115 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.539e+02 2.417e+02 2.778e+02 3.535e+02 6.658e+02, threshold=5.557e+02, percent-clipped=0.0 2023-06-21 13:36:18,479 INFO [train.py:996] (2/4) Epoch 6, batch 10300, loss[loss=0.2644, simple_loss=0.3456, pruned_loss=0.09153, over 21426.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3162, pruned_loss=0.08412, over 4267384.18 frames. ], batch size: 194, lr: 5.18e-03, grad_scale: 16.0 2023-06-21 13:36:20,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=976638.0, ans=0.0 2023-06-21 13:36:26,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=976638.0, ans=0.125 2023-06-21 13:36:34,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=976698.0, ans=0.1 2023-06-21 13:37:36,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=976878.0, ans=0.1 2023-06-21 13:37:51,347 INFO [train.py:996] (2/4) Epoch 6, batch 10350, loss[loss=0.2247, simple_loss=0.3087, pruned_loss=0.07034, over 21838.00 frames. 
], tot_loss[loss=0.2426, simple_loss=0.3164, pruned_loss=0.08435, over 4270947.99 frames. ], batch size: 372, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:38:39,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.28 vs. limit=22.5 2023-06-21 13:38:41,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 3.013e+02 3.424e+02 4.062e+02 6.181e+02, threshold=6.848e+02, percent-clipped=5.0 2023-06-21 13:39:27,356 INFO [train.py:996] (2/4) Epoch 6, batch 10400, loss[loss=0.1561, simple_loss=0.2115, pruned_loss=0.05038, over 21107.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3095, pruned_loss=0.08243, over 4271711.24 frames. ], batch size: 143, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:39:30,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=977238.0, ans=0.125 2023-06-21 13:39:32,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=977238.0, ans=0.1 2023-06-21 13:40:23,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=977358.0, ans=0.04949747468305833 2023-06-21 13:40:24,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=977358.0, ans=0.0 2023-06-21 13:40:52,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.92 vs. limit=6.0 2023-06-21 13:41:09,988 INFO [train.py:996] (2/4) Epoch 6, batch 10450, loss[loss=0.266, simple_loss=0.3441, pruned_loss=0.09395, over 21825.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3151, pruned_loss=0.08597, over 4262513.52 frames. ], batch size: 316, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:42:01,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.285e+02 3.743e+02 4.607e+02 9.328e+02, threshold=7.486e+02, percent-clipped=7.0 2023-06-21 13:42:52,012 INFO [train.py:996] (2/4) Epoch 6, batch 10500, loss[loss=0.2324, simple_loss=0.3031, pruned_loss=0.08085, over 21811.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3136, pruned_loss=0.08447, over 4254879.75 frames. ], batch size: 98, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:42:59,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=977838.0, ans=0.125 2023-06-21 13:43:31,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=977958.0, ans=0.5 2023-06-21 13:43:54,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=978018.0, ans=0.125 2023-06-21 13:44:27,336 INFO [train.py:996] (2/4) Epoch 6, batch 10550, loss[loss=0.2305, simple_loss=0.2901, pruned_loss=0.08548, over 21662.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3095, pruned_loss=0.08448, over 4250192.67 frames. ], batch size: 333, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:44:27,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=978138.0, ans=0.125 2023-06-21 13:45:07,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. 
limit=12.0 2023-06-21 13:45:12,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.729e+02 3.054e+02 3.524e+02 6.998e+02, threshold=6.108e+02, percent-clipped=0.0 2023-06-21 13:45:52,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=978378.0, ans=0.04949747468305833 2023-06-21 13:45:55,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=12.0 2023-06-21 13:46:02,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=978438.0, ans=0.0 2023-06-21 13:46:03,810 INFO [train.py:996] (2/4) Epoch 6, batch 10600, loss[loss=0.2048, simple_loss=0.2807, pruned_loss=0.06448, over 21737.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3042, pruned_loss=0.08248, over 4244613.28 frames. ], batch size: 316, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:46:46,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=978558.0, ans=0.07 2023-06-21 13:46:58,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=978618.0, ans=0.125 2023-06-21 13:47:04,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=978618.0, ans=0.125 2023-06-21 13:47:26,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=978678.0, ans=0.125 2023-06-21 13:47:34,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=978678.0, ans=0.125 2023-06-21 13:47:44,898 INFO [train.py:996] (2/4) Epoch 6, batch 10650, loss[loss=0.216, simple_loss=0.2961, pruned_loss=0.06796, over 21804.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3067, pruned_loss=0.08093, over 4255971.64 frames. ], batch size: 317, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:48:05,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=978798.0, ans=0.125 2023-06-21 13:48:26,908 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 2.894e+02 3.773e+02 4.928e+02 8.046e+02, threshold=7.546e+02, percent-clipped=12.0 2023-06-21 13:49:22,176 INFO [train.py:996] (2/4) Epoch 6, batch 10700, loss[loss=0.2604, simple_loss=0.3344, pruned_loss=0.09317, over 21698.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3062, pruned_loss=0.08095, over 4256547.43 frames. ], batch size: 441, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:49:44,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=979098.0, ans=10.0 2023-06-21 13:49:47,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=979098.0, ans=0.0 2023-06-21 13:50:09,225 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:50:09,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.11 vs. 
limit=15.0 2023-06-21 13:50:44,783 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 13:50:47,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=979278.0, ans=0.0 2023-06-21 13:51:05,752 INFO [train.py:996] (2/4) Epoch 6, batch 10750, loss[loss=0.29, simple_loss=0.3718, pruned_loss=0.1041, over 21407.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3173, pruned_loss=0.08577, over 4255852.92 frames. ], batch size: 211, lr: 5.17e-03, grad_scale: 16.0 2023-06-21 13:51:12,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=979338.0, ans=0.2 2023-06-21 13:51:30,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-21 13:51:42,219 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.198e+02 3.607e+02 4.478e+02 7.932e+02, threshold=7.214e+02, percent-clipped=1.0 2023-06-21 13:52:03,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=979518.0, ans=0.2 2023-06-21 13:52:09,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=979518.0, ans=0.1 2023-06-21 13:52:19,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=979518.0, ans=0.125 2023-06-21 13:52:21,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-21 13:52:34,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=979578.0, ans=0.125 2023-06-21 13:52:43,756 INFO [train.py:996] (2/4) Epoch 6, batch 10800, loss[loss=0.3235, simple_loss=0.4375, pruned_loss=0.1047, over 19848.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3214, pruned_loss=0.08644, over 4252356.86 frames. ], batch size: 702, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:52:44,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=979638.0, ans=0.2 2023-06-21 13:52:51,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979638.0, ans=0.0 2023-06-21 13:53:18,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=979758.0, ans=0.1 2023-06-21 13:53:38,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=979818.0, ans=0.0 2023-06-21 13:53:57,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=979818.0, ans=0.125 2023-06-21 13:54:03,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=979878.0, ans=0.2 2023-06-21 13:54:04,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.50 vs. 
limit=22.5 2023-06-21 13:54:09,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=979878.0, ans=0.0 2023-06-21 13:54:14,991 INFO [train.py:996] (2/4) Epoch 6, batch 10850, loss[loss=0.2086, simple_loss=0.2761, pruned_loss=0.07057, over 20700.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3241, pruned_loss=0.08701, over 4255394.81 frames. ], batch size: 608, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:54:39,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-21 13:55:06,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 2.788e+02 3.255e+02 3.917e+02 5.822e+02, threshold=6.509e+02, percent-clipped=0.0 2023-06-21 13:55:17,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-21 13:55:49,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=22.5 2023-06-21 13:55:51,257 INFO [train.py:996] (2/4) Epoch 6, batch 10900, loss[loss=0.2184, simple_loss=0.285, pruned_loss=0.07585, over 21212.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3169, pruned_loss=0.08541, over 4253417.20 frames. ], batch size: 143, lr: 5.17e-03, grad_scale: 32.0 2023-06-21 13:56:13,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=980238.0, ans=0.125 2023-06-21 13:56:52,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980418.0, ans=0.1 2023-06-21 13:56:56,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.34 vs. limit=15.0 2023-06-21 13:57:07,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=980418.0, ans=0.1 2023-06-21 13:57:12,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=980478.0, ans=0.125 2023-06-21 13:57:23,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=980478.0, ans=0.125 2023-06-21 13:57:25,449 INFO [train.py:996] (2/4) Epoch 6, batch 10950, loss[loss=0.1895, simple_loss=0.2602, pruned_loss=0.05945, over 21618.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3105, pruned_loss=0.08302, over 4247972.27 frames. ], batch size: 298, lr: 5.16e-03, grad_scale: 32.0 2023-06-21 13:57:51,191 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. 
limit=22.5 2023-06-21 13:57:53,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=980598.0, ans=0.125 2023-06-21 13:58:15,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.774e+02 3.261e+02 3.678e+02 5.101e+02, threshold=6.522e+02, percent-clipped=0.0 2023-06-21 13:58:22,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=980658.0, ans=0.125 2023-06-21 13:58:59,435 INFO [train.py:996] (2/4) Epoch 6, batch 11000, loss[loss=0.2317, simple_loss=0.2961, pruned_loss=0.08364, over 21534.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3095, pruned_loss=0.0836, over 4253687.28 frames. ], batch size: 195, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 13:59:27,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=980898.0, ans=0.125 2023-06-21 14:00:36,195 INFO [train.py:996] (2/4) Epoch 6, batch 11050, loss[loss=0.2385, simple_loss=0.2915, pruned_loss=0.09278, over 21673.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.307, pruned_loss=0.08529, over 4260608.56 frames. ], batch size: 393, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:00:48,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-21 14:01:28,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.910e+02 3.183e+02 3.721e+02 5.949e+02, threshold=6.365e+02, percent-clipped=0.0 2023-06-21 14:01:36,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=981318.0, ans=0.0 2023-06-21 14:01:42,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-21 14:01:43,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=981318.0, ans=0.2 2023-06-21 14:02:10,604 INFO [train.py:996] (2/4) Epoch 6, batch 11100, loss[loss=0.2525, simple_loss=0.3081, pruned_loss=0.09846, over 21503.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3073, pruned_loss=0.08621, over 4251156.21 frames. ], batch size: 195, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:02:48,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=981498.0, ans=0.0 2023-06-21 14:02:59,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=981558.0, ans=0.125 2023-06-21 14:03:41,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=981678.0, ans=0.125 2023-06-21 14:03:48,515 INFO [train.py:996] (2/4) Epoch 6, batch 11150, loss[loss=0.2789, simple_loss=0.361, pruned_loss=0.09838, over 20656.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3063, pruned_loss=0.08626, over 4258760.13 frames. 
], batch size: 607, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:04:14,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=981798.0, ans=0.125 2023-06-21 14:04:19,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=981798.0, ans=0.0 2023-06-21 14:04:30,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=981798.0, ans=0.125 2023-06-21 14:04:41,517 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 2.660e+02 3.131e+02 3.688e+02 5.663e+02, threshold=6.262e+02, percent-clipped=0.0 2023-06-21 14:04:55,955 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:05:18,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-21 14:05:24,956 INFO [train.py:996] (2/4) Epoch 6, batch 11200, loss[loss=0.2234, simple_loss=0.2876, pruned_loss=0.07961, over 21541.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3059, pruned_loss=0.08619, over 4267268.16 frames. ], batch size: 414, lr: 5.16e-03, grad_scale: 32.0 2023-06-21 14:05:33,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=982038.0, ans=0.1 2023-06-21 14:05:46,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=982098.0, ans=0.05 2023-06-21 14:05:51,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=982098.0, ans=0.1 2023-06-21 14:06:12,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=982158.0, ans=0.125 2023-06-21 14:06:57,263 INFO [train.py:996] (2/4) Epoch 6, batch 11250, loss[loss=0.2382, simple_loss=0.3091, pruned_loss=0.08368, over 20167.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3048, pruned_loss=0.08551, over 4266156.25 frames. ], batch size: 702, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:07:24,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=982398.0, ans=0.125 2023-06-21 14:07:46,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.118e+02 2.620e+02 2.906e+02 3.338e+02 5.205e+02, threshold=5.813e+02, percent-clipped=0.0 2023-06-21 14:08:13,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=982578.0, ans=0.1 2023-06-21 14:08:13,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=982578.0, ans=0.2 2023-06-21 14:08:28,286 INFO [train.py:996] (2/4) Epoch 6, batch 11300, loss[loss=0.2488, simple_loss=0.315, pruned_loss=0.09134, over 21918.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3062, pruned_loss=0.0854, over 4271016.10 frames. 
], batch size: 316, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:08:49,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=982698.0, ans=0.2 2023-06-21 14:09:29,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=982818.0, ans=0.125 2023-06-21 14:09:32,715 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:09:51,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=982878.0, ans=0.125 2023-06-21 14:10:00,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=982878.0, ans=0.0 2023-06-21 14:10:03,376 INFO [train.py:996] (2/4) Epoch 6, batch 11350, loss[loss=0.2247, simple_loss=0.2978, pruned_loss=0.0758, over 21457.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3077, pruned_loss=0.08474, over 4268196.25 frames. ], batch size: 195, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:10:08,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=982938.0, ans=0.125 2023-06-21 14:10:22,042 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.31 vs. limit=10.0 2023-06-21 14:10:53,915 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.216e+02 2.788e+02 3.178e+02 3.739e+02 7.652e+02, threshold=6.355e+02, percent-clipped=2.0 2023-06-21 14:11:17,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.71 vs. limit=15.0 2023-06-21 14:11:35,962 INFO [train.py:996] (2/4) Epoch 6, batch 11400, loss[loss=0.2234, simple_loss=0.3014, pruned_loss=0.07269, over 21359.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3144, pruned_loss=0.08786, over 4269911.14 frames. ], batch size: 194, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:12:09,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=15.0 2023-06-21 14:12:13,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=983298.0, ans=0.125 2023-06-21 14:12:18,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=983298.0, ans=0.125 2023-06-21 14:13:00,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-21 14:13:02,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=983478.0, ans=0.1 2023-06-21 14:13:18,813 INFO [train.py:996] (2/4) Epoch 6, batch 11450, loss[loss=0.2409, simple_loss=0.314, pruned_loss=0.08391, over 21585.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3163, pruned_loss=0.08731, over 4271486.24 frames. 
], batch size: 263, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:13:37,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=983598.0, ans=0.125 2023-06-21 14:13:47,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=22.5 2023-06-21 14:14:03,044 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.362e+02 2.838e+02 3.448e+02 4.254e+02 7.137e+02, threshold=6.896e+02, percent-clipped=4.0 2023-06-21 14:14:55,047 INFO [train.py:996] (2/4) Epoch 6, batch 11500, loss[loss=0.2992, simple_loss=0.3664, pruned_loss=0.116, over 21755.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3196, pruned_loss=0.08856, over 4276571.39 frames. ], batch size: 124, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:15:09,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=983838.0, ans=0.125 2023-06-21 14:15:12,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=8.0 2023-06-21 14:15:18,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=983898.0, ans=0.125 2023-06-21 14:15:59,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=984018.0, ans=0.1 2023-06-21 14:16:37,029 INFO [train.py:996] (2/4) Epoch 6, batch 11550, loss[loss=0.4145, simple_loss=0.4927, pruned_loss=0.1681, over 21489.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3274, pruned_loss=0.08918, over 4276883.37 frames. ], batch size: 507, lr: 5.16e-03, grad_scale: 16.0 2023-06-21 14:16:39,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=984138.0, ans=0.0 2023-06-21 14:17:23,350 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.256e+02 2.955e+02 3.350e+02 4.139e+02 7.597e+02, threshold=6.701e+02, percent-clipped=2.0 2023-06-21 14:17:37,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=984318.0, ans=0.2 2023-06-21 14:17:48,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=984318.0, ans=0.125 2023-06-21 14:18:01,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=984378.0, ans=0.0 2023-06-21 14:18:09,014 INFO [train.py:996] (2/4) Epoch 6, batch 11600, loss[loss=0.2557, simple_loss=0.3473, pruned_loss=0.08211, over 21343.00 frames. ], tot_loss[loss=0.2602, simple_loss=0.3399, pruned_loss=0.09023, over 4271786.49 frames. ], batch size: 131, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:18:31,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=984498.0, ans=0.1 2023-06-21 14:19:20,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=984618.0, ans=0.2 2023-06-21 14:19:25,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=984618.0, ans=0.1 2023-06-21 14:19:45,204 INFO [train.py:996] (2/4) Epoch 6, batch 11650, loss[loss=0.2645, simple_loss=0.3431, pruned_loss=0.0929, over 21448.00 frames. 
], tot_loss[loss=0.2637, simple_loss=0.346, pruned_loss=0.09068, over 4270908.34 frames. ], batch size: 211, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:20:29,595 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.200e+02 3.000e+02 3.616e+02 4.303e+02 7.688e+02, threshold=7.232e+02, percent-clipped=3.0 2023-06-21 14:20:35,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=984858.0, ans=0.2 2023-06-21 14:20:44,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=984918.0, ans=0.0 2023-06-21 14:20:53,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.89 vs. limit=12.0 2023-06-21 14:21:21,188 INFO [train.py:996] (2/4) Epoch 6, batch 11700, loss[loss=0.2522, simple_loss=0.3057, pruned_loss=0.0994, over 21881.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3372, pruned_loss=0.09064, over 4271322.84 frames. ], batch size: 373, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:22:17,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=985158.0, ans=0.125 2023-06-21 14:22:56,809 INFO [train.py:996] (2/4) Epoch 6, batch 11750, loss[loss=0.2894, simple_loss=0.3403, pruned_loss=0.1192, over 21367.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3282, pruned_loss=0.08933, over 4257301.91 frames. ], batch size: 471, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:22:58,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=985338.0, ans=0.0 2023-06-21 14:23:11,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=985338.0, ans=0.0 2023-06-21 14:23:19,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=985398.0, ans=0.125 2023-06-21 14:23:57,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.131e+02 2.950e+02 3.559e+02 4.361e+02 6.685e+02, threshold=7.118e+02, percent-clipped=0.0 2023-06-21 14:24:29,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=985578.0, ans=0.0 2023-06-21 14:24:33,578 INFO [train.py:996] (2/4) Epoch 6, batch 11800, loss[loss=0.2516, simple_loss=0.3276, pruned_loss=0.08776, over 21504.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3286, pruned_loss=0.09088, over 4258798.94 frames. 
], batch size: 131, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:24:34,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=985638.0, ans=0.125 2023-06-21 14:24:37,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=985638.0, ans=0.125 2023-06-21 14:25:42,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=985818.0, ans=0.0 2023-06-21 14:25:42,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=985818.0, ans=0.125 2023-06-21 14:25:51,592 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:26:00,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-21 14:26:14,995 INFO [train.py:996] (2/4) Epoch 6, batch 11850, loss[loss=0.2334, simple_loss=0.3275, pruned_loss=0.06971, over 21818.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3308, pruned_loss=0.09033, over 4259325.47 frames. ], batch size: 371, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:27:10,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.053e+02 2.768e+02 3.132e+02 3.956e+02 6.532e+02, threshold=6.263e+02, percent-clipped=0.0 2023-06-21 14:27:25,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=986118.0, ans=0.2 2023-06-21 14:27:38,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-21 14:27:39,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=986178.0, ans=0.0 2023-06-21 14:27:50,726 INFO [train.py:996] (2/4) Epoch 6, batch 11900, loss[loss=0.2548, simple_loss=0.3533, pruned_loss=0.07811, over 19726.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3324, pruned_loss=0.08796, over 4258106.51 frames. 
], batch size: 702, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:28:10,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=986298.0, ans=0.125 2023-06-21 14:28:27,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=986298.0, ans=0.125 2023-06-21 14:28:38,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=986358.0, ans=0.125 2023-06-21 14:28:53,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=986418.0, ans=0.125 2023-06-21 14:29:01,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=986418.0, ans=0.125 2023-06-21 14:29:01,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=986418.0, ans=0.2 2023-06-21 14:29:21,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=986478.0, ans=0.0 2023-06-21 14:29:27,038 INFO [train.py:996] (2/4) Epoch 6, batch 11950, loss[loss=0.246, simple_loss=0.3442, pruned_loss=0.07391, over 21668.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.33, pruned_loss=0.08411, over 4262468.76 frames. ], batch size: 247, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:30:14,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=986658.0, ans=0.0 2023-06-21 14:30:23,374 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.092e+02 2.615e+02 3.254e+02 4.193e+02 8.163e+02, threshold=6.508e+02, percent-clipped=5.0 2023-06-21 14:31:03,748 INFO [train.py:996] (2/4) Epoch 6, batch 12000, loss[loss=0.2136, simple_loss=0.2784, pruned_loss=0.07442, over 21570.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3254, pruned_loss=0.08289, over 4255638.53 frames. ], batch size: 263, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:31:03,748 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 14:31:23,361 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2642, simple_loss=0.3586, pruned_loss=0.08492, over 1796401.00 frames. 2023-06-21 14:31:23,361 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 14:31:31,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=986838.0, ans=0.125 2023-06-21 14:31:36,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=986838.0, ans=0.2 2023-06-21 14:32:15,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=986958.0, ans=0.2 2023-06-21 14:32:40,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=987078.0, ans=0.125 2023-06-21 14:33:01,542 INFO [train.py:996] (2/4) Epoch 6, batch 12050, loss[loss=0.2541, simple_loss=0.3126, pruned_loss=0.09777, over 21646.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3222, pruned_loss=0.08538, over 4251723.74 frames. 
], batch size: 195, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:33:53,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.343e+02 3.010e+02 3.446e+02 4.017e+02 8.146e+02, threshold=6.892e+02, percent-clipped=4.0 2023-06-21 14:34:43,743 INFO [train.py:996] (2/4) Epoch 6, batch 12100, loss[loss=0.2967, simple_loss=0.3725, pruned_loss=0.1104, over 21369.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3262, pruned_loss=0.08995, over 4258523.11 frames. ], batch size: 131, lr: 5.15e-03, grad_scale: 32.0 2023-06-21 14:34:57,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=987438.0, ans=0.125 2023-06-21 14:34:58,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=987438.0, ans=0.1 2023-06-21 14:35:06,603 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:36:18,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=987678.0, ans=0.2 2023-06-21 14:36:21,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=987678.0, ans=0.125 2023-06-21 14:36:24,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=987678.0, ans=0.2 2023-06-21 14:36:27,296 INFO [train.py:996] (2/4) Epoch 6, batch 12150, loss[loss=0.2524, simple_loss=0.3415, pruned_loss=0.08164, over 21852.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.329, pruned_loss=0.0887, over 4254861.51 frames. ], batch size: 316, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:36:34,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.14 vs. limit=15.0 2023-06-21 14:36:46,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=987798.0, ans=0.125 2023-06-21 14:36:47,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=987798.0, ans=0.125 2023-06-21 14:37:19,844 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.338e+02 3.020e+02 3.614e+02 4.015e+02 8.551e+02, threshold=7.228e+02, percent-clipped=5.0 2023-06-21 14:37:38,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=987918.0, ans=0.09899494936611666 2023-06-21 14:38:01,608 INFO [train.py:996] (2/4) Epoch 6, batch 12200, loss[loss=0.2883, simple_loss=0.3267, pruned_loss=0.125, over 21353.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3252, pruned_loss=0.08858, over 4260300.67 frames. ], batch size: 508, lr: 5.15e-03, grad_scale: 16.0 2023-06-21 14:38:06,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=988038.0, ans=0.125 2023-06-21 14:38:42,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=988158.0, ans=0.95 2023-06-21 14:39:09,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=988218.0, ans=0.1 2023-06-21 14:39:36,220 INFO [train.py:996] (2/4) Epoch 6, batch 12250, loss[loss=0.1665, simple_loss=0.2384, pruned_loss=0.04726, over 21757.00 frames. 
], tot_loss[loss=0.2441, simple_loss=0.3171, pruned_loss=0.0856, over 4265622.86 frames. ], batch size: 112, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:40:12,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=988458.0, ans=0.1 2023-06-21 14:40:22,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.523e+02 2.454e+02 2.966e+02 3.954e+02 7.953e+02, threshold=5.931e+02, percent-clipped=3.0 2023-06-21 14:40:24,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=988518.0, ans=0.125 2023-06-21 14:41:01,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=988578.0, ans=0.125 2023-06-21 14:41:04,569 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 14:41:10,041 INFO [train.py:996] (2/4) Epoch 6, batch 12300, loss[loss=0.1931, simple_loss=0.2632, pruned_loss=0.06148, over 21153.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3086, pruned_loss=0.07916, over 4266571.24 frames. ], batch size: 143, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:41:49,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=988758.0, ans=0.025 2023-06-21 14:41:54,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-21 14:41:57,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=988758.0, ans=0.125 2023-06-21 14:42:05,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.52 vs. limit=22.5 2023-06-21 14:42:44,767 INFO [train.py:996] (2/4) Epoch 6, batch 12350, loss[loss=0.2761, simple_loss=0.354, pruned_loss=0.09911, over 21720.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3155, pruned_loss=0.08091, over 4275489.01 frames. ], batch size: 389, lr: 5.14e-03, grad_scale: 16.0 2023-06-21 14:42:58,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=988998.0, ans=0.0 2023-06-21 14:43:36,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.794e+02 2.741e+02 3.281e+02 4.325e+02 6.278e+02, threshold=6.562e+02, percent-clipped=1.0 2023-06-21 14:44:04,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.68 vs. limit=22.5 2023-06-21 14:44:06,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=989178.0, ans=0.2 2023-06-21 14:44:11,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=989178.0, ans=0.0 2023-06-21 14:44:11,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=989178.0, ans=0.5 2023-06-21 14:44:18,043 INFO [train.py:996] (2/4) Epoch 6, batch 12400, loss[loss=0.2222, simple_loss=0.3014, pruned_loss=0.07148, over 21832.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3179, pruned_loss=0.08468, over 4276374.97 frames. 
], batch size: 298, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:45:15,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=15.0 2023-06-21 14:45:30,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=989418.0, ans=0.0 2023-06-21 14:45:52,840 INFO [train.py:996] (2/4) Epoch 6, batch 12450, loss[loss=0.3039, simple_loss=0.3668, pruned_loss=0.1204, over 21605.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3211, pruned_loss=0.08825, over 4280735.64 frames. ], batch size: 389, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:45:53,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=989538.0, ans=0.0 2023-06-21 14:45:59,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=989538.0, ans=0.0 2023-06-21 14:46:49,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=989658.0, ans=0.1 2023-06-21 14:46:55,549 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.846e+02 3.218e+02 3.959e+02 6.466e+02, threshold=6.436e+02, percent-clipped=0.0 2023-06-21 14:47:16,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=989778.0, ans=0.125 2023-06-21 14:47:19,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=989778.0, ans=0.125 2023-06-21 14:47:35,193 INFO [train.py:996] (2/4) Epoch 6, batch 12500, loss[loss=0.2735, simple_loss=0.371, pruned_loss=0.08794, over 21643.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3333, pruned_loss=0.09256, over 4279321.59 frames. ], batch size: 263, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:49:09,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=990078.0, ans=0.125 2023-06-21 14:49:11,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=990078.0, ans=0.05 2023-06-21 14:49:19,717 INFO [train.py:996] (2/4) Epoch 6, batch 12550, loss[loss=0.1857, simple_loss=0.2261, pruned_loss=0.07269, over 19991.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3356, pruned_loss=0.09382, over 4276479.30 frames. ], batch size: 703, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:49:23,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=990138.0, ans=0.125 2023-06-21 14:49:23,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.09 vs. 
limit=22.5 2023-06-21 14:49:49,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=990198.0, ans=0.1 2023-06-21 14:49:57,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=990198.0, ans=0.05 2023-06-21 14:50:00,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=990258.0, ans=0.125 2023-06-21 14:50:12,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 2.946e+02 3.555e+02 3.995e+02 6.725e+02, threshold=7.110e+02, percent-clipped=1.0 2023-06-21 14:50:55,496 INFO [train.py:996] (2/4) Epoch 6, batch 12600, loss[loss=0.2263, simple_loss=0.3082, pruned_loss=0.0722, over 21621.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.334, pruned_loss=0.09076, over 4274486.93 frames. ], batch size: 230, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:51:10,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=990438.0, ans=0.0 2023-06-21 14:51:11,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=990438.0, ans=0.07 2023-06-21 14:51:20,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=990498.0, ans=0.1 2023-06-21 14:51:22,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.04 vs. limit=5.0 2023-06-21 14:51:28,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.08 vs. limit=15.0 2023-06-21 14:52:02,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=990618.0, ans=0.0 2023-06-21 14:52:25,059 INFO [train.py:996] (2/4) Epoch 6, batch 12650, loss[loss=0.266, simple_loss=0.3243, pruned_loss=0.1039, over 21883.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3249, pruned_loss=0.08631, over 4275531.37 frames. ], batch size: 316, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:52:54,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=990798.0, ans=10.0 2023-06-21 14:53:09,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=990858.0, ans=0.125 2023-06-21 14:53:16,407 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.864e+02 2.529e+02 3.007e+02 3.447e+02 6.549e+02, threshold=6.013e+02, percent-clipped=0.0 2023-06-21 14:53:20,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=990918.0, ans=0.0 2023-06-21 14:53:53,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=990978.0, ans=0.125 2023-06-21 14:54:11,998 INFO [train.py:996] (2/4) Epoch 6, batch 12700, loss[loss=0.2518, simple_loss=0.3118, pruned_loss=0.09589, over 21438.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3257, pruned_loss=0.08918, over 4278279.60 frames. 
], batch size: 211, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:54:40,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=991098.0, ans=0.125 2023-06-21 14:54:40,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=991098.0, ans=0.07 2023-06-21 14:54:59,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. limit=15.0 2023-06-21 14:55:31,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.79 vs. limit=15.0 2023-06-21 14:55:48,670 INFO [train.py:996] (2/4) Epoch 6, batch 12750, loss[loss=0.2801, simple_loss=0.3511, pruned_loss=0.1045, over 20075.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3275, pruned_loss=0.08994, over 4274532.68 frames. ], batch size: 702, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:56:13,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=991398.0, ans=0.125 2023-06-21 14:56:25,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=991458.0, ans=0.1 2023-06-21 14:56:33,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=991458.0, ans=0.2 2023-06-21 14:56:36,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.913e+02 3.338e+02 4.032e+02 7.736e+02, threshold=6.676e+02, percent-clipped=3.0 2023-06-21 14:57:17,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=991578.0, ans=0.125 2023-06-21 14:57:20,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=991578.0, ans=0.125 2023-06-21 14:57:24,144 INFO [train.py:996] (2/4) Epoch 6, batch 12800, loss[loss=0.2952, simple_loss=0.3467, pruned_loss=0.1219, over 21618.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3258, pruned_loss=0.0902, over 4276612.25 frames. ], batch size: 508, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:57:24,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=991638.0, ans=0.125 2023-06-21 14:58:03,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=991758.0, ans=0.0 2023-06-21 14:58:08,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=991758.0, ans=0.2 2023-06-21 14:58:17,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-21 14:58:59,824 INFO [train.py:996] (2/4) Epoch 6, batch 12850, loss[loss=0.2587, simple_loss=0.3291, pruned_loss=0.09412, over 20674.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3283, pruned_loss=0.09123, over 4278044.41 frames. ], batch size: 607, lr: 5.14e-03, grad_scale: 32.0 2023-06-21 14:59:40,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. 
limit=10.0 2023-06-21 14:59:45,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=992058.0, ans=22.5 2023-06-21 14:59:53,149 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.045e+02 2.801e+02 3.143e+02 3.622e+02 6.427e+02, threshold=6.286e+02, percent-clipped=0.0 2023-06-21 15:00:26,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992178.0, ans=0.1 2023-06-21 15:00:36,429 INFO [train.py:996] (2/4) Epoch 6, batch 12900, loss[loss=0.2254, simple_loss=0.2937, pruned_loss=0.07851, over 21184.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3266, pruned_loss=0.08753, over 4278754.87 frames. ], batch size: 159, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:01:17,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-06-21 15:02:08,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992478.0, ans=0.1 2023-06-21 15:02:12,323 INFO [train.py:996] (2/4) Epoch 6, batch 12950, loss[loss=0.2107, simple_loss=0.2926, pruned_loss=0.0644, over 21696.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3246, pruned_loss=0.086, over 4278331.54 frames. ], batch size: 332, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:02:34,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=992538.0, ans=0.1 2023-06-21 15:03:05,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=992658.0, ans=0.1 2023-06-21 15:03:08,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=992658.0, ans=0.0 2023-06-21 15:03:11,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=992658.0, ans=0.0 2023-06-21 15:03:15,920 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.760e+02 2.940e+02 3.602e+02 4.409e+02 7.106e+02, threshold=7.204e+02, percent-clipped=2.0 2023-06-21 15:03:46,795 INFO [train.py:996] (2/4) Epoch 6, batch 13000, loss[loss=0.2237, simple_loss=0.3033, pruned_loss=0.072, over 21828.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3243, pruned_loss=0.08613, over 4282793.46 frames. ], batch size: 372, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:03:48,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=992838.0, ans=0.2 2023-06-21 15:03:51,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=992838.0, ans=0.0 2023-06-21 15:03:53,893 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.31 vs. 
limit=22.5 2023-06-21 15:05:08,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=993078.0, ans=0.95 2023-06-21 15:05:20,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=993138.0, ans=0.125 2023-06-21 15:05:21,367 INFO [train.py:996] (2/4) Epoch 6, batch 13050, loss[loss=0.2796, simple_loss=0.3426, pruned_loss=0.1083, over 21932.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3217, pruned_loss=0.08477, over 4287774.86 frames. ], batch size: 415, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:05:23,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=993138.0, ans=0.1 2023-06-21 15:05:39,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.85 vs. limit=15.0 2023-06-21 15:05:47,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=993198.0, ans=0.0 2023-06-21 15:06:07,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=993258.0, ans=0.0 2023-06-21 15:06:21,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=993258.0, ans=0.125 2023-06-21 15:06:23,889 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 2.762e+02 3.169e+02 4.003e+02 6.766e+02, threshold=6.339e+02, percent-clipped=0.0 2023-06-21 15:06:33,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=993318.0, ans=0.2 2023-06-21 15:06:45,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=993378.0, ans=0.0 2023-06-21 15:07:00,668 INFO [train.py:996] (2/4) Epoch 6, batch 13100, loss[loss=0.2902, simple_loss=0.4368, pruned_loss=0.07179, over 19634.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3229, pruned_loss=0.08478, over 4288348.82 frames. ], batch size: 702, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:07:43,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.69 vs. limit=15.0 2023-06-21 15:08:07,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=993618.0, ans=0.07 2023-06-21 15:08:43,000 INFO [train.py:996] (2/4) Epoch 6, batch 13150, loss[loss=0.2071, simple_loss=0.2658, pruned_loss=0.07421, over 21188.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3277, pruned_loss=0.08851, over 4281809.53 frames. 
], batch size: 143, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:09:03,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=993738.0, ans=0.1 2023-06-21 15:09:10,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=993798.0, ans=0.2 2023-06-21 15:09:14,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=993798.0, ans=0.0 2023-06-21 15:09:37,804 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.083e+02 3.111e+02 3.957e+02 5.293e+02 1.278e+03, threshold=7.913e+02, percent-clipped=9.0 2023-06-21 15:09:53,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=993918.0, ans=0.1 2023-06-21 15:10:13,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=993978.0, ans=0.0 2023-06-21 15:10:16,075 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=15.0 2023-06-21 15:10:27,533 INFO [train.py:996] (2/4) Epoch 6, batch 13200, loss[loss=0.2724, simple_loss=0.3361, pruned_loss=0.1044, over 21400.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3264, pruned_loss=0.08816, over 4280037.36 frames. ], batch size: 549, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:10:59,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.66 vs. limit=10.0 2023-06-21 15:11:12,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=994158.0, ans=0.1 2023-06-21 15:11:27,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=994218.0, ans=0.125 2023-06-21 15:11:27,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=994218.0, ans=0.125 2023-06-21 15:11:28,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-21 15:12:03,196 INFO [train.py:996] (2/4) Epoch 6, batch 13250, loss[loss=0.2484, simple_loss=0.3397, pruned_loss=0.07853, over 21800.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3251, pruned_loss=0.0892, over 4282072.68 frames. ], batch size: 332, lr: 5.13e-03, grad_scale: 32.0 2023-06-21 15:12:08,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=994338.0, ans=0.0 2023-06-21 15:12:10,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-21 15:12:20,380 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=15.0 2023-06-21 15:12:51,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.028e+02 2.898e+02 3.245e+02 3.819e+02 5.517e+02, threshold=6.489e+02, percent-clipped=0.0 2023-06-21 15:13:33,153 INFO [train.py:996] (2/4) Epoch 6, batch 13300, loss[loss=0.2781, simple_loss=0.3464, pruned_loss=0.1049, over 21790.00 frames. 
], tot_loss[loss=0.253, simple_loss=0.328, pruned_loss=0.08895, over 4281908.87 frames. ], batch size: 124, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:13:36,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=994638.0, ans=0.125 2023-06-21 15:13:44,665 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:13:46,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=994638.0, ans=0.2 2023-06-21 15:13:52,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=994698.0, ans=10.0 2023-06-21 15:14:08,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=994758.0, ans=0.125 2023-06-21 15:14:27,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=994758.0, ans=0.0 2023-06-21 15:14:59,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.25 vs. limit=15.0 2023-06-21 15:15:05,598 INFO [train.py:996] (2/4) Epoch 6, batch 13350, loss[loss=0.2626, simple_loss=0.3418, pruned_loss=0.09169, over 21724.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3311, pruned_loss=0.09092, over 4277081.01 frames. ], batch size: 351, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:15:06,044 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:15:15,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=994938.0, ans=0.125 2023-06-21 15:15:49,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=995058.0, ans=0.125 2023-06-21 15:16:01,504 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.076e+02 3.846e+02 4.574e+02 8.350e+02, threshold=7.691e+02, percent-clipped=3.0 2023-06-21 15:16:02,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-21 15:16:09,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=995118.0, ans=0.0 2023-06-21 15:16:31,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-21 15:16:34,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=995178.0, ans=0.0 2023-06-21 15:16:44,153 INFO [train.py:996] (2/4) Epoch 6, batch 13400, loss[loss=0.2662, simple_loss=0.3306, pruned_loss=0.1009, over 21428.00 frames. ], tot_loss[loss=0.2589, simple_loss=0.3323, pruned_loss=0.09275, over 4277163.62 frames. ], batch size: 548, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:17:22,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-21 15:17:25,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=15.0 2023-06-21 15:17:26,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.92 vs. limit=15.0 2023-06-21 15:17:33,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=995358.0, ans=0.0 2023-06-21 15:17:49,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=995358.0, ans=0.0 2023-06-21 15:17:49,799 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-21 15:17:59,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=995418.0, ans=0.2 2023-06-21 15:18:25,030 INFO [train.py:996] (2/4) Epoch 6, batch 13450, loss[loss=0.2718, simple_loss=0.3372, pruned_loss=0.1032, over 21739.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3345, pruned_loss=0.09486, over 4272745.01 frames. ], batch size: 441, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:18:54,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=995598.0, ans=0.0 2023-06-21 15:18:56,527 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-21 15:19:12,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=995658.0, ans=0.125 2023-06-21 15:19:19,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=995658.0, ans=0.125 2023-06-21 15:19:24,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=995658.0, ans=0.0 2023-06-21 15:19:29,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.161e+02 3.421e+02 3.980e+02 7.603e+02, threshold=6.841e+02, percent-clipped=0.0 2023-06-21 15:19:44,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=995778.0, ans=0.125 2023-06-21 15:19:47,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=995778.0, ans=0.0 2023-06-21 15:19:55,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-21 15:20:06,006 INFO [train.py:996] (2/4) Epoch 6, batch 13500, loss[loss=0.2245, simple_loss=0.2957, pruned_loss=0.07671, over 21699.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3247, pruned_loss=0.09197, over 4261586.89 frames. ], batch size: 298, lr: 5.13e-03, grad_scale: 16.0 2023-06-21 15:20:06,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=995838.0, ans=0.0 2023-06-21 15:20:16,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.83 vs. limit=6.0 2023-06-21 15:20:16,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.77 vs. 
limit=15.0 2023-06-21 15:20:52,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-21 15:20:55,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=995958.0, ans=0.015 2023-06-21 15:21:00,004 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:21:20,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=996018.0, ans=15.0 2023-06-21 15:21:43,568 INFO [train.py:996] (2/4) Epoch 6, batch 13550, loss[loss=0.2567, simple_loss=0.3605, pruned_loss=0.07644, over 21799.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3292, pruned_loss=0.09091, over 4261022.07 frames. ], batch size: 282, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:21:47,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=996138.0, ans=0.0 2023-06-21 15:22:24,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=996198.0, ans=0.125 2023-06-21 15:22:29,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=996258.0, ans=0.2 2023-06-21 15:22:29,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-21 15:22:36,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=996258.0, ans=0.125 2023-06-21 15:22:44,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.395e+02 3.030e+02 3.606e+02 4.387e+02 7.560e+02, threshold=7.212e+02, percent-clipped=4.0 2023-06-21 15:22:59,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.93 vs. limit=15.0 2023-06-21 15:23:14,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=996378.0, ans=0.0 2023-06-21 15:23:18,624 INFO [train.py:996] (2/4) Epoch 6, batch 13600, loss[loss=0.2166, simple_loss=0.3017, pruned_loss=0.06569, over 21527.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3309, pruned_loss=0.0922, over 4259551.41 frames. 
], batch size: 131, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:23:50,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=996498.0, ans=0.1 2023-06-21 15:23:53,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=996498.0, ans=0.0 2023-06-21 15:24:13,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=996558.0, ans=0.125 2023-06-21 15:24:21,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=996618.0, ans=0.125 2023-06-21 15:24:27,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=996618.0, ans=0.2 2023-06-21 15:24:35,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.82 vs. limit=6.0 2023-06-21 15:24:38,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=996678.0, ans=0.1 2023-06-21 15:24:47,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=996678.0, ans=0.0 2023-06-21 15:24:58,364 INFO [train.py:996] (2/4) Epoch 6, batch 13650, loss[loss=0.2494, simple_loss=0.3027, pruned_loss=0.09803, over 20067.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3257, pruned_loss=0.08864, over 4256451.54 frames. ], batch size: 703, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:25:36,102 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:25:51,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=996918.0, ans=0.125 2023-06-21 15:25:54,149 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.919e+02 3.475e+02 4.506e+02 7.169e+02, threshold=6.950e+02, percent-clipped=0.0 2023-06-21 15:26:32,648 INFO [train.py:996] (2/4) Epoch 6, batch 13700, loss[loss=0.2185, simple_loss=0.2725, pruned_loss=0.08225, over 21255.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3193, pruned_loss=0.08827, over 4259740.12 frames. ], batch size: 176, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:27:00,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=997098.0, ans=0.125 2023-06-21 15:27:20,944 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.94 vs. limit=22.5 2023-06-21 15:27:22,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-21 15:27:45,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=997218.0, ans=0.2 2023-06-21 15:27:45,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=997218.0, ans=0.125 2023-06-21 15:28:15,475 INFO [train.py:996] (2/4) Epoch 6, batch 13750, loss[loss=0.2638, simple_loss=0.3405, pruned_loss=0.09356, over 21628.00 frames. 
], tot_loss[loss=0.2452, simple_loss=0.3167, pruned_loss=0.08688, over 4254578.65 frames. ], batch size: 414, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:28:28,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=997338.0, ans=0.125 2023-06-21 15:28:41,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-21 15:29:15,007 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.229e+02 4.012e+02 5.672e+02 9.491e+02, threshold=8.024e+02, percent-clipped=9.0 2023-06-21 15:29:40,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=997578.0, ans=0.0 2023-06-21 15:29:51,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=997578.0, ans=0.0 2023-06-21 15:29:58,585 INFO [train.py:996] (2/4) Epoch 6, batch 13800, loss[loss=0.3299, simple_loss=0.4201, pruned_loss=0.1199, over 21670.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3222, pruned_loss=0.086, over 4261860.24 frames. ], batch size: 414, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:30:08,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=997638.0, ans=0.125 2023-06-21 15:30:20,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-21 15:30:43,111 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-21 15:30:51,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=997758.0, ans=0.125 2023-06-21 15:31:16,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=997818.0, ans=0.125 2023-06-21 15:31:35,476 INFO [train.py:996] (2/4) Epoch 6, batch 13850, loss[loss=0.3102, simple_loss=0.3959, pruned_loss=0.1123, over 21706.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3282, pruned_loss=0.0871, over 4262768.52 frames. 
], batch size: 441, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:32:07,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=997998.0, ans=0.2 2023-06-21 15:32:18,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=998058.0, ans=0.2 2023-06-21 15:32:24,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=998058.0, ans=0.125 2023-06-21 15:32:25,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=998058.0, ans=0.125 2023-06-21 15:32:36,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=998118.0, ans=0.125 2023-06-21 15:32:43,628 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.082e+02 2.953e+02 3.447e+02 4.211e+02 7.666e+02, threshold=6.893e+02, percent-clipped=0.0 2023-06-21 15:32:45,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=998118.0, ans=0.125 2023-06-21 15:32:51,543 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 15:33:10,827 INFO [train.py:996] (2/4) Epoch 6, batch 13900, loss[loss=0.2918, simple_loss=0.3883, pruned_loss=0.09767, over 20733.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.331, pruned_loss=0.09026, over 4269247.51 frames. ], batch size: 608, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:33:31,895 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.71 vs. limit=5.0 2023-06-21 15:33:41,216 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.37 vs. limit=12.0 2023-06-21 15:33:43,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=998298.0, ans=0.125 2023-06-21 15:34:33,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=998478.0, ans=0.125 2023-06-21 15:34:41,888 INFO [train.py:996] (2/4) Epoch 6, batch 13950, loss[loss=0.3109, simple_loss=0.3672, pruned_loss=0.1274, over 21619.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3328, pruned_loss=0.09323, over 4276531.21 frames. ], batch size: 471, lr: 5.12e-03, grad_scale: 8.0 2023-06-21 15:34:50,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=22.5 2023-06-21 15:35:43,663 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.434e+02 3.088e+02 3.493e+02 4.359e+02 6.535e+02, threshold=6.987e+02, percent-clipped=0.0 2023-06-21 15:36:09,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=998838.0, ans=0.125 2023-06-21 15:36:10,581 INFO [train.py:996] (2/4) Epoch 6, batch 14000, loss[loss=0.1995, simple_loss=0.2771, pruned_loss=0.06092, over 21748.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3299, pruned_loss=0.09123, over 4278321.68 frames. 
], batch size: 248, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:36:27,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=998838.0, ans=0.125 2023-06-21 15:37:12,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=999018.0, ans=0.0 2023-06-21 15:37:41,071 INFO [train.py:996] (2/4) Epoch 6, batch 14050, loss[loss=0.1954, simple_loss=0.2666, pruned_loss=0.06207, over 21426.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3257, pruned_loss=0.08735, over 4279045.29 frames. ], batch size: 211, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:37:46,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=999138.0, ans=6.0 2023-06-21 15:37:46,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-21 15:38:07,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=999198.0, ans=0.125 2023-06-21 15:38:27,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=999258.0, ans=0.125 2023-06-21 15:38:36,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=999258.0, ans=0.125 2023-06-21 15:38:48,440 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.785e+02 3.183e+02 4.255e+02 6.746e+02, threshold=6.366e+02, percent-clipped=0.0 2023-06-21 15:39:16,533 INFO [train.py:996] (2/4) Epoch 6, batch 14100, loss[loss=0.2494, simple_loss=0.3226, pruned_loss=0.08808, over 21922.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3195, pruned_loss=0.08747, over 4272386.89 frames. ], batch size: 317, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:39:56,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=999558.0, ans=0.125 2023-06-21 15:40:09,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=999558.0, ans=0.2 2023-06-21 15:40:28,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=15.0 2023-06-21 15:40:40,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=999678.0, ans=0.0 2023-06-21 15:40:49,623 INFO [train.py:996] (2/4) Epoch 6, batch 14150, loss[loss=0.2742, simple_loss=0.3569, pruned_loss=0.09572, over 21648.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.322, pruned_loss=0.08807, over 4262594.36 frames. 
], batch size: 414, lr: 5.12e-03, grad_scale: 16.0 2023-06-21 15:40:52,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=999738.0, ans=0.0 2023-06-21 15:41:47,170 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 2.804e+02 3.332e+02 4.334e+02 8.014e+02, threshold=6.664e+02, percent-clipped=2.0 2023-06-21 15:41:53,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=999918.0, ans=0.1 2023-06-21 15:42:16,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=999978.0, ans=0.07 2023-06-21 15:42:23,660 INFO [train.py:996] (2/4) Epoch 6, batch 14200, loss[loss=0.279, simple_loss=0.3428, pruned_loss=0.1076, over 21113.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3191, pruned_loss=0.08592, over 4255071.19 frames. ], batch size: 608, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:42:36,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1000038.0, ans=0.1 2023-06-21 15:43:26,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1000218.0, ans=0.0 2023-06-21 15:43:56,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1000278.0, ans=0.1 2023-06-21 15:43:58,958 INFO [train.py:996] (2/4) Epoch 6, batch 14250, loss[loss=0.2083, simple_loss=0.2735, pruned_loss=0.0716, over 21798.00 frames. ], tot_loss[loss=0.242, simple_loss=0.314, pruned_loss=0.08501, over 4254954.85 frames. ], batch size: 124, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:44:07,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-21 15:44:22,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-21 15:44:43,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1000458.0, ans=0.0 2023-06-21 15:44:53,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1000458.0, ans=0.2 2023-06-21 15:44:59,213 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.913e+02 2.649e+02 3.041e+02 3.616e+02 7.648e+02, threshold=6.082e+02, percent-clipped=1.0 2023-06-21 15:44:59,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1000518.0, ans=0.0 2023-06-21 15:45:01,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1000518.0, ans=0.125 2023-06-21 15:45:21,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1000578.0, ans=0.125 2023-06-21 15:45:35,967 INFO [train.py:996] (2/4) Epoch 6, batch 14300, loss[loss=0.2628, simple_loss=0.3558, pruned_loss=0.08488, over 21810.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3136, pruned_loss=0.08333, over 4245611.67 frames. 
], batch size: 282, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:45:39,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1000638.0, ans=0.0 2023-06-21 15:46:22,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1000758.0, ans=0.1 2023-06-21 15:46:30,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1000818.0, ans=0.125 2023-06-21 15:47:11,421 INFO [train.py:996] (2/4) Epoch 6, batch 14350, loss[loss=0.2394, simple_loss=0.3101, pruned_loss=0.08429, over 21540.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3185, pruned_loss=0.08408, over 4251251.16 frames. ], batch size: 195, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:48:18,790 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 3.231e+02 3.841e+02 4.769e+02 8.361e+02, threshold=7.683e+02, percent-clipped=10.0 2023-06-21 15:48:22,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.09 vs. limit=15.0 2023-06-21 15:48:45,996 INFO [train.py:996] (2/4) Epoch 6, batch 14400, loss[loss=0.2409, simple_loss=0.2937, pruned_loss=0.09402, over 21204.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3178, pruned_loss=0.08517, over 4260204.25 frames. ], batch size: 159, lr: 5.11e-03, grad_scale: 32.0 2023-06-21 15:48:47,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1001238.0, ans=0.1 2023-06-21 15:48:52,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.05 vs. limit=6.0 2023-06-21 15:49:18,871 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.16 vs. limit=12.0 2023-06-21 15:50:08,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1001478.0, ans=0.2 2023-06-21 15:50:20,542 INFO [train.py:996] (2/4) Epoch 6, batch 14450, loss[loss=0.2498, simple_loss=0.311, pruned_loss=0.09431, over 21784.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3115, pruned_loss=0.08563, over 4270921.54 frames. ], batch size: 112, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:50:52,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-21 15:51:03,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1001658.0, ans=0.1 2023-06-21 15:51:29,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.239e+02 2.916e+02 3.251e+02 4.168e+02 6.765e+02, threshold=6.503e+02, percent-clipped=0.0 2023-06-21 15:51:30,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1001718.0, ans=0.125 2023-06-21 15:51:43,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1001778.0, ans=0.0 2023-06-21 15:51:55,519 INFO [train.py:996] (2/4) Epoch 6, batch 14500, loss[loss=0.2097, simple_loss=0.2837, pruned_loss=0.06783, over 21850.00 frames. 
], tot_loss[loss=0.2383, simple_loss=0.3074, pruned_loss=0.08459, over 4273946.39 frames. ], batch size: 118, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:52:00,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1001838.0, ans=0.0 2023-06-21 15:52:19,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1001898.0, ans=0.2 2023-06-21 15:52:53,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1001958.0, ans=0.125 2023-06-21 15:53:10,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1002018.0, ans=0.1 2023-06-21 15:53:12,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1002018.0, ans=0.0 2023-06-21 15:53:15,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-06-21 15:53:24,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1002078.0, ans=0.125 2023-06-21 15:53:31,826 INFO [train.py:996] (2/4) Epoch 6, batch 14550, loss[loss=0.2692, simple_loss=0.3406, pruned_loss=0.09888, over 21901.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3125, pruned_loss=0.08657, over 4276618.45 frames. ], batch size: 316, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:53:36,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1002138.0, ans=0.0 2023-06-21 15:54:26,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1002258.0, ans=0.125 2023-06-21 15:54:33,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1002258.0, ans=0.125 2023-06-21 15:54:41,349 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 3.127e+02 3.885e+02 5.234e+02 1.064e+03, threshold=7.771e+02, percent-clipped=10.0 2023-06-21 15:54:56,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-21 15:54:57,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1002378.0, ans=0.125 2023-06-21 15:55:07,180 INFO [train.py:996] (2/4) Epoch 6, batch 14600, loss[loss=0.2122, simple_loss=0.2641, pruned_loss=0.08014, over 20265.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3231, pruned_loss=0.09226, over 4278007.87 frames. 
], batch size: 702, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:55:15,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1002438.0, ans=0.0 2023-06-21 15:55:25,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1002498.0, ans=0.125 2023-06-21 15:55:44,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1002498.0, ans=0.125 2023-06-21 15:55:45,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1002558.0, ans=0.125 2023-06-21 15:56:41,651 INFO [train.py:996] (2/4) Epoch 6, batch 14650, loss[loss=0.2991, simple_loss=0.3839, pruned_loss=0.1071, over 21511.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3255, pruned_loss=0.0914, over 4282315.28 frames. ], batch size: 471, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:56:48,923 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.14 vs. limit=22.5 2023-06-21 15:57:20,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1002858.0, ans=0.125 2023-06-21 15:57:50,600 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.004e+02 2.901e+02 3.727e+02 5.131e+02 9.036e+02, threshold=7.453e+02, percent-clipped=4.0 2023-06-21 15:57:57,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1002918.0, ans=0.125 2023-06-21 15:58:21,846 INFO [train.py:996] (2/4) Epoch 6, batch 14700, loss[loss=0.2126, simple_loss=0.303, pruned_loss=0.06111, over 21608.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3178, pruned_loss=0.08497, over 4282598.95 frames. ], batch size: 230, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 15:58:25,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1003038.0, ans=0.125 2023-06-21 15:58:39,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1003038.0, ans=0.1 2023-06-21 15:59:16,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1003158.0, ans=0.125 2023-06-21 15:59:18,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.89 vs. limit=10.0 2023-06-21 15:59:19,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1003158.0, ans=0.1 2023-06-21 15:59:59,096 INFO [train.py:996] (2/4) Epoch 6, batch 14750, loss[loss=0.4164, simple_loss=0.469, pruned_loss=0.1819, over 21615.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3246, pruned_loss=0.0883, over 4281685.80 frames. ], batch size: 414, lr: 5.11e-03, grad_scale: 16.0 2023-06-21 16:00:15,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1003338.0, ans=0.125 2023-06-21 16:00:41,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. 
limit=15.0 2023-06-21 16:00:52,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1003458.0, ans=0.0 2023-06-21 16:01:04,193 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.567e+02 3.041e+02 3.594e+02 4.539e+02 8.460e+02, threshold=7.189e+02, percent-clipped=3.0 2023-06-21 16:01:15,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1003518.0, ans=0.04949747468305833 2023-06-21 16:01:22,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1003578.0, ans=0.2 2023-06-21 16:01:39,219 INFO [train.py:996] (2/4) Epoch 6, batch 14800, loss[loss=0.3155, simple_loss=0.3926, pruned_loss=0.1192, over 21565.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3356, pruned_loss=0.09298, over 4280540.79 frames. ], batch size: 389, lr: 5.11e-03, grad_scale: 32.0 2023-06-21 16:01:43,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-21 16:02:07,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1003698.0, ans=0.125 2023-06-21 16:02:30,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1003758.0, ans=0.125 2023-06-21 16:02:51,528 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-21 16:03:20,684 INFO [train.py:996] (2/4) Epoch 6, batch 14850, loss[loss=0.2341, simple_loss=0.2883, pruned_loss=0.08992, over 21334.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3283, pruned_loss=0.09172, over 4281929.70 frames. ], batch size: 160, lr: 5.10e-03, grad_scale: 32.0 2023-06-21 16:03:58,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1004058.0, ans=0.0 2023-06-21 16:04:28,222 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 3.136e+02 3.769e+02 4.672e+02 7.258e+02, threshold=7.538e+02, percent-clipped=1.0 2023-06-21 16:05:03,045 INFO [train.py:996] (2/4) Epoch 6, batch 14900, loss[loss=0.2777, simple_loss=0.3499, pruned_loss=0.1028, over 21831.00 frames. ], tot_loss[loss=0.2599, simple_loss=0.333, pruned_loss=0.09335, over 4280421.65 frames. ], batch size: 124, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:05:05,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1004238.0, ans=0.125 2023-06-21 16:05:21,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1004298.0, ans=15.0 2023-06-21 16:06:22,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1004418.0, ans=0.125 2023-06-21 16:06:33,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-21 16:06:40,445 INFO [train.py:996] (2/4) Epoch 6, batch 14950, loss[loss=0.2089, simple_loss=0.2944, pruned_loss=0.06175, over 21701.00 frames. 
], tot_loss[loss=0.2591, simple_loss=0.3329, pruned_loss=0.09271, over 4275560.78 frames. ], batch size: 247, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:06:51,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1004538.0, ans=0.1 2023-06-21 16:06:55,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=22.5 2023-06-21 16:07:41,313 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 2.931e+02 3.491e+02 4.464e+02 7.538e+02, threshold=6.982e+02, percent-clipped=0.0 2023-06-21 16:07:42,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=12.0 2023-06-21 16:07:49,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1004718.0, ans=0.2 2023-06-21 16:08:03,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1004778.0, ans=0.125 2023-06-21 16:08:09,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1004838.0, ans=0.0 2023-06-21 16:08:10,578 INFO [train.py:996] (2/4) Epoch 6, batch 15000, loss[loss=0.3081, simple_loss=0.3687, pruned_loss=0.1238, over 21621.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.3351, pruned_loss=0.09455, over 4283640.27 frames. ], batch size: 389, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:08:10,579 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 16:08:19,593 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.5843, 4.2675, 4.1815, 2.6432], device='cuda:2') 2023-06-21 16:08:23,766 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8822, 3.3274, 3.4188, 3.2260], device='cuda:2') 2023-06-21 16:08:27,123 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.26, simple_loss=0.3558, pruned_loss=0.08209, over 1796401.00 frames. 2023-06-21 16:08:27,124 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 16:09:52,107 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:10:04,148 INFO [train.py:996] (2/4) Epoch 6, batch 15050, loss[loss=0.2318, simple_loss=0.2968, pruned_loss=0.08339, over 21288.00 frames. ], tot_loss[loss=0.2644, simple_loss=0.3364, pruned_loss=0.09617, over 4283251.51 frames. ], batch size: 176, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:10:35,075 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-21 16:11:05,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1005258.0, ans=0.125 2023-06-21 16:11:16,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 2.976e+02 3.378e+02 4.052e+02 9.524e+02, threshold=6.756e+02, percent-clipped=3.0 2023-06-21 16:11:44,042 INFO [train.py:996] (2/4) Epoch 6, batch 15100, loss[loss=0.2435, simple_loss=0.3169, pruned_loss=0.08501, over 20619.00 frames. 
], tot_loss[loss=0.265, simple_loss=0.3385, pruned_loss=0.09572, over 4273394.41 frames. ], batch size: 607, lr: 5.10e-03, grad_scale: 8.0 2023-06-21 16:12:09,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-21 16:12:24,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1005498.0, ans=0.5 2023-06-21 16:13:22,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1005738.0, ans=0.2 2023-06-21 16:13:23,666 INFO [train.py:996] (2/4) Epoch 6, batch 15150, loss[loss=0.2015, simple_loss=0.2636, pruned_loss=0.06972, over 15445.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3331, pruned_loss=0.09472, over 4269766.70 frames. ], batch size: 60, lr: 5.10e-03, grad_scale: 8.0 2023-06-21 16:14:15,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1005858.0, ans=0.125 2023-06-21 16:14:25,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 2.946e+02 3.329e+02 3.848e+02 7.712e+02, threshold=6.658e+02, percent-clipped=2.0 2023-06-21 16:14:34,006 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.86 vs. limit=15.0 2023-06-21 16:14:40,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1005978.0, ans=0.1 2023-06-21 16:14:44,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-21 16:14:57,059 INFO [train.py:996] (2/4) Epoch 6, batch 15200, loss[loss=0.2407, simple_loss=0.3221, pruned_loss=0.07962, over 21616.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3247, pruned_loss=0.09053, over 4269821.09 frames. ], batch size: 414, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:15:43,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1006158.0, ans=0.125 2023-06-21 16:15:52,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1006218.0, ans=0.1 2023-06-21 16:15:57,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1006218.0, ans=0.2 2023-06-21 16:16:15,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1006278.0, ans=0.2 2023-06-21 16:16:30,314 INFO [train.py:996] (2/4) Epoch 6, batch 15250, loss[loss=0.2418, simple_loss=0.3007, pruned_loss=0.09147, over 21253.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3197, pruned_loss=0.08971, over 4264266.55 frames. ], batch size: 471, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:16:59,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1006398.0, ans=0.2 2023-06-21 16:17:14,326 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.36 vs. 
limit=15.0 2023-06-21 16:17:15,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1006458.0, ans=0.125 2023-06-21 16:17:32,963 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.045e+02 3.599e+02 4.449e+02 6.735e+02, threshold=7.197e+02, percent-clipped=2.0 2023-06-21 16:18:15,102 INFO [train.py:996] (2/4) Epoch 6, batch 15300, loss[loss=0.2062, simple_loss=0.2572, pruned_loss=0.07761, over 20765.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.32, pruned_loss=0.09165, over 4270345.99 frames. ], batch size: 609, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:18:56,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1006758.0, ans=0.125 2023-06-21 16:19:31,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1006878.0, ans=0.125 2023-06-21 16:19:37,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1006878.0, ans=0.125 2023-06-21 16:19:44,621 INFO [train.py:996] (2/4) Epoch 6, batch 15350, loss[loss=0.302, simple_loss=0.3625, pruned_loss=0.1208, over 21259.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3267, pruned_loss=0.09496, over 4270751.87 frames. ], batch size: 176, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:20:21,170 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.20 vs. limit=15.0 2023-06-21 16:20:40,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 2.864e+02 3.312e+02 3.850e+02 5.534e+02, threshold=6.625e+02, percent-clipped=0.0 2023-06-21 16:21:12,486 INFO [train.py:996] (2/4) Epoch 6, batch 15400, loss[loss=0.2687, simple_loss=0.3336, pruned_loss=0.1019, over 21879.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3286, pruned_loss=0.09333, over 4255805.21 frames. ], batch size: 118, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:21:21,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1007238.0, ans=0.2 2023-06-21 16:21:26,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1007238.0, ans=0.1 2023-06-21 16:21:37,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1007298.0, ans=0.125 2023-06-21 16:21:48,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1007298.0, ans=0.0 2023-06-21 16:22:51,025 INFO [train.py:996] (2/4) Epoch 6, batch 15450, loss[loss=0.2528, simple_loss=0.3162, pruned_loss=0.09472, over 21436.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3251, pruned_loss=0.09191, over 4261692.61 frames. 
], batch size: 131, lr: 5.10e-03, grad_scale: 16.0 2023-06-21 16:22:51,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1007538.0, ans=0.1 2023-06-21 16:23:09,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1007598.0, ans=0.0 2023-06-21 16:23:15,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1007598.0, ans=0.1 2023-06-21 16:23:33,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1007658.0, ans=0.125 2023-06-21 16:23:53,556 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.992e+02 2.820e+02 3.206e+02 3.882e+02 5.798e+02, threshold=6.411e+02, percent-clipped=0.0 2023-06-21 16:24:26,139 INFO [train.py:996] (2/4) Epoch 6, batch 15500, loss[loss=0.2578, simple_loss=0.3325, pruned_loss=0.09154, over 21465.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3275, pruned_loss=0.09125, over 4251148.97 frames. ], batch size: 131, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:24:40,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1007838.0, ans=0.0 2023-06-21 16:25:00,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1007958.0, ans=0.04949747468305833 2023-06-21 16:25:10,236 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.99 vs. limit=6.0 2023-06-21 16:25:16,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1007958.0, ans=0.035 2023-06-21 16:26:05,927 INFO [train.py:996] (2/4) Epoch 6, batch 15550, loss[loss=0.2233, simple_loss=0.2863, pruned_loss=0.0801, over 21117.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3267, pruned_loss=0.09007, over 4253736.36 frames. ], batch size: 143, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:26:31,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1008198.0, ans=0.125 2023-06-21 16:27:08,316 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.337e+02 2.789e+02 3.164e+02 3.648e+02 6.720e+02, threshold=6.328e+02, percent-clipped=2.0 2023-06-21 16:27:13,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1008318.0, ans=0.0 2023-06-21 16:27:39,800 INFO [train.py:996] (2/4) Epoch 6, batch 15600, loss[loss=0.3145, simple_loss=0.3766, pruned_loss=0.1262, over 21449.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3212, pruned_loss=0.08833, over 4239702.35 frames. ], batch size: 508, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:28:09,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1008558.0, ans=0.0 2023-06-21 16:28:58,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1008678.0, ans=0.0 2023-06-21 16:29:11,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. 
limit=15.0 2023-06-21 16:29:13,703 INFO [train.py:996] (2/4) Epoch 6, batch 15650, loss[loss=0.2511, simple_loss=0.306, pruned_loss=0.09815, over 21662.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3192, pruned_loss=0.08747, over 4234202.86 frames. ], batch size: 282, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:29:18,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1008738.0, ans=0.125 2023-06-21 16:29:22,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.22 vs. limit=6.0 2023-06-21 16:30:15,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.365e+02 3.062e+02 3.560e+02 4.421e+02 6.753e+02, threshold=7.119e+02, percent-clipped=3.0 2023-06-21 16:30:47,538 INFO [train.py:996] (2/4) Epoch 6, batch 15700, loss[loss=0.2072, simple_loss=0.2694, pruned_loss=0.07249, over 21856.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3156, pruned_loss=0.08724, over 4239847.01 frames. ], batch size: 107, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:30:50,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1009038.0, ans=0.1 2023-06-21 16:30:53,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1009038.0, ans=0.0 2023-06-21 16:30:55,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1009038.0, ans=0.125 2023-06-21 16:31:47,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1009218.0, ans=0.125 2023-06-21 16:32:21,336 INFO [train.py:996] (2/4) Epoch 6, batch 15750, loss[loss=0.2385, simple_loss=0.3091, pruned_loss=0.08397, over 21796.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3113, pruned_loss=0.08746, over 4239819.68 frames. ], batch size: 317, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:33:08,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1009458.0, ans=0.05 2023-06-21 16:33:09,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1009518.0, ans=0.125 2023-06-21 16:33:23,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1009518.0, ans=0.2 2023-06-21 16:33:24,861 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 2.861e+02 3.404e+02 4.012e+02 5.531e+02, threshold=6.808e+02, percent-clipped=0.0 2023-06-21 16:33:55,157 INFO [train.py:996] (2/4) Epoch 6, batch 15800, loss[loss=0.218, simple_loss=0.28, pruned_loss=0.07799, over 21734.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3066, pruned_loss=0.08653, over 4246540.10 frames. ], batch size: 124, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:34:07,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1009638.0, ans=0.2 2023-06-21 16:34:15,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1009698.0, ans=0.0 2023-06-21 16:34:35,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=8.03 vs. 
limit=15.0 2023-06-21 16:35:17,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1009878.0, ans=0.04949747468305833 2023-06-21 16:35:29,281 INFO [train.py:996] (2/4) Epoch 6, batch 15850, loss[loss=0.2008, simple_loss=0.2673, pruned_loss=0.06719, over 20724.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3097, pruned_loss=0.08907, over 4248666.36 frames. ], batch size: 608, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:35:51,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1009998.0, ans=0.1 2023-06-21 16:35:59,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1010058.0, ans=0.0 2023-06-21 16:36:14,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1010058.0, ans=0.125 2023-06-21 16:36:28,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.45 vs. limit=15.0 2023-06-21 16:36:32,241 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 2.973e+02 3.336e+02 4.018e+02 6.867e+02, threshold=6.671e+02, percent-clipped=1.0 2023-06-21 16:37:02,613 INFO [train.py:996] (2/4) Epoch 6, batch 15900, loss[loss=0.2527, simple_loss=0.3256, pruned_loss=0.08991, over 21370.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3076, pruned_loss=0.08891, over 4258611.29 frames. ], batch size: 194, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:37:09,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1010238.0, ans=0.125 2023-06-21 16:38:16,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1010478.0, ans=0.125 2023-06-21 16:38:35,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1010538.0, ans=0.125 2023-06-21 16:38:36,307 INFO [train.py:996] (2/4) Epoch 6, batch 15950, loss[loss=0.1698, simple_loss=0.266, pruned_loss=0.03685, over 21762.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3093, pruned_loss=0.08623, over 4257747.27 frames. ], batch size: 298, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:38:41,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1010538.0, ans=0.0 2023-06-21 16:38:47,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1010538.0, ans=0.125 2023-06-21 16:39:11,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1010658.0, ans=0.0 2023-06-21 16:39:34,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1010718.0, ans=0.0 2023-06-21 16:39:40,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.945e+02 2.630e+02 3.032e+02 3.635e+02 5.664e+02, threshold=6.064e+02, percent-clipped=0.0 2023-06-21 16:40:10,505 INFO [train.py:996] (2/4) Epoch 6, batch 16000, loss[loss=0.2745, simple_loss=0.3493, pruned_loss=0.09983, over 21769.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3123, pruned_loss=0.08491, over 4267553.09 frames. 
], batch size: 441, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:40:12,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1010838.0, ans=0.0 2023-06-21 16:40:18,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1010838.0, ans=0.2 2023-06-21 16:40:24,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1010898.0, ans=0.0 2023-06-21 16:41:10,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1011018.0, ans=0.2 2023-06-21 16:41:40,563 INFO [train.py:996] (2/4) Epoch 6, batch 16050, loss[loss=0.1887, simple_loss=0.2604, pruned_loss=0.05852, over 21813.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3127, pruned_loss=0.0821, over 4273265.87 frames. ], batch size: 102, lr: 5.09e-03, grad_scale: 32.0 2023-06-21 16:41:41,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-21 16:42:07,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1011198.0, ans=0.1 2023-06-21 16:42:29,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1011318.0, ans=0.0 2023-06-21 16:42:30,006 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:42:31,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1011318.0, ans=0.125 2023-06-21 16:42:39,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1011318.0, ans=0.125 2023-06-21 16:42:44,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.136e+02 2.881e+02 3.788e+02 4.822e+02 9.882e+02, threshold=7.576e+02, percent-clipped=9.0 2023-06-21 16:42:46,472 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:42:46,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-06-21 16:43:13,231 INFO [train.py:996] (2/4) Epoch 6, batch 16100, loss[loss=0.3004, simple_loss=0.3511, pruned_loss=0.1248, over 21867.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3174, pruned_loss=0.0838, over 4270352.90 frames. ], batch size: 107, lr: 5.09e-03, grad_scale: 16.0 2023-06-21 16:44:42,451 INFO [train.py:996] (2/4) Epoch 6, batch 16150, loss[loss=0.2398, simple_loss=0.2985, pruned_loss=0.09051, over 21528.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3185, pruned_loss=0.08594, over 4278929.83 frames. ], batch size: 212, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:45:17,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1011858.0, ans=0.0 2023-06-21 16:45:47,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.154e+02 3.047e+02 3.537e+02 4.143e+02 9.363e+02, threshold=7.074e+02, percent-clipped=2.0 2023-06-21 16:46:16,808 INFO [train.py:996] (2/4) Epoch 6, batch 16200, loss[loss=0.2388, simple_loss=0.3032, pruned_loss=0.08721, over 21193.00 frames. 
], tot_loss[loss=0.2477, simple_loss=0.3216, pruned_loss=0.08694, over 4276308.78 frames. ], batch size: 608, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:46:23,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1012038.0, ans=0.125 2023-06-21 16:46:44,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1012098.0, ans=0.125 2023-06-21 16:46:58,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1012158.0, ans=0.125 2023-06-21 16:47:07,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1012158.0, ans=0.0 2023-06-21 16:47:10,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1012218.0, ans=0.2 2023-06-21 16:47:30,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1012278.0, ans=0.125 2023-06-21 16:47:51,839 INFO [train.py:996] (2/4) Epoch 6, batch 16250, loss[loss=0.2051, simple_loss=0.2886, pruned_loss=0.06074, over 21504.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.322, pruned_loss=0.08759, over 4278659.64 frames. ], batch size: 389, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:48:10,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1012398.0, ans=0.125 2023-06-21 16:48:18,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-21 16:48:20,089 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-21 16:48:40,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1012458.0, ans=0.0 2023-06-21 16:49:01,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.762e+02 3.334e+02 4.108e+02 7.386e+02, threshold=6.668e+02, percent-clipped=1.0 2023-06-21 16:49:20,330 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:49:23,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1012578.0, ans=0.125 2023-06-21 16:49:26,006 INFO [train.py:996] (2/4) Epoch 6, batch 16300, loss[loss=0.1889, simple_loss=0.2815, pruned_loss=0.04813, over 21643.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3148, pruned_loss=0.0834, over 4274013.03 frames. ], batch size: 263, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:49:42,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1012698.0, ans=0.0 2023-06-21 16:50:34,495 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 16:50:44,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. 
limit=6.0 2023-06-21 16:50:54,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1012878.0, ans=0.0 2023-06-21 16:50:56,869 INFO [train.py:996] (2/4) Epoch 6, batch 16350, loss[loss=0.2437, simple_loss=0.3315, pruned_loss=0.078, over 20758.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3168, pruned_loss=0.08533, over 4274573.93 frames. ], batch size: 607, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 16:51:00,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1012938.0, ans=0.1 2023-06-21 16:51:59,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1013118.0, ans=0.2 2023-06-21 16:52:05,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1013118.0, ans=0.2 2023-06-21 16:52:07,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-06-21 16:52:11,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.990e+02 2.615e+02 3.247e+02 3.873e+02 7.213e+02, threshold=6.493e+02, percent-clipped=3.0 2023-06-21 16:52:11,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013118.0, ans=0.1 2023-06-21 16:52:17,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1013178.0, ans=0.125 2023-06-21 16:52:30,800 INFO [train.py:996] (2/4) Epoch 6, batch 16400, loss[loss=0.2328, simple_loss=0.311, pruned_loss=0.07724, over 21441.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3212, pruned_loss=0.08773, over 4281505.93 frames. ], batch size: 548, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:52:44,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1013298.0, ans=0.125 2023-06-21 16:53:29,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1013418.0, ans=0.0 2023-06-21 16:53:46,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-21 16:54:04,535 INFO [train.py:996] (2/4) Epoch 6, batch 16450, loss[loss=0.2448, simple_loss=0.309, pruned_loss=0.09033, over 21477.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3217, pruned_loss=0.08926, over 4285922.63 frames. ], batch size: 211, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:55:00,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1013658.0, ans=0.125 2023-06-21 16:55:19,951 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 2.857e+02 3.262e+02 3.717e+02 6.839e+02, threshold=6.523e+02, percent-clipped=2.0 2023-06-21 16:55:29,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1013778.0, ans=0.1 2023-06-21 16:55:32,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.58 vs. 
limit=10.0 2023-06-21 16:55:39,123 INFO [train.py:996] (2/4) Epoch 6, batch 16500, loss[loss=0.2503, simple_loss=0.3738, pruned_loss=0.0634, over 19794.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.318, pruned_loss=0.08728, over 4276172.45 frames. ], batch size: 703, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:56:36,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=22.5 2023-06-21 16:56:46,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1014018.0, ans=0.125 2023-06-21 16:57:00,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-21 16:57:05,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1014078.0, ans=0.09899494936611666 2023-06-21 16:57:17,894 INFO [train.py:996] (2/4) Epoch 6, batch 16550, loss[loss=0.2825, simple_loss=0.3599, pruned_loss=0.1025, over 21598.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3179, pruned_loss=0.08494, over 4278152.16 frames. ], batch size: 389, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:57:30,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1014138.0, ans=0.125 2023-06-21 16:58:24,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1014318.0, ans=0.0 2023-06-21 16:58:26,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.53 vs. limit=22.5 2023-06-21 16:58:29,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.258e+02 2.954e+02 3.435e+02 4.498e+02 9.143e+02, threshold=6.870e+02, percent-clipped=8.0 2023-06-21 16:58:40,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1014378.0, ans=0.0 2023-06-21 16:58:44,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. limit=10.0 2023-06-21 16:58:47,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1014378.0, ans=0.1 2023-06-21 16:58:54,061 INFO [train.py:996] (2/4) Epoch 6, batch 16600, loss[loss=0.2108, simple_loss=0.3349, pruned_loss=0.04337, over 20827.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3271, pruned_loss=0.08854, over 4267551.46 frames. 
], batch size: 608, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 16:59:12,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1014438.0, ans=0.125 2023-06-21 16:59:40,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1014558.0, ans=0.125 2023-06-21 16:59:48,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1014558.0, ans=0.1 2023-06-21 17:00:03,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1014618.0, ans=0.1 2023-06-21 17:00:05,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1014618.0, ans=0.125 2023-06-21 17:00:34,747 INFO [train.py:996] (2/4) Epoch 6, batch 16650, loss[loss=0.2991, simple_loss=0.3635, pruned_loss=0.1174, over 21803.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3365, pruned_loss=0.09191, over 4266714.57 frames. ], batch size: 441, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 17:00:41,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1014738.0, ans=0.125 2023-06-21 17:00:56,074 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.72 vs. limit=15.0 2023-06-21 17:01:47,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1014918.0, ans=0.125 2023-06-21 17:01:52,123 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.123e+02 3.593e+02 4.644e+02 7.930e+02, threshold=7.186e+02, percent-clipped=2.0 2023-06-21 17:02:01,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1014978.0, ans=0.125 2023-06-21 17:02:21,303 INFO [train.py:996] (2/4) Epoch 6, batch 16700, loss[loss=0.2112, simple_loss=0.2738, pruned_loss=0.0743, over 21126.00 frames. ], tot_loss[loss=0.2626, simple_loss=0.3385, pruned_loss=0.09335, over 4257630.54 frames. ], batch size: 143, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 17:02:32,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1015038.0, ans=0.1 2023-06-21 17:02:57,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1015158.0, ans=0.1 2023-06-21 17:02:58,026 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. limit=15.0 2023-06-21 17:03:33,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1015218.0, ans=0.0 2023-06-21 17:03:52,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1015278.0, ans=0.125 2023-06-21 17:03:54,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1015278.0, ans=0.125 2023-06-21 17:03:58,828 INFO [train.py:996] (2/4) Epoch 6, batch 16750, loss[loss=0.2869, simple_loss=0.3845, pruned_loss=0.09467, over 21288.00 frames. 
], tot_loss[loss=0.2672, simple_loss=0.3418, pruned_loss=0.09634, over 4267740.35 frames. ], batch size: 549, lr: 5.08e-03, grad_scale: 16.0 2023-06-21 17:04:22,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1015398.0, ans=0.0 2023-06-21 17:04:30,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1015398.0, ans=0.0 2023-06-21 17:05:12,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.369e+02 3.506e+02 4.253e+02 6.038e+02 1.079e+03, threshold=8.506e+02, percent-clipped=10.0 2023-06-21 17:05:13,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.19 vs. limit=15.0 2023-06-21 17:05:15,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1015578.0, ans=0.0 2023-06-21 17:05:24,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1015578.0, ans=0.0 2023-06-21 17:05:30,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1015578.0, ans=0.125 2023-06-21 17:05:34,886 INFO [train.py:996] (2/4) Epoch 6, batch 16800, loss[loss=0.2476, simple_loss=0.3147, pruned_loss=0.09025, over 21813.00 frames. ], tot_loss[loss=0.2692, simple_loss=0.3452, pruned_loss=0.09661, over 4268209.71 frames. ], batch size: 282, lr: 5.08e-03, grad_scale: 32.0 2023-06-21 17:06:52,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1015878.0, ans=0.1 2023-06-21 17:07:04,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1015878.0, ans=0.0 2023-06-21 17:07:08,793 INFO [train.py:996] (2/4) Epoch 6, batch 16850, loss[loss=0.2404, simple_loss=0.3025, pruned_loss=0.08913, over 21617.00 frames. ], tot_loss[loss=0.2665, simple_loss=0.3408, pruned_loss=0.09606, over 4268096.78 frames. ], batch size: 548, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:08:21,510 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.931e+02 3.423e+02 4.482e+02 7.655e+02, threshold=6.845e+02, percent-clipped=0.0 2023-06-21 17:08:43,845 INFO [train.py:996] (2/4) Epoch 6, batch 16900, loss[loss=0.2212, simple_loss=0.2858, pruned_loss=0.07827, over 21292.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3342, pruned_loss=0.09375, over 4273214.75 frames. ], batch size: 159, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:09:31,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.35 vs. limit=15.0 2023-06-21 17:09:41,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1016418.0, ans=0.0 2023-06-21 17:10:15,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1016538.0, ans=0.125 2023-06-21 17:10:16,672 INFO [train.py:996] (2/4) Epoch 6, batch 16950, loss[loss=0.2355, simple_loss=0.3026, pruned_loss=0.0842, over 21838.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.327, pruned_loss=0.09263, over 4265962.16 frames. 
], batch size: 124, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:10:18,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1016538.0, ans=0.1 2023-06-21 17:10:21,472 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:11:19,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1016718.0, ans=0.125 2023-06-21 17:11:26,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.086e+02 2.733e+02 3.000e+02 3.564e+02 5.984e+02, threshold=6.000e+02, percent-clipped=0.0 2023-06-21 17:11:46,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1016778.0, ans=0.2 2023-06-21 17:11:50,085 INFO [train.py:996] (2/4) Epoch 6, batch 17000, loss[loss=0.2523, simple_loss=0.3206, pruned_loss=0.09194, over 21809.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3239, pruned_loss=0.09278, over 4273361.14 frames. ], batch size: 112, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:11:51,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1016838.0, ans=0.0 2023-06-21 17:11:53,609 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:12:01,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1016838.0, ans=0.0 2023-06-21 17:13:18,952 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:13:20,260 INFO [train.py:996] (2/4) Epoch 6, batch 17050, loss[loss=0.2729, simple_loss=0.3542, pruned_loss=0.09576, over 21422.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3296, pruned_loss=0.09438, over 4269104.03 frames. ], batch size: 548, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:13:31,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.65 vs. limit=15.0 2023-06-21 17:13:39,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1017138.0, ans=0.0 2023-06-21 17:13:51,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1017198.0, ans=0.0 2023-06-21 17:14:30,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.382e+02 3.084e+02 3.616e+02 4.433e+02 7.180e+02, threshold=7.232e+02, percent-clipped=5.0 2023-06-21 17:14:38,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1017378.0, ans=0.2 2023-06-21 17:14:49,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1017378.0, ans=0.125 2023-06-21 17:14:51,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1017438.0, ans=0.5 2023-06-21 17:14:52,402 INFO [train.py:996] (2/4) Epoch 6, batch 17100, loss[loss=0.2401, simple_loss=0.3053, pruned_loss=0.08748, over 21847.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3286, pruned_loss=0.09489, over 4280771.93 frames. 
], batch size: 98, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:14:54,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1017438.0, ans=0.2 2023-06-21 17:15:34,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. limit=15.0 2023-06-21 17:16:00,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1017618.0, ans=0.0 2023-06-21 17:16:02,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1017618.0, ans=0.0 2023-06-21 17:16:26,699 INFO [train.py:996] (2/4) Epoch 6, batch 17150, loss[loss=0.2155, simple_loss=0.2805, pruned_loss=0.07527, over 21425.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3241, pruned_loss=0.09423, over 4289619.37 frames. ], batch size: 131, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:17:36,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1017918.0, ans=0.0 2023-06-21 17:17:39,114 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.347e+02 2.784e+02 3.036e+02 3.616e+02 5.334e+02, threshold=6.072e+02, percent-clipped=0.0 2023-06-21 17:18:05,511 INFO [train.py:996] (2/4) Epoch 6, batch 17200, loss[loss=0.3134, simple_loss=0.3785, pruned_loss=0.1241, over 21589.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.3244, pruned_loss=0.09431, over 4292012.45 frames. ], batch size: 415, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:19:06,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=22.5 2023-06-21 17:19:38,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1018338.0, ans=0.125 2023-06-21 17:19:40,271 INFO [train.py:996] (2/4) Epoch 6, batch 17250, loss[loss=0.2631, simple_loss=0.3484, pruned_loss=0.08889, over 21432.00 frames. ], tot_loss[loss=0.2592, simple_loss=0.3277, pruned_loss=0.09535, over 4285390.98 frames. ], batch size: 131, lr: 5.07e-03, grad_scale: 32.0 2023-06-21 17:19:51,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1018338.0, ans=0.125 2023-06-21 17:20:04,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1018398.0, ans=0.125 2023-06-21 17:20:51,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1018518.0, ans=0.2 2023-06-21 17:20:54,741 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.506e+02 3.363e+02 4.058e+02 5.457e+02 1.011e+03, threshold=8.116e+02, percent-clipped=16.0 2023-06-21 17:21:09,799 INFO [train.py:996] (2/4) Epoch 6, batch 17300, loss[loss=0.2989, simple_loss=0.3613, pruned_loss=0.1183, over 21283.00 frames. ], tot_loss[loss=0.2681, simple_loss=0.3373, pruned_loss=0.09945, over 4281964.67 frames. 
], batch size: 143, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:21:13,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1018638.0, ans=0.0 2023-06-21 17:21:39,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1018698.0, ans=0.0 2023-06-21 17:21:48,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1018758.0, ans=0.125 2023-06-21 17:21:51,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1018758.0, ans=0.125 2023-06-21 17:21:54,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.20 vs. limit=15.0 2023-06-21 17:22:40,269 INFO [train.py:996] (2/4) Epoch 6, batch 17350, loss[loss=0.2556, simple_loss=0.3568, pruned_loss=0.07726, over 21245.00 frames. ], tot_loss[loss=0.2679, simple_loss=0.3379, pruned_loss=0.09893, over 4274803.42 frames. ], batch size: 548, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:22:40,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1018938.0, ans=0.2 2023-06-21 17:22:48,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1018938.0, ans=0.0 2023-06-21 17:23:05,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1018998.0, ans=10.0 2023-06-21 17:23:56,313 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.958e+02 3.315e+02 3.844e+02 7.686e+02, threshold=6.630e+02, percent-clipped=0.0 2023-06-21 17:23:56,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1019178.0, ans=0.125 2023-06-21 17:24:03,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1019178.0, ans=0.2 2023-06-21 17:24:07,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1019178.0, ans=0.125 2023-06-21 17:24:11,636 INFO [train.py:996] (2/4) Epoch 6, batch 17400, loss[loss=0.2124, simple_loss=0.2534, pruned_loss=0.08565, over 20049.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3345, pruned_loss=0.09446, over 4276003.27 frames. ], batch size: 704, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:24:12,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1019238.0, ans=0.125 2023-06-21 17:24:33,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1019298.0, ans=0.1 2023-06-21 17:25:08,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.80 vs. 
limit=10.0 2023-06-21 17:25:22,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1019418.0, ans=0.0 2023-06-21 17:25:42,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1019478.0, ans=0.1 2023-06-21 17:25:44,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-21 17:25:46,423 INFO [train.py:996] (2/4) Epoch 6, batch 17450, loss[loss=0.2146, simple_loss=0.3157, pruned_loss=0.05674, over 20736.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.33, pruned_loss=0.09116, over 4271650.02 frames. ], batch size: 608, lr: 5.07e-03, grad_scale: 16.0 2023-06-21 17:25:54,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1019538.0, ans=0.125 2023-06-21 17:26:29,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1019658.0, ans=0.02 2023-06-21 17:26:31,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1019658.0, ans=0.125 2023-06-21 17:26:43,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1019718.0, ans=0.2 2023-06-21 17:27:05,007 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.067e+02 2.736e+02 3.255e+02 3.942e+02 6.172e+02, threshold=6.510e+02, percent-clipped=0.0 2023-06-21 17:27:13,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1019778.0, ans=0.0 2023-06-21 17:27:17,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.22 vs. limit=10.0 2023-06-21 17:27:18,938 INFO [train.py:996] (2/4) Epoch 6, batch 17500, loss[loss=0.2895, simple_loss=0.3389, pruned_loss=0.1201, over 21608.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3256, pruned_loss=0.08868, over 4275490.57 frames. ], batch size: 471, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:27:24,376 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.67 vs. limit=15.0 2023-06-21 17:27:34,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1019838.0, ans=0.125 2023-06-21 17:28:24,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.07 vs. limit=22.5 2023-06-21 17:28:26,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1020018.0, ans=0.04949747468305833 2023-06-21 17:28:40,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-21 17:28:50,791 INFO [train.py:996] (2/4) Epoch 6, batch 17550, loss[loss=0.2179, simple_loss=0.3135, pruned_loss=0.06114, over 21739.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3241, pruned_loss=0.08702, over 4272817.73 frames. 
], batch size: 247, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:28:54,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1020138.0, ans=0.125 2023-06-21 17:29:02,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-21 17:29:20,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1020198.0, ans=0.1 2023-06-21 17:29:21,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-21 17:29:36,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1020258.0, ans=0.0 2023-06-21 17:30:04,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1020378.0, ans=0.2 2023-06-21 17:30:10,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1020378.0, ans=0.0 2023-06-21 17:30:11,111 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.054e+02 2.780e+02 3.218e+02 4.154e+02 6.196e+02, threshold=6.435e+02, percent-clipped=0.0 2023-06-21 17:30:24,477 INFO [train.py:996] (2/4) Epoch 6, batch 17600, loss[loss=0.258, simple_loss=0.3279, pruned_loss=0.09404, over 21640.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3254, pruned_loss=0.08633, over 4262095.97 frames. ], batch size: 230, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:30:27,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1020438.0, ans=0.0 2023-06-21 17:30:56,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.49 vs. limit=22.5 2023-06-21 17:30:56,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1020498.0, ans=0.0 2023-06-21 17:31:02,327 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=12.0 2023-06-21 17:31:09,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1020558.0, ans=0.125 2023-06-21 17:31:15,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1020558.0, ans=0.2 2023-06-21 17:31:16,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-21 17:31:16,885 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 17:31:52,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1020678.0, ans=0.0 2023-06-21 17:31:59,657 INFO [train.py:996] (2/4) Epoch 6, batch 17650, loss[loss=0.2187, simple_loss=0.294, pruned_loss=0.0717, over 21668.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3238, pruned_loss=0.0871, over 4262292.04 frames. 
], batch size: 351, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:33:20,341 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.959e+02 2.977e+02 3.448e+02 4.059e+02 7.958e+02, threshold=6.896e+02, percent-clipped=7.0 2023-06-21 17:33:47,955 INFO [train.py:996] (2/4) Epoch 6, batch 17700, loss[loss=0.2334, simple_loss=0.3147, pruned_loss=0.07607, over 21649.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3191, pruned_loss=0.08446, over 4256209.45 frames. ], batch size: 263, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:33:58,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1021038.0, ans=0.0 2023-06-21 17:34:11,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1021098.0, ans=0.125 2023-06-21 17:34:27,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-21 17:34:57,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1021278.0, ans=0.0 2023-06-21 17:35:18,347 INFO [train.py:996] (2/4) Epoch 6, batch 17750, loss[loss=0.3217, simple_loss=0.3842, pruned_loss=0.1296, over 21384.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3268, pruned_loss=0.08874, over 4258635.00 frames. ], batch size: 549, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:36:36,823 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.208e+02 2.843e+02 3.336e+02 3.898e+02 5.169e+02, threshold=6.672e+02, percent-clipped=0.0 2023-06-21 17:36:49,081 INFO [train.py:996] (2/4) Epoch 6, batch 17800, loss[loss=0.2847, simple_loss=0.3598, pruned_loss=0.1048, over 21678.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3267, pruned_loss=0.08796, over 4258231.44 frames. ], batch size: 441, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:36:58,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1021638.0, ans=0.1 2023-06-21 17:36:58,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1021638.0, ans=0.125 2023-06-21 17:37:57,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1021818.0, ans=0.0 2023-06-21 17:38:19,584 INFO [train.py:996] (2/4) Epoch 6, batch 17850, loss[loss=0.2527, simple_loss=0.3201, pruned_loss=0.09265, over 21590.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3274, pruned_loss=0.08851, over 4260482.89 frames. 
], batch size: 263, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:38:21,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1021938.0, ans=0.2 2023-06-21 17:38:35,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1021938.0, ans=0.125 2023-06-21 17:38:38,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1021998.0, ans=0.1 2023-06-21 17:38:44,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1021998.0, ans=0.125 2023-06-21 17:39:15,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1022118.0, ans=0.0 2023-06-21 17:39:37,563 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 3.033e+02 3.417e+02 4.325e+02 8.227e+02, threshold=6.834e+02, percent-clipped=5.0 2023-06-21 17:39:54,787 INFO [train.py:996] (2/4) Epoch 6, batch 17900, loss[loss=0.2966, simple_loss=0.3834, pruned_loss=0.1048, over 21866.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3312, pruned_loss=0.08976, over 4261030.14 frames. ], batch size: 371, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:39:58,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-21 17:40:48,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1022358.0, ans=0.2 2023-06-21 17:41:05,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1022418.0, ans=0.07 2023-06-21 17:41:18,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.75 vs. limit=15.0 2023-06-21 17:41:28,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1022538.0, ans=0.125 2023-06-21 17:41:29,375 INFO [train.py:996] (2/4) Epoch 6, batch 17950, loss[loss=0.2188, simple_loss=0.3133, pruned_loss=0.06221, over 21769.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3303, pruned_loss=0.0865, over 4256198.20 frames. ], batch size: 351, lr: 5.06e-03, grad_scale: 8.0 2023-06-21 17:41:37,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1022538.0, ans=0.2 2023-06-21 17:42:45,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.874e+02 2.768e+02 3.134e+02 4.074e+02 6.684e+02, threshold=6.268e+02, percent-clipped=0.0 2023-06-21 17:42:48,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1022778.0, ans=0.07 2023-06-21 17:42:50,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1022778.0, ans=0.1 2023-06-21 17:42:57,283 INFO [train.py:996] (2/4) Epoch 6, batch 18000, loss[loss=0.2069, simple_loss=0.2654, pruned_loss=0.07421, over 21290.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3212, pruned_loss=0.08415, over 4257574.41 frames. 
], batch size: 551, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:42:57,283 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 17:43:13,465 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2661, simple_loss=0.365, pruned_loss=0.08355, over 1796401.00 frames. 2023-06-21 17:43:13,466 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 17:43:24,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1022838.0, ans=0.125 2023-06-21 17:43:52,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1022958.0, ans=0.2 2023-06-21 17:43:57,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1022958.0, ans=0.1 2023-06-21 17:44:35,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1023078.0, ans=0.1 2023-06-21 17:44:42,380 INFO [train.py:996] (2/4) Epoch 6, batch 18050, loss[loss=0.2667, simple_loss=0.323, pruned_loss=0.1052, over 21762.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3154, pruned_loss=0.08357, over 4269619.27 frames. ], batch size: 371, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:45:08,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1023198.0, ans=0.1 2023-06-21 17:46:00,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1023378.0, ans=0.0 2023-06-21 17:46:01,045 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.145e+02 3.852e+02 4.625e+02 8.498e+02, threshold=7.705e+02, percent-clipped=3.0 2023-06-21 17:46:01,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1023378.0, ans=0.2 2023-06-21 17:46:18,321 INFO [train.py:996] (2/4) Epoch 6, batch 18100, loss[loss=0.262, simple_loss=0.3388, pruned_loss=0.09255, over 21332.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3221, pruned_loss=0.08637, over 4266877.80 frames. ], batch size: 159, lr: 5.06e-03, grad_scale: 16.0 2023-06-21 17:46:48,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1023498.0, ans=0.1 2023-06-21 17:47:25,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1023618.0, ans=0.125 2023-06-21 17:47:52,807 INFO [train.py:996] (2/4) Epoch 6, batch 18150, loss[loss=0.2293, simple_loss=0.2903, pruned_loss=0.0842, over 21201.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3231, pruned_loss=0.08567, over 4265912.02 frames. ], batch size: 159, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:48:22,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=11.43 vs. 
limit=15.0 2023-06-21 17:48:29,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1023798.0, ans=0.04949747468305833 2023-06-21 17:48:37,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1023858.0, ans=0.2 2023-06-21 17:48:38,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1023858.0, ans=0.2 2023-06-21 17:48:44,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1023858.0, ans=0.125 2023-06-21 17:48:56,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1023918.0, ans=0.125 2023-06-21 17:49:04,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.867e+02 3.393e+02 4.008e+02 7.433e+02, threshold=6.785e+02, percent-clipped=0.0 2023-06-21 17:49:16,095 INFO [train.py:996] (2/4) Epoch 6, batch 18200, loss[loss=0.2206, simple_loss=0.2862, pruned_loss=0.07752, over 21787.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3182, pruned_loss=0.08558, over 4233616.52 frames. ], batch size: 102, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:49:16,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1024038.0, ans=0.125 2023-06-21 17:49:24,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-21 17:49:37,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1024038.0, ans=0.0 2023-06-21 17:50:06,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=15.0 2023-06-21 17:50:20,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1024218.0, ans=0.0 2023-06-21 17:50:26,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1024218.0, ans=0.025 2023-06-21 17:50:47,202 INFO [train.py:996] (2/4) Epoch 6, batch 18250, loss[loss=0.2525, simple_loss=0.3169, pruned_loss=0.09406, over 21876.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3094, pruned_loss=0.08272, over 4239334.25 frames. ], batch size: 351, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:51:12,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-21 17:51:57,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1024518.0, ans=0.125 2023-06-21 17:52:04,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.822e+02 2.691e+02 3.171e+02 4.057e+02 6.181e+02, threshold=6.342e+02, percent-clipped=0.0 2023-06-21 17:52:13,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1024578.0, ans=0.0 2023-06-21 17:52:16,383 INFO [train.py:996] (2/4) Epoch 6, batch 18300, loss[loss=0.2571, simple_loss=0.3516, pruned_loss=0.08125, over 21736.00 frames. 
], tot_loss[loss=0.2399, simple_loss=0.312, pruned_loss=0.08393, over 4251043.03 frames. ], batch size: 298, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:52:19,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1024638.0, ans=0.2 2023-06-21 17:53:01,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1024758.0, ans=0.07 2023-06-21 17:53:49,935 INFO [train.py:996] (2/4) Epoch 6, batch 18350, loss[loss=0.232, simple_loss=0.297, pruned_loss=0.08347, over 21353.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3177, pruned_loss=0.08427, over 4255284.10 frames. ], batch size: 194, lr: 5.05e-03, grad_scale: 8.0 2023-06-21 17:54:03,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1024998.0, ans=0.125 2023-06-21 17:54:34,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1025058.0, ans=0.125 2023-06-21 17:54:58,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1025118.0, ans=0.125 2023-06-21 17:55:13,009 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.969e+02 3.614e+02 4.589e+02 8.811e+02, threshold=7.228e+02, percent-clipped=4.0 2023-06-21 17:55:21,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1025178.0, ans=6.0 2023-06-21 17:55:24,016 INFO [train.py:996] (2/4) Epoch 6, batch 18400, loss[loss=0.232, simple_loss=0.2916, pruned_loss=0.08615, over 21319.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3121, pruned_loss=0.08269, over 4247789.91 frames. ], batch size: 144, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:55:24,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1025238.0, ans=0.1 2023-06-21 17:55:58,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1025298.0, ans=0.125 2023-06-21 17:55:59,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1025298.0, ans=0.125 2023-06-21 17:56:57,263 INFO [train.py:996] (2/4) Epoch 6, batch 18450, loss[loss=0.1849, simple_loss=0.2694, pruned_loss=0.05021, over 21249.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3103, pruned_loss=0.07965, over 4253109.80 frames. ], batch size: 176, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:58:14,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1025778.0, ans=0.0 2023-06-21 17:58:20,293 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.010e+02 2.633e+02 3.038e+02 3.698e+02 5.788e+02, threshold=6.076e+02, percent-clipped=0.0 2023-06-21 17:58:30,578 INFO [train.py:996] (2/4) Epoch 6, batch 18500, loss[loss=0.1969, simple_loss=0.2649, pruned_loss=0.06438, over 21226.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3048, pruned_loss=0.07789, over 4243819.48 frames. ], batch size: 159, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 17:58:32,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. 
limit=15.0 2023-06-21 17:59:00,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1025898.0, ans=0.125 2023-06-21 17:59:43,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1026018.0, ans=0.125 2023-06-21 18:00:02,400 INFO [train.py:996] (2/4) Epoch 6, batch 18550, loss[loss=0.2229, simple_loss=0.2834, pruned_loss=0.08117, over 21454.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3018, pruned_loss=0.07664, over 4231794.78 frames. ], batch size: 132, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:00:21,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1026198.0, ans=0.05 2023-06-21 18:00:44,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1026198.0, ans=0.0 2023-06-21 18:00:45,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1026258.0, ans=0.125 2023-06-21 18:01:14,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1026318.0, ans=0.1 2023-06-21 18:01:26,095 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.169e+02 2.898e+02 3.436e+02 4.284e+02 7.618e+02, threshold=6.872e+02, percent-clipped=4.0 2023-06-21 18:01:26,501 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.112e-03 2023-06-21 18:01:36,710 INFO [train.py:996] (2/4) Epoch 6, batch 18600, loss[loss=0.1811, simple_loss=0.2556, pruned_loss=0.05336, over 21456.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3012, pruned_loss=0.07791, over 4235145.73 frames. ], batch size: 160, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:02:27,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1026558.0, ans=0.125 2023-06-21 18:02:32,210 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:02:55,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1026678.0, ans=0.1 2023-06-21 18:03:05,862 INFO [train.py:996] (2/4) Epoch 6, batch 18650, loss[loss=0.1974, simple_loss=0.2623, pruned_loss=0.06628, over 21169.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3021, pruned_loss=0.07852, over 4229487.59 frames. 
], batch size: 143, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:03:36,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1026798.0, ans=0.1 2023-06-21 18:03:50,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1026858.0, ans=0.07 2023-06-21 18:04:28,180 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.661e+02 3.021e+02 3.518e+02 6.281e+02, threshold=6.043e+02, percent-clipped=0.0 2023-06-21 18:04:33,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1026978.0, ans=0.025 2023-06-21 18:04:38,440 INFO [train.py:996] (2/4) Epoch 6, batch 18700, loss[loss=0.243, simple_loss=0.3044, pruned_loss=0.09084, over 21832.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2988, pruned_loss=0.07989, over 4240503.03 frames. ], batch size: 107, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:05:38,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027158.0, ans=0.1 2023-06-21 18:06:11,376 INFO [train.py:996] (2/4) Epoch 6, batch 18750, loss[loss=0.2878, simple_loss=0.358, pruned_loss=0.1088, over 21759.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3021, pruned_loss=0.08331, over 4257928.34 frames. ], batch size: 332, lr: 5.05e-03, grad_scale: 16.0 2023-06-21 18:06:25,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1027398.0, ans=0.125 2023-06-21 18:06:43,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=8.43 vs. limit=10.0 2023-06-21 18:07:19,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1027518.0, ans=0.125 2023-06-21 18:07:34,238 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.279e+02 2.900e+02 3.294e+02 3.931e+02 7.649e+02, threshold=6.589e+02, percent-clipped=4.0 2023-06-21 18:07:41,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1027578.0, ans=0.1 2023-06-21 18:07:45,354 INFO [train.py:996] (2/4) Epoch 6, batch 18800, loss[loss=0.2272, simple_loss=0.3131, pruned_loss=0.07061, over 21681.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3094, pruned_loss=0.08529, over 4258808.03 frames. ], batch size: 263, lr: 5.05e-03, grad_scale: 32.0 2023-06-21 18:08:26,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1027758.0, ans=0.125 2023-06-21 18:09:18,548 INFO [train.py:996] (2/4) Epoch 6, batch 18850, loss[loss=0.176, simple_loss=0.2537, pruned_loss=0.0491, over 21312.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3039, pruned_loss=0.07951, over 4265805.49 frames. ], batch size: 159, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:09:52,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.72 vs. 
limit=5.0 2023-06-21 18:09:59,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1028058.0, ans=0.05 2023-06-21 18:10:09,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1028058.0, ans=0.0 2023-06-21 18:10:36,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.51 vs. limit=15.0 2023-06-21 18:10:41,514 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.769e+02 2.569e+02 2.914e+02 3.331e+02 4.866e+02, threshold=5.828e+02, percent-clipped=0.0 2023-06-21 18:10:51,728 INFO [train.py:996] (2/4) Epoch 6, batch 18900, loss[loss=0.2339, simple_loss=0.2955, pruned_loss=0.08617, over 21793.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3007, pruned_loss=0.07958, over 4272961.29 frames. ], batch size: 416, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:11:08,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1028238.0, ans=0.05 2023-06-21 18:11:08,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1028238.0, ans=0.125 2023-06-21 18:12:24,799 INFO [train.py:996] (2/4) Epoch 6, batch 18950, loss[loss=0.1977, simple_loss=0.2618, pruned_loss=0.06686, over 21166.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3027, pruned_loss=0.08255, over 4276279.69 frames. ], batch size: 608, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:13:37,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1028718.0, ans=0.1 2023-06-21 18:13:48,639 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 3.005e+02 3.685e+02 4.757e+02 8.623e+02, threshold=7.371e+02, percent-clipped=7.0 2023-06-21 18:14:04,048 INFO [train.py:996] (2/4) Epoch 6, batch 19000, loss[loss=0.25, simple_loss=0.3274, pruned_loss=0.08631, over 21380.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3121, pruned_loss=0.08373, over 4272276.91 frames. ], batch size: 176, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:14:42,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1028898.0, ans=0.0 2023-06-21 18:14:44,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.29 vs. limit=10.0 2023-06-21 18:15:03,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1029018.0, ans=0.2 2023-06-21 18:15:20,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1029078.0, ans=0.0 2023-06-21 18:15:36,832 INFO [train.py:996] (2/4) Epoch 6, batch 19050, loss[loss=0.2741, simple_loss=0.3268, pruned_loss=0.1107, over 21519.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3169, pruned_loss=0.08766, over 4272922.47 frames. ], batch size: 548, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:16:54,975 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 2.879e+02 3.312e+02 3.801e+02 5.598e+02, threshold=6.624e+02, percent-clipped=0.0 2023-06-21 18:17:10,667 INFO [train.py:996] (2/4) Epoch 6, batch 19100, loss[loss=0.2148, simple_loss=0.2774, pruned_loss=0.07611, over 21672.00 frames. 
], tot_loss[loss=0.2458, simple_loss=0.3148, pruned_loss=0.08843, over 4274533.73 frames. ], batch size: 282, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:18:03,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-21 18:18:08,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-21 18:18:17,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1029618.0, ans=0.1 2023-06-21 18:18:48,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1029678.0, ans=0.125 2023-06-21 18:18:51,223 INFO [train.py:996] (2/4) Epoch 6, batch 19150, loss[loss=0.2452, simple_loss=0.334, pruned_loss=0.07825, over 21442.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3187, pruned_loss=0.08957, over 4278990.43 frames. ], batch size: 211, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:19:28,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. limit=6.0 2023-06-21 18:19:29,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1029798.0, ans=0.07 2023-06-21 18:19:38,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1029858.0, ans=0.05 2023-06-21 18:20:17,877 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.081e+02 3.009e+02 3.594e+02 4.563e+02 7.134e+02, threshold=7.188e+02, percent-clipped=1.0 2023-06-21 18:20:25,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1030038.0, ans=0.0 2023-06-21 18:20:31,688 INFO [train.py:996] (2/4) Epoch 6, batch 19200, loss[loss=0.3398, simple_loss=0.4276, pruned_loss=0.1259, over 21638.00 frames. ], tot_loss[loss=0.2558, simple_loss=0.3302, pruned_loss=0.09074, over 4278646.01 frames. ], batch size: 441, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:20:42,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1030038.0, ans=0.125 2023-06-21 18:20:50,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=15.0 2023-06-21 18:20:56,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1030098.0, ans=0.2 2023-06-21 18:20:56,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1030098.0, ans=0.125 2023-06-21 18:21:05,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1030158.0, ans=0.2 2023-06-21 18:21:21,013 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.80 vs. 
limit=15.0 2023-06-21 18:21:33,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1030218.0, ans=0.5 2023-06-21 18:21:56,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1030278.0, ans=0.04949747468305833 2023-06-21 18:21:57,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1030278.0, ans=0.125 2023-06-21 18:22:00,350 INFO [train.py:996] (2/4) Epoch 6, batch 19250, loss[loss=0.2643, simple_loss=0.3461, pruned_loss=0.09122, over 21725.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3275, pruned_loss=0.08496, over 4265694.45 frames. ], batch size: 441, lr: 5.04e-03, grad_scale: 32.0 2023-06-21 18:22:00,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1030338.0, ans=0.1 2023-06-21 18:22:41,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1030458.0, ans=0.0 2023-06-21 18:22:50,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1030458.0, ans=0.125 2023-06-21 18:23:25,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.761e+02 2.595e+02 3.039e+02 3.485e+02 4.997e+02, threshold=6.078e+02, percent-clipped=0.0 2023-06-21 18:23:37,842 INFO [train.py:996] (2/4) Epoch 6, batch 19300, loss[loss=0.2643, simple_loss=0.3244, pruned_loss=0.1021, over 21775.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3238, pruned_loss=0.08484, over 4274962.07 frames. ], batch size: 441, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:24:06,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1030698.0, ans=0.0 2023-06-21 18:24:09,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-21 18:24:10,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1030698.0, ans=0.125 2023-06-21 18:24:36,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1030818.0, ans=0.1 2023-06-21 18:25:17,123 INFO [train.py:996] (2/4) Epoch 6, batch 19350, loss[loss=0.2668, simple_loss=0.3499, pruned_loss=0.0918, over 21563.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3177, pruned_loss=0.08002, over 4273767.30 frames. ], batch size: 473, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:25:28,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1030938.0, ans=0.125 2023-06-21 18:25:29,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1030938.0, ans=0.5 2023-06-21 18:25:30,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-21 18:26:21,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.38 vs. 
limit=22.5 2023-06-21 18:26:28,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1031178.0, ans=0.2 2023-06-21 18:26:38,757 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.965e+02 2.597e+02 3.186e+02 4.013e+02 6.947e+02, threshold=6.372e+02, percent-clipped=2.0 2023-06-21 18:26:50,812 INFO [train.py:996] (2/4) Epoch 6, batch 19400, loss[loss=0.2041, simple_loss=0.2774, pruned_loss=0.06542, over 21827.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.317, pruned_loss=0.07995, over 4268471.09 frames. ], batch size: 282, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:27:08,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.26 vs. limit=10.0 2023-06-21 18:27:13,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1031298.0, ans=0.0 2023-06-21 18:27:48,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1031418.0, ans=0.125 2023-06-21 18:27:57,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1031418.0, ans=0.0 2023-06-21 18:28:24,297 INFO [train.py:996] (2/4) Epoch 6, batch 19450, loss[loss=0.2645, simple_loss=0.3172, pruned_loss=0.1059, over 21439.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3136, pruned_loss=0.08192, over 4264984.33 frames. ], batch size: 131, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:28:32,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1031538.0, ans=0.125 2023-06-21 18:28:46,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1031598.0, ans=0.125 2023-06-21 18:28:56,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-06-21 18:29:47,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1031778.0, ans=0.1 2023-06-21 18:29:51,191 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.868e+02 3.154e+02 3.556e+02 5.974e+02, threshold=6.308e+02, percent-clipped=0.0 2023-06-21 18:29:58,613 INFO [train.py:996] (2/4) Epoch 6, batch 19500, loss[loss=0.3111, simple_loss=0.3605, pruned_loss=0.1309, over 21422.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3103, pruned_loss=0.08358, over 4261774.06 frames. ], batch size: 507, lr: 5.04e-03, grad_scale: 16.0 2023-06-21 18:30:05,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1031838.0, ans=0.04949747468305833 2023-06-21 18:31:02,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1032018.0, ans=0.2 2023-06-21 18:31:23,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1032078.0, ans=0.0 2023-06-21 18:31:25,580 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.57 vs. 
limit=22.5 2023-06-21 18:31:34,625 INFO [train.py:996] (2/4) Epoch 6, batch 19550, loss[loss=0.2224, simple_loss=0.321, pruned_loss=0.0619, over 21753.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3058, pruned_loss=0.08192, over 4238942.01 frames. ], batch size: 332, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:31:42,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1032138.0, ans=0.125 2023-06-21 18:31:42,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1032138.0, ans=0.0 2023-06-21 18:32:14,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.45 vs. limit=15.0 2023-06-21 18:32:19,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1032258.0, ans=0.125 2023-06-21 18:32:46,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1032378.0, ans=0.1 2023-06-21 18:33:00,133 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.941e+02 3.432e+02 4.314e+02 8.392e+02, threshold=6.865e+02, percent-clipped=2.0 2023-06-21 18:33:07,588 INFO [train.py:996] (2/4) Epoch 6, batch 19600, loss[loss=0.2674, simple_loss=0.3387, pruned_loss=0.09802, over 21903.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.308, pruned_loss=0.08311, over 4250589.07 frames. ], batch size: 107, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:33:53,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1032558.0, ans=0.125 2023-06-21 18:34:42,364 INFO [train.py:996] (2/4) Epoch 6, batch 19650, loss[loss=0.29, simple_loss=0.3515, pruned_loss=0.1142, over 21444.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3136, pruned_loss=0.08733, over 4255723.90 frames. ], batch size: 131, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:34:54,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1032738.0, ans=0.0 2023-06-21 18:35:14,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1032798.0, ans=0.125 2023-06-21 18:35:16,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1032858.0, ans=0.07 2023-06-21 18:35:17,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1032858.0, ans=0.09899494936611666 2023-06-21 18:35:33,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1032858.0, ans=0.125 2023-06-21 18:36:09,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1032978.0, ans=0.125 2023-06-21 18:36:10,812 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.518e+02 3.307e+02 3.868e+02 4.631e+02 9.125e+02, threshold=7.736e+02, percent-clipped=5.0 2023-06-21 18:36:23,458 INFO [train.py:996] (2/4) Epoch 6, batch 19700, loss[loss=0.2447, simple_loss=0.3317, pruned_loss=0.07887, over 21699.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3158, pruned_loss=0.08649, over 4256786.47 frames. 
], batch size: 351, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:36:30,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1033038.0, ans=0.125 2023-06-21 18:36:39,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1033098.0, ans=0.125 2023-06-21 18:37:45,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1033278.0, ans=0.125 2023-06-21 18:37:47,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.36 vs. limit=22.5 2023-06-21 18:37:58,190 INFO [train.py:996] (2/4) Epoch 6, batch 19750, loss[loss=0.2845, simple_loss=0.3675, pruned_loss=0.1007, over 21768.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3268, pruned_loss=0.08828, over 4258743.98 frames. ], batch size: 298, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:38:26,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1033398.0, ans=0.0 2023-06-21 18:38:48,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1033458.0, ans=0.1 2023-06-21 18:38:54,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1033458.0, ans=0.125 2023-06-21 18:39:24,367 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.005e+02 3.146e+02 3.443e+02 4.036e+02 6.788e+02, threshold=6.886e+02, percent-clipped=0.0 2023-06-21 18:39:31,802 INFO [train.py:996] (2/4) Epoch 6, batch 19800, loss[loss=0.2423, simple_loss=0.3236, pruned_loss=0.08044, over 21666.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.324, pruned_loss=0.08771, over 4262141.37 frames. ], batch size: 441, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:39:37,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=22.5 2023-06-21 18:40:25,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1033758.0, ans=0.0 2023-06-21 18:40:42,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1033818.0, ans=0.125 2023-06-21 18:41:11,659 INFO [train.py:996] (2/4) Epoch 6, batch 19850, loss[loss=0.1782, simple_loss=0.2409, pruned_loss=0.05778, over 21821.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3171, pruned_loss=0.0834, over 4265103.60 frames. 
], batch size: 102, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:41:52,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1034058.0, ans=0.2 2023-06-21 18:42:13,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1034118.0, ans=0.125 2023-06-21 18:42:33,692 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.847e+02 2.771e+02 3.274e+02 3.883e+02 8.276e+02, threshold=6.549e+02, percent-clipped=3.0 2023-06-21 18:42:34,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1034178.0, ans=0.125 2023-06-21 18:42:43,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1034238.0, ans=0.0 2023-06-21 18:42:44,455 INFO [train.py:996] (2/4) Epoch 6, batch 19900, loss[loss=0.2186, simple_loss=0.2825, pruned_loss=0.07736, over 21183.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3171, pruned_loss=0.08097, over 4269636.60 frames. ], batch size: 176, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:43:23,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1034358.0, ans=0.125 2023-06-21 18:43:41,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1034418.0, ans=0.125 2023-06-21 18:43:52,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1034418.0, ans=0.125 2023-06-21 18:43:58,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.19 vs. limit=15.0 2023-06-21 18:44:19,352 INFO [train.py:996] (2/4) Epoch 6, batch 19950, loss[loss=0.2208, simple_loss=0.2968, pruned_loss=0.07239, over 21740.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3105, pruned_loss=0.08078, over 4261343.48 frames. ], batch size: 316, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:44:19,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1034538.0, ans=0.0 2023-06-21 18:45:04,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-21 18:45:34,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1034778.0, ans=0.0 2023-06-21 18:45:36,276 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-21 18:45:44,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1034778.0, ans=0.125 2023-06-21 18:45:47,377 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.175e+02 2.877e+02 3.269e+02 3.817e+02 6.552e+02, threshold=6.538e+02, percent-clipped=1.0 2023-06-21 18:45:47,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1034778.0, ans=0.0 2023-06-21 18:45:53,429 INFO [train.py:996] (2/4) Epoch 6, batch 20000, loss[loss=0.2472, simple_loss=0.319, pruned_loss=0.08768, over 21642.00 frames. 
], tot_loss[loss=0.237, simple_loss=0.3112, pruned_loss=0.08145, over 4265169.42 frames. ], batch size: 263, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:45:57,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1034838.0, ans=0.0 2023-06-21 18:46:32,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.06 vs. limit=10.0 2023-06-21 18:46:33,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1034958.0, ans=0.125 2023-06-21 18:46:42,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.50 vs. limit=15.0 2023-06-21 18:46:43,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1034958.0, ans=0.0 2023-06-21 18:46:54,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1035018.0, ans=0.0 2023-06-21 18:47:05,010 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.26 vs. limit=6.0 2023-06-21 18:47:06,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-21 18:47:13,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1035078.0, ans=0.125 2023-06-21 18:47:26,471 INFO [train.py:996] (2/4) Epoch 6, batch 20050, loss[loss=0.2591, simple_loss=0.3251, pruned_loss=0.0965, over 21803.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.313, pruned_loss=0.08386, over 4270188.70 frames. ], batch size: 414, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:47:31,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1035138.0, ans=0.1 2023-06-21 18:48:07,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1035258.0, ans=0.125 2023-06-21 18:48:30,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1035318.0, ans=0.125 2023-06-21 18:48:54,571 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 2.772e+02 3.158e+02 3.739e+02 6.638e+02, threshold=6.316e+02, percent-clipped=1.0 2023-06-21 18:48:58,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.90 vs. limit=22.5 2023-06-21 18:49:00,939 INFO [train.py:996] (2/4) Epoch 6, batch 20100, loss[loss=0.2017, simple_loss=0.2706, pruned_loss=0.06639, over 21246.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3159, pruned_loss=0.08717, over 4282770.61 frames. 
], batch size: 608, lr: 5.03e-03, grad_scale: 32.0 2023-06-21 18:49:21,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1035438.0, ans=0.1 2023-06-21 18:49:39,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1035558.0, ans=0.1 2023-06-21 18:50:15,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1035618.0, ans=0.2 2023-06-21 18:50:22,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1035678.0, ans=0.125 2023-06-21 18:50:40,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-21 18:50:45,519 INFO [train.py:996] (2/4) Epoch 6, batch 20150, loss[loss=0.3193, simple_loss=0.4368, pruned_loss=0.1009, over 20817.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3272, pruned_loss=0.09149, over 4277676.16 frames. ], batch size: 607, lr: 5.03e-03, grad_scale: 16.0 2023-06-21 18:51:00,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1035798.0, ans=0.0 2023-06-21 18:51:03,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1035798.0, ans=0.125 2023-06-21 18:51:39,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1035858.0, ans=0.125 2023-06-21 18:52:13,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1035978.0, ans=0.2 2023-06-21 18:52:17,965 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.643e+02 3.447e+02 4.089e+02 4.717e+02 8.481e+02, threshold=8.179e+02, percent-clipped=7.0 2023-06-21 18:52:22,519 INFO [train.py:996] (2/4) Epoch 6, batch 20200, loss[loss=0.2994, simple_loss=0.3957, pruned_loss=0.1015, over 20797.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.334, pruned_loss=0.09447, over 4276798.01 frames. ], batch size: 607, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:53:21,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1036218.0, ans=0.125 2023-06-21 18:53:38,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1036218.0, ans=0.125 2023-06-21 18:54:02,158 INFO [train.py:996] (2/4) Epoch 6, batch 20250, loss[loss=0.2519, simple_loss=0.3369, pruned_loss=0.08345, over 21812.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3343, pruned_loss=0.09209, over 4280080.20 frames. 
], batch size: 351, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:54:55,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1036518.0, ans=0.0 2023-06-21 18:55:22,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.148e+02 2.562e+02 2.955e+02 3.343e+02 6.024e+02, threshold=5.910e+02, percent-clipped=0.0 2023-06-21 18:55:22,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1036578.0, ans=0.125 2023-06-21 18:55:31,046 INFO [train.py:996] (2/4) Epoch 6, batch 20300, loss[loss=0.2547, simple_loss=0.3566, pruned_loss=0.07635, over 20828.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3293, pruned_loss=0.08821, over 4269948.04 frames. ], batch size: 608, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 18:55:37,336 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 18:55:59,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1036698.0, ans=0.2 2023-06-21 18:56:33,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1036818.0, ans=0.125 2023-06-21 18:56:59,741 INFO [train.py:996] (2/4) Epoch 6, batch 20350, loss[loss=0.2849, simple_loss=0.3442, pruned_loss=0.1127, over 21851.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3282, pruned_loss=0.08846, over 4254106.61 frames. ], batch size: 332, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 18:57:22,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1036998.0, ans=0.125 2023-06-21 18:57:34,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1036998.0, ans=0.1 2023-06-21 18:57:36,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1036998.0, ans=0.0 2023-06-21 18:57:45,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1037058.0, ans=0.0 2023-06-21 18:58:13,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1037118.0, ans=0.125 2023-06-21 18:58:34,348 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 2.934e+02 3.391e+02 4.276e+02 6.956e+02, threshold=6.781e+02, percent-clipped=4.0 2023-06-21 18:58:37,440 INFO [train.py:996] (2/4) Epoch 6, batch 20400, loss[loss=0.3076, simple_loss=0.3755, pruned_loss=0.1198, over 21346.00 frames. ], tot_loss[loss=0.2574, simple_loss=0.3314, pruned_loss=0.09171, over 4263370.67 frames. ], batch size: 548, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 18:58:49,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1037238.0, ans=0.125 2023-06-21 18:59:06,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037298.0, ans=0.1 2023-06-21 18:59:07,065 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.43 vs. 
limit=15.0 2023-06-21 18:59:21,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1037358.0, ans=0.1 2023-06-21 18:59:42,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1037418.0, ans=0.0 2023-06-21 18:59:55,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1037478.0, ans=0.0 2023-06-21 19:00:05,782 INFO [train.py:996] (2/4) Epoch 6, batch 20450, loss[loss=0.2114, simple_loss=0.2569, pruned_loss=0.08292, over 19986.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3339, pruned_loss=0.09495, over 4261102.17 frames. ], batch size: 704, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:00:10,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1037538.0, ans=0.125 2023-06-21 19:00:19,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1037598.0, ans=0.125 2023-06-21 19:00:42,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1037658.0, ans=0.1 2023-06-21 19:00:44,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1037658.0, ans=15.0 2023-06-21 19:00:49,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1037658.0, ans=0.1 2023-06-21 19:01:02,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1037718.0, ans=0.125 2023-06-21 19:01:11,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1037718.0, ans=0.125 2023-06-21 19:01:11,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-21 19:01:26,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1037778.0, ans=0.125 2023-06-21 19:01:36,786 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.469e+02 3.567e+02 4.308e+02 5.221e+02 9.242e+02, threshold=8.616e+02, percent-clipped=7.0 2023-06-21 19:01:39,723 INFO [train.py:996] (2/4) Epoch 6, batch 20500, loss[loss=0.2375, simple_loss=0.308, pruned_loss=0.08347, over 21341.00 frames. ], tot_loss[loss=0.2586, simple_loss=0.3284, pruned_loss=0.09435, over 4261118.83 frames. 
], batch size: 176, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:01:47,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1037838.0, ans=0.125 2023-06-21 19:01:48,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1037838.0, ans=0.125 2023-06-21 19:02:16,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1037958.0, ans=0.0 2023-06-21 19:02:49,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1038018.0, ans=0.125 2023-06-21 19:03:06,982 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:03:11,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1038078.0, ans=0.125 2023-06-21 19:03:14,146 INFO [train.py:996] (2/4) Epoch 6, batch 20550, loss[loss=0.2141, simple_loss=0.297, pruned_loss=0.06558, over 21568.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3196, pruned_loss=0.0923, over 4261432.10 frames. ], batch size: 263, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:03:23,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1038138.0, ans=0.05 2023-06-21 19:03:25,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-21 19:03:39,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1038198.0, ans=0.2 2023-06-21 19:03:58,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1038258.0, ans=0.2 2023-06-21 19:03:58,924 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-21 19:04:39,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.55 vs. limit=15.0 2023-06-21 19:04:45,835 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.022e+02 3.769e+02 4.529e+02 7.328e+02, threshold=7.538e+02, percent-clipped=0.0 2023-06-21 19:04:48,981 INFO [train.py:996] (2/4) Epoch 6, batch 20600, loss[loss=0.2566, simple_loss=0.331, pruned_loss=0.09108, over 21531.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3214, pruned_loss=0.08977, over 4249336.83 frames. ], batch size: 548, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:05:12,571 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-21 19:05:13,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1038498.0, ans=0.125 2023-06-21 19:05:15,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=1038498.0, ans=0.025 2023-06-21 19:06:21,549 INFO [train.py:996] (2/4) Epoch 6, batch 20650, loss[loss=0.248, simple_loss=0.3041, pruned_loss=0.09597, over 21289.00 frames. 
], tot_loss[loss=0.2494, simple_loss=0.3182, pruned_loss=0.09029, over 4249317.16 frames. ], batch size: 143, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:07:12,075 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.35 vs. limit=10.0 2023-06-21 19:07:53,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.08 vs. limit=15.0 2023-06-21 19:07:55,779 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.861e+02 2.764e+02 3.114e+02 3.548e+02 5.059e+02, threshold=6.228e+02, percent-clipped=0.0 2023-06-21 19:07:57,288 INFO [train.py:996] (2/4) Epoch 6, batch 20700, loss[loss=0.2018, simple_loss=0.2844, pruned_loss=0.0596, over 21706.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3116, pruned_loss=0.08731, over 4255446.60 frames. ], batch size: 298, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:08:14,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1039038.0, ans=0.05 2023-06-21 19:08:20,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1039098.0, ans=0.2 2023-06-21 19:09:37,896 INFO [train.py:996] (2/4) Epoch 6, batch 20750, loss[loss=0.3084, simple_loss=0.408, pruned_loss=0.1044, over 21666.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3129, pruned_loss=0.08568, over 4254694.10 frames. ], batch size: 389, lr: 5.02e-03, grad_scale: 8.0 2023-06-21 19:11:05,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1039578.0, ans=0.1 2023-06-21 19:11:08,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1039578.0, ans=0.125 2023-06-21 19:11:11,334 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 3.291e+02 4.399e+02 5.919e+02 1.160e+03, threshold=8.798e+02, percent-clipped=22.0 2023-06-21 19:11:12,892 INFO [train.py:996] (2/4) Epoch 6, batch 20800, loss[loss=0.252, simple_loss=0.2959, pruned_loss=0.1041, over 21423.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3165, pruned_loss=0.08711, over 4255611.70 frames. ], batch size: 160, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:11:23,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1039638.0, ans=0.125 2023-06-21 19:11:53,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1039758.0, ans=0.025 2023-06-21 19:12:01,286 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 19:12:45,675 INFO [train.py:996] (2/4) Epoch 6, batch 20850, loss[loss=0.1743, simple_loss=0.2472, pruned_loss=0.0507, over 21658.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3115, pruned_loss=0.08551, over 4255468.96 frames. 
], batch size: 247, lr: 5.02e-03, grad_scale: 16.0 2023-06-21 19:12:47,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1039938.0, ans=0.2 2023-06-21 19:13:00,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1039998.0, ans=0.5 2023-06-21 19:14:00,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1040118.0, ans=0.0 2023-06-21 19:14:02,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-21 19:14:09,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1040178.0, ans=0.2 2023-06-21 19:14:17,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.782e+02 3.461e+02 4.341e+02 9.177e+02, threshold=6.922e+02, percent-clipped=2.0 2023-06-21 19:14:18,811 INFO [train.py:996] (2/4) Epoch 6, batch 20900, loss[loss=0.2256, simple_loss=0.3038, pruned_loss=0.07369, over 21599.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3127, pruned_loss=0.08672, over 4268451.89 frames. ], batch size: 263, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:15:12,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-21 19:15:34,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1040418.0, ans=0.0 2023-06-21 19:15:35,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1040478.0, ans=0.125 2023-06-21 19:15:39,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.70 vs. limit=12.0 2023-06-21 19:15:50,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1040538.0, ans=0.0 2023-06-21 19:15:51,577 INFO [train.py:996] (2/4) Epoch 6, batch 20950, loss[loss=0.2265, simple_loss=0.3001, pruned_loss=0.07646, over 21775.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3077, pruned_loss=0.08264, over 4258365.96 frames. ], batch size: 414, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:15:57,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1040538.0, ans=0.04949747468305833 2023-06-21 19:15:59,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1040538.0, ans=0.0 2023-06-21 19:16:11,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1040598.0, ans=0.04949747468305833 2023-06-21 19:16:39,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1040718.0, ans=0.2 2023-06-21 19:17:17,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.20 vs. 
limit=15.0 2023-06-21 19:17:17,881 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.962e+02 2.529e+02 2.877e+02 3.319e+02 6.338e+02, threshold=5.753e+02, percent-clipped=0.0 2023-06-21 19:17:19,464 INFO [train.py:996] (2/4) Epoch 6, batch 21000, loss[loss=0.2635, simple_loss=0.3196, pruned_loss=0.1037, over 21575.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3055, pruned_loss=0.08217, over 4262301.75 frames. ], batch size: 548, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:17:19,464 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 19:17:35,758 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2688, simple_loss=0.3681, pruned_loss=0.08473, over 1796401.00 frames. 2023-06-21 19:17:35,758 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 19:17:45,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1040838.0, ans=0.0 2023-06-21 19:17:58,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1040898.0, ans=0.125 2023-06-21 19:18:17,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1040958.0, ans=0.1 2023-06-21 19:18:32,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1041018.0, ans=0.1 2023-06-21 19:18:58,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1041078.0, ans=0.1 2023-06-21 19:19:06,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1041138.0, ans=0.125 2023-06-21 19:19:08,163 INFO [train.py:996] (2/4) Epoch 6, batch 21050, loss[loss=0.2295, simple_loss=0.2904, pruned_loss=0.08428, over 21321.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3041, pruned_loss=0.08299, over 4256997.66 frames. ], batch size: 144, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:19:26,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1041198.0, ans=10.0 2023-06-21 19:20:34,869 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.795e+02 3.116e+02 3.832e+02 6.783e+02, threshold=6.232e+02, percent-clipped=3.0 2023-06-21 19:20:36,442 INFO [train.py:996] (2/4) Epoch 6, batch 21100, loss[loss=0.2137, simple_loss=0.2764, pruned_loss=0.07548, over 21828.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3006, pruned_loss=0.08278, over 4260131.26 frames. ], batch size: 318, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:21:19,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1041558.0, ans=0.125 2023-06-21 19:22:10,050 INFO [train.py:996] (2/4) Epoch 6, batch 21150, loss[loss=0.2287, simple_loss=0.3093, pruned_loss=0.07407, over 16120.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.2985, pruned_loss=0.08394, over 4248460.83 frames. 
], batch size: 62, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:22:16,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1041738.0, ans=0.125 2023-06-21 19:22:42,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1041798.0, ans=0.125 2023-06-21 19:22:58,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1041858.0, ans=0.0 2023-06-21 19:23:23,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-21 19:23:41,594 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.952e+02 2.856e+02 3.274e+02 4.026e+02 6.885e+02, threshold=6.548e+02, percent-clipped=2.0 2023-06-21 19:23:42,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1042038.0, ans=0.125 2023-06-21 19:23:43,129 INFO [train.py:996] (2/4) Epoch 6, batch 21200, loss[loss=0.2231, simple_loss=0.2844, pruned_loss=0.08084, over 21406.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.2961, pruned_loss=0.08341, over 4243339.65 frames. ], batch size: 131, lr: 5.01e-03, grad_scale: 32.0 2023-06-21 19:24:15,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1042098.0, ans=0.125 2023-06-21 19:24:25,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1042158.0, ans=0.125 2023-06-21 19:24:51,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1042278.0, ans=0.07 2023-06-21 19:25:12,306 INFO [train.py:996] (2/4) Epoch 6, batch 21250, loss[loss=0.2007, simple_loss=0.2622, pruned_loss=0.06962, over 21716.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.2942, pruned_loss=0.08352, over 4244631.45 frames. ], batch size: 124, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:25:47,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1042458.0, ans=0.125 2023-06-21 19:26:41,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.210e+02 3.008e+02 3.484e+02 4.132e+02 8.300e+02, threshold=6.969e+02, percent-clipped=3.0 2023-06-21 19:26:41,177 INFO [train.py:996] (2/4) Epoch 6, batch 21300, loss[loss=0.2443, simple_loss=0.3107, pruned_loss=0.08899, over 21494.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3013, pruned_loss=0.08617, over 4249707.43 frames. 
], batch size: 212, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:26:41,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1042638.0, ans=0.125 2023-06-21 19:26:47,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1042638.0, ans=0.125 2023-06-21 19:27:16,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1042698.0, ans=0.125 2023-06-21 19:27:37,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1042758.0, ans=0.2 2023-06-21 19:28:12,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.36 vs. limit=15.0 2023-06-21 19:28:14,538 INFO [train.py:996] (2/4) Epoch 6, batch 21350, loss[loss=0.2589, simple_loss=0.319, pruned_loss=0.09937, over 21803.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3056, pruned_loss=0.08737, over 4263922.79 frames. ], batch size: 112, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:29:05,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-21 19:29:09,042 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.36 vs. limit=15.0 2023-06-21 19:29:10,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1043058.0, ans=10.0 2023-06-21 19:29:35,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1043178.0, ans=0.125 2023-06-21 19:29:49,098 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 2.778e+02 3.087e+02 3.779e+02 5.135e+02, threshold=6.175e+02, percent-clipped=0.0 2023-06-21 19:29:49,119 INFO [train.py:996] (2/4) Epoch 6, batch 21400, loss[loss=0.2424, simple_loss=0.2958, pruned_loss=0.09449, over 20218.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.309, pruned_loss=0.08634, over 4263111.33 frames. ], batch size: 703, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:29:50,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1043238.0, ans=0.125 2023-06-21 19:30:15,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1043298.0, ans=0.125 2023-06-21 19:30:20,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1043298.0, ans=0.2 2023-06-21 19:30:28,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1043358.0, ans=0.125 2023-06-21 19:30:48,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1043418.0, ans=0.2 2023-06-21 19:31:22,532 INFO [train.py:996] (2/4) Epoch 6, batch 21450, loss[loss=0.2601, simple_loss=0.324, pruned_loss=0.09812, over 21801.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3138, pruned_loss=0.08869, over 4270742.43 frames. 
], batch size: 441, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:31:51,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1043598.0, ans=0.0 2023-06-21 19:32:55,665 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 2.827e+02 3.165e+02 3.718e+02 5.694e+02, threshold=6.329e+02, percent-clipped=0.0 2023-06-21 19:32:55,686 INFO [train.py:996] (2/4) Epoch 6, batch 21500, loss[loss=0.2492, simple_loss=0.3074, pruned_loss=0.09547, over 21986.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3125, pruned_loss=0.08994, over 4269870.91 frames. ], batch size: 103, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:33:06,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-21 19:33:34,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1043898.0, ans=0.125 2023-06-21 19:34:11,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-21 19:34:29,721 INFO [train.py:996] (2/4) Epoch 6, batch 21550, loss[loss=0.2341, simple_loss=0.2953, pruned_loss=0.08638, over 21690.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3059, pruned_loss=0.08683, over 4260701.65 frames. ], batch size: 333, lr: 5.01e-03, grad_scale: 16.0 2023-06-21 19:34:33,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1044138.0, ans=0.2 2023-06-21 19:35:34,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1044318.0, ans=0.125 2023-06-21 19:36:04,912 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 2.859e+02 3.362e+02 4.302e+02 8.120e+02, threshold=6.725e+02, percent-clipped=2.0 2023-06-21 19:36:04,933 INFO [train.py:996] (2/4) Epoch 6, batch 21600, loss[loss=0.2202, simple_loss=0.2883, pruned_loss=0.076, over 21983.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3004, pruned_loss=0.08433, over 4251342.41 frames. ], batch size: 103, lr: 5.00e-03, grad_scale: 32.0 2023-06-21 19:36:17,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1044438.0, ans=0.0 2023-06-21 19:37:13,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1044618.0, ans=0.1 2023-06-21 19:37:39,254 INFO [train.py:996] (2/4) Epoch 6, batch 21650, loss[loss=0.2136, simple_loss=0.3, pruned_loss=0.06364, over 21403.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3034, pruned_loss=0.08238, over 4248116.08 frames. 
], batch size: 131, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:38:17,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1044798.0, ans=0.05 2023-06-21 19:38:19,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1044798.0, ans=0.1 2023-06-21 19:39:00,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1044978.0, ans=0.2 2023-06-21 19:39:13,276 INFO [train.py:996] (2/4) Epoch 6, batch 21700, loss[loss=0.2344, simple_loss=0.3103, pruned_loss=0.07925, over 21713.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3035, pruned_loss=0.07974, over 4252956.62 frames. ], batch size: 298, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:39:14,694 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.795e+02 3.316e+02 4.085e+02 7.380e+02, threshold=6.631e+02, percent-clipped=1.0 2023-06-21 19:40:04,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1045158.0, ans=0.0 2023-06-21 19:40:15,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1045218.0, ans=0.125 2023-06-21 19:40:24,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-21 19:40:27,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=5.37 vs. limit=5.0 2023-06-21 19:40:34,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1045278.0, ans=0.1 2023-06-21 19:40:34,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1045278.0, ans=0.0 2023-06-21 19:40:40,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1045278.0, ans=0.0 2023-06-21 19:40:41,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1045278.0, ans=0.04949747468305833 2023-06-21 19:40:45,986 INFO [train.py:996] (2/4) Epoch 6, batch 21750, loss[loss=0.2364, simple_loss=0.2988, pruned_loss=0.08705, over 15502.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2994, pruned_loss=0.07904, over 4242921.51 frames. 
], batch size: 60, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:41:15,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1045398.0, ans=0.125 2023-06-21 19:41:16,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1045398.0, ans=0.125 2023-06-21 19:41:30,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1045458.0, ans=0.2 2023-06-21 19:41:54,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1045518.0, ans=0.2 2023-06-21 19:42:01,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1045518.0, ans=0.1 2023-06-21 19:42:17,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1045578.0, ans=0.0 2023-06-21 19:42:19,842 INFO [train.py:996] (2/4) Epoch 6, batch 21800, loss[loss=0.2261, simple_loss=0.2947, pruned_loss=0.07877, over 21273.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2976, pruned_loss=0.08091, over 4245088.87 frames. ], batch size: 176, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:42:21,225 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.706e+02 3.025e+02 3.711e+02 5.699e+02, threshold=6.051e+02, percent-clipped=0.0 2023-06-21 19:42:24,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1045638.0, ans=0.0 2023-06-21 19:42:42,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-21 19:43:42,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1045878.0, ans=0.0 2023-06-21 19:43:53,868 INFO [train.py:996] (2/4) Epoch 6, batch 21850, loss[loss=0.2321, simple_loss=0.3311, pruned_loss=0.06656, over 21798.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.305, pruned_loss=0.08242, over 4258043.69 frames. ], batch size: 351, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:43:54,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1045938.0, ans=0.125 2023-06-21 19:44:44,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1046058.0, ans=0.1 2023-06-21 19:44:46,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.10 vs. limit=22.5 2023-06-21 19:45:13,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1046178.0, ans=0.2 2023-06-21 19:45:21,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1046178.0, ans=0.125 2023-06-21 19:45:24,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1046178.0, ans=0.125 2023-06-21 19:45:26,630 INFO [train.py:996] (2/4) Epoch 6, batch 21900, loss[loss=0.2574, simple_loss=0.3152, pruned_loss=0.09978, over 21574.00 frames. 
], tot_loss[loss=0.2377, simple_loss=0.307, pruned_loss=0.08421, over 4261526.89 frames. ], batch size: 391, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:45:28,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.124e+02 2.966e+02 3.405e+02 4.071e+02 6.584e+02, threshold=6.811e+02, percent-clipped=2.0 2023-06-21 19:45:33,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1046238.0, ans=0.125 2023-06-21 19:45:42,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1046238.0, ans=0.125 2023-06-21 19:46:10,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1046358.0, ans=0.125 2023-06-21 19:47:00,486 INFO [train.py:996] (2/4) Epoch 6, batch 21950, loss[loss=0.1752, simple_loss=0.2551, pruned_loss=0.04768, over 21555.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3002, pruned_loss=0.0824, over 4256125.94 frames. ], batch size: 230, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:47:16,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1046538.0, ans=0.125 2023-06-21 19:47:23,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1046598.0, ans=0.1 2023-06-21 19:47:38,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1046598.0, ans=0.0 2023-06-21 19:47:44,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1046658.0, ans=0.0 2023-06-21 19:47:53,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1046658.0, ans=0.0 2023-06-21 19:48:16,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1046778.0, ans=0.0 2023-06-21 19:48:23,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1046778.0, ans=0.0 2023-06-21 19:48:34,343 INFO [train.py:996] (2/4) Epoch 6, batch 22000, loss[loss=0.2121, simple_loss=0.2748, pruned_loss=0.07468, over 21680.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2933, pruned_loss=0.0785, over 4261632.70 frames. ], batch size: 282, lr: 5.00e-03, grad_scale: 32.0 2023-06-21 19:48:40,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.834e+02 2.414e+02 2.927e+02 3.631e+02 6.492e+02, threshold=5.855e+02, percent-clipped=0.0 2023-06-21 19:49:12,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1046898.0, ans=0.125 2023-06-21 19:49:13,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1046898.0, ans=0.125 2023-06-21 19:49:19,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1046958.0, ans=0.125 2023-06-21 19:49:25,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1046958.0, ans=0.125 2023-06-21 19:50:14,052 INFO [train.py:996] (2/4) Epoch 6, batch 22050, loss[loss=0.3083, simple_loss=0.3747, pruned_loss=0.1209, over 21719.00 frames. 
], tot_loss[loss=0.229, simple_loss=0.2975, pruned_loss=0.08024, over 4256933.48 frames. ], batch size: 441, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:50:40,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1047198.0, ans=0.07 2023-06-21 19:50:52,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1047258.0, ans=0.125 2023-06-21 19:51:48,061 INFO [train.py:996] (2/4) Epoch 6, batch 22100, loss[loss=0.2994, simple_loss=0.3643, pruned_loss=0.1173, over 21866.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3107, pruned_loss=0.08572, over 4258866.69 frames. ], batch size: 371, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:51:51,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.919e+02 3.410e+02 3.908e+02 4.704e+02 7.568e+02, threshold=7.817e+02, percent-clipped=7.0 2023-06-21 19:53:22,041 INFO [train.py:996] (2/4) Epoch 6, batch 22150, loss[loss=0.2372, simple_loss=0.3171, pruned_loss=0.07862, over 21462.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3136, pruned_loss=0.08816, over 4267312.65 frames. ], batch size: 194, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:53:22,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1047738.0, ans=0.0 2023-06-21 19:53:42,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1047738.0, ans=0.125 2023-06-21 19:54:05,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.39 vs. limit=15.0 2023-06-21 19:55:00,797 INFO [train.py:996] (2/4) Epoch 6, batch 22200, loss[loss=0.2362, simple_loss=0.312, pruned_loss=0.08016, over 17579.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3149, pruned_loss=0.08884, over 4268055.40 frames. ], batch size: 61, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:55:08,463 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.486e+02 3.160e+02 3.693e+02 4.495e+02 7.335e+02, threshold=7.385e+02, percent-clipped=0.0 2023-06-21 19:55:13,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1048038.0, ans=0.125 2023-06-21 19:55:34,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1048158.0, ans=0.0 2023-06-21 19:56:20,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1048278.0, ans=0.0 2023-06-21 19:56:34,373 INFO [train.py:996] (2/4) Epoch 6, batch 22250, loss[loss=0.3209, simple_loss=0.3786, pruned_loss=0.1316, over 21468.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.323, pruned_loss=0.09075, over 4278898.46 frames. ], batch size: 471, lr: 5.00e-03, grad_scale: 16.0 2023-06-21 19:57:35,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1048518.0, ans=0.125 2023-06-21 19:57:57,994 INFO [train.py:996] (2/4) Epoch 6, batch 22300, loss[loss=0.2493, simple_loss=0.3125, pruned_loss=0.09303, over 21231.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3238, pruned_loss=0.09268, over 4286681.25 frames. 
], batch size: 143, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 19:58:05,485 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.426e+02 3.056e+02 3.498e+02 3.964e+02 6.113e+02, threshold=6.996e+02, percent-clipped=0.0 2023-06-21 19:59:02,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1048818.0, ans=0.0 2023-06-21 19:59:10,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=12.0 2023-06-21 19:59:27,793 INFO [train.py:996] (2/4) Epoch 6, batch 22350, loss[loss=0.2895, simple_loss=0.3454, pruned_loss=0.1168, over 21837.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3217, pruned_loss=0.09336, over 4298061.77 frames. ], batch size: 441, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:00:12,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-21 20:00:28,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1049118.0, ans=0.0 2023-06-21 20:00:34,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1049118.0, ans=0.125 2023-06-21 20:00:46,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1049178.0, ans=0.125 2023-06-21 20:00:57,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1049178.0, ans=0.0 2023-06-21 20:01:01,245 INFO [train.py:996] (2/4) Epoch 6, batch 22400, loss[loss=0.234, simple_loss=0.2979, pruned_loss=0.08503, over 21894.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3182, pruned_loss=0.08991, over 4297398.56 frames. ], batch size: 107, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:01:03,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1049238.0, ans=0.125 2023-06-21 20:01:04,242 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.993e+02 2.868e+02 3.552e+02 4.171e+02 5.869e+02, threshold=7.104e+02, percent-clipped=0.0 2023-06-21 20:01:17,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1049298.0, ans=0.2 2023-06-21 20:02:34,800 INFO [train.py:996] (2/4) Epoch 6, batch 22450, loss[loss=0.2165, simple_loss=0.2753, pruned_loss=0.07886, over 21452.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3116, pruned_loss=0.08768, over 4299511.43 frames. ], batch size: 212, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:03:13,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1049658.0, ans=0.1 2023-06-21 20:04:08,820 INFO [train.py:996] (2/4) Epoch 6, batch 22500, loss[loss=0.2361, simple_loss=0.2899, pruned_loss=0.09118, over 20035.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3063, pruned_loss=0.0877, over 4285952.95 frames. 
], batch size: 702, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:04:10,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1049838.0, ans=0.0 2023-06-21 20:04:11,776 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.833e+02 3.380e+02 4.088e+02 7.765e+02, threshold=6.760e+02, percent-clipped=2.0 2023-06-21 20:04:12,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1049838.0, ans=0.125 2023-06-21 20:04:16,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1049838.0, ans=0.0 2023-06-21 20:04:29,241 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=22.5 2023-06-21 20:04:43,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1049958.0, ans=0.0 2023-06-21 20:05:42,956 INFO [train.py:996] (2/4) Epoch 6, batch 22550, loss[loss=0.2307, simple_loss=0.3191, pruned_loss=0.07118, over 21035.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3111, pruned_loss=0.08765, over 4285230.47 frames. ], batch size: 607, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:06:28,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1050258.0, ans=0.2 2023-06-21 20:06:34,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1050258.0, ans=0.0 2023-06-21 20:06:46,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=15.0 2023-06-21 20:07:14,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1050378.0, ans=0.0 2023-06-21 20:07:18,460 INFO [train.py:996] (2/4) Epoch 6, batch 22600, loss[loss=0.3313, simple_loss=0.4068, pruned_loss=0.1279, over 21509.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3152, pruned_loss=0.08853, over 4289498.69 frames. ], batch size: 471, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:07:21,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.122e+02 3.804e+02 4.633e+02 7.875e+02, threshold=7.609e+02, percent-clipped=4.0 2023-06-21 20:07:44,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1050498.0, ans=0.125 2023-06-21 20:08:47,119 INFO [train.py:996] (2/4) Epoch 6, batch 22650, loss[loss=0.2647, simple_loss=0.3804, pruned_loss=0.07454, over 19817.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3128, pruned_loss=0.08789, over 4286901.00 frames. ], batch size: 703, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:09:29,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1050858.0, ans=0.125 2023-06-21 20:10:06,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1050978.0, ans=0.0 2023-06-21 20:10:19,395 INFO [train.py:996] (2/4) Epoch 6, batch 22700, loss[loss=0.2628, simple_loss=0.3112, pruned_loss=0.1071, over 21505.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3065, pruned_loss=0.08777, over 4285212.85 frames. 
], batch size: 442, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:10:23,803 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.406e+02 3.096e+02 3.667e+02 4.331e+02 7.482e+02, threshold=7.334e+02, percent-clipped=0.0 2023-06-21 20:11:50,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1051278.0, ans=0.2 2023-06-21 20:11:53,616 INFO [train.py:996] (2/4) Epoch 6, batch 22750, loss[loss=0.26, simple_loss=0.3304, pruned_loss=0.09484, over 21383.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3076, pruned_loss=0.08952, over 4276050.53 frames. ], batch size: 549, lr: 4.99e-03, grad_scale: 16.0 2023-06-21 20:11:54,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1051338.0, ans=0.2 2023-06-21 20:12:05,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-21 20:12:46,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1051458.0, ans=0.125 2023-06-21 20:13:06,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1051518.0, ans=0.2 2023-06-21 20:13:23,674 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:13:26,411 INFO [train.py:996] (2/4) Epoch 6, batch 22800, loss[loss=0.2761, simple_loss=0.3281, pruned_loss=0.112, over 21603.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3113, pruned_loss=0.09185, over 4286815.25 frames. ], batch size: 471, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:13:29,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1051638.0, ans=0.125 2023-06-21 20:13:30,790 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.282e+02 2.968e+02 3.368e+02 3.965e+02 6.508e+02, threshold=6.737e+02, percent-clipped=0.0 2023-06-21 20:13:44,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1051698.0, ans=0.125 2023-06-21 20:13:51,757 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.71 vs. limit=5.0 2023-06-21 20:14:15,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1051758.0, ans=0.125 2023-06-21 20:14:20,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=12.0 2023-06-21 20:14:28,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1051818.0, ans=0.1 2023-06-21 20:14:57,341 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.06 vs. limit=6.0 2023-06-21 20:14:59,379 INFO [train.py:996] (2/4) Epoch 6, batch 22850, loss[loss=0.2473, simple_loss=0.3151, pruned_loss=0.08981, over 22034.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3076, pruned_loss=0.09018, over 4283496.33 frames. 
], batch size: 103, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:15:31,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1051998.0, ans=0.0 2023-06-21 20:16:34,408 INFO [train.py:996] (2/4) Epoch 6, batch 22900, loss[loss=0.2843, simple_loss=0.4094, pruned_loss=0.07959, over 19758.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3114, pruned_loss=0.08925, over 4277012.53 frames. ], batch size: 702, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:16:39,199 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.267e+02 2.845e+02 3.273e+02 3.939e+02 6.144e+02, threshold=6.547e+02, percent-clipped=0.0 2023-06-21 20:16:41,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1052238.0, ans=0.0 2023-06-21 20:17:34,688 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=22.5 2023-06-21 20:18:03,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1052478.0, ans=0.125 2023-06-21 20:18:04,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1052478.0, ans=0.125 2023-06-21 20:18:11,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1052478.0, ans=0.0 2023-06-21 20:18:15,286 INFO [train.py:996] (2/4) Epoch 6, batch 22950, loss[loss=0.2426, simple_loss=0.3838, pruned_loss=0.05071, over 20764.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3249, pruned_loss=0.08712, over 4272711.82 frames. ], batch size: 608, lr: 4.99e-03, grad_scale: 32.0 2023-06-21 20:18:33,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.18 vs. limit=22.5 2023-06-21 20:18:59,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1052658.0, ans=0.0 2023-06-21 20:19:24,075 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=15.0 2023-06-21 20:19:26,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1052718.0, ans=0.0 2023-06-21 20:19:30,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1052778.0, ans=0.1 2023-06-21 20:19:32,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1052778.0, ans=0.125 2023-06-21 20:19:32,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1052778.0, ans=0.125 2023-06-21 20:19:49,044 INFO [train.py:996] (2/4) Epoch 6, batch 23000, loss[loss=0.2612, simple_loss=0.3249, pruned_loss=0.09879, over 21505.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3236, pruned_loss=0.08554, over 4272527.98 frames. 
], batch size: 131, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:19:53,522 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.125e+02 2.747e+02 3.155e+02 3.821e+02 7.452e+02, threshold=6.310e+02, percent-clipped=2.0 2023-06-21 20:19:54,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1052838.0, ans=0.07 2023-06-21 20:20:00,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1052838.0, ans=0.5 2023-06-21 20:20:19,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1052898.0, ans=0.0 2023-06-21 20:21:29,276 INFO [train.py:996] (2/4) Epoch 6, batch 23050, loss[loss=0.2453, simple_loss=0.3219, pruned_loss=0.08437, over 21815.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3251, pruned_loss=0.08822, over 4279982.37 frames. ], batch size: 118, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:22:07,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1053258.0, ans=0.2 2023-06-21 20:23:02,819 INFO [train.py:996] (2/4) Epoch 6, batch 23100, loss[loss=0.2554, simple_loss=0.3123, pruned_loss=0.09929, over 15361.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3213, pruned_loss=0.08936, over 4266135.40 frames. ], batch size: 62, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:23:07,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.234e+02 3.747e+02 4.482e+02 8.068e+02, threshold=7.493e+02, percent-clipped=4.0 2023-06-21 20:23:19,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1053498.0, ans=0.0 2023-06-21 20:23:38,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1053558.0, ans=0.1 2023-06-21 20:23:40,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.85 vs. limit=15.0 2023-06-21 20:23:54,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=12.0 2023-06-21 20:23:55,354 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:23:59,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1053618.0, ans=0.1 2023-06-21 20:24:23,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1053678.0, ans=0.125 2023-06-21 20:24:35,497 INFO [train.py:996] (2/4) Epoch 6, batch 23150, loss[loss=0.2299, simple_loss=0.2971, pruned_loss=0.08135, over 21829.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.314, pruned_loss=0.08782, over 4274670.03 frames. ], batch size: 332, lr: 4.98e-03, grad_scale: 16.0 2023-06-21 20:25:42,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1053978.0, ans=0.2 2023-06-21 20:25:58,098 INFO [train.py:996] (2/4) Epoch 6, batch 23200, loss[loss=0.2358, simple_loss=0.2912, pruned_loss=0.09021, over 21726.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3125, pruned_loss=0.08849, over 4284069.88 frames. 
], batch size: 230, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:25:59,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1054038.0, ans=0.0 2023-06-21 20:26:13,452 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.087e+02 2.774e+02 3.196e+02 3.706e+02 6.362e+02, threshold=6.391e+02, percent-clipped=0.0 2023-06-21 20:26:16,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1054038.0, ans=0.125 2023-06-21 20:26:50,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1054218.0, ans=0.0 2023-06-21 20:27:30,830 INFO [train.py:996] (2/4) Epoch 6, batch 23250, loss[loss=0.2258, simple_loss=0.2877, pruned_loss=0.08189, over 21576.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3126, pruned_loss=0.0904, over 4291917.51 frames. ], batch size: 548, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:28:06,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1054458.0, ans=0.1 2023-06-21 20:29:00,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1054578.0, ans=0.0 2023-06-21 20:29:05,991 INFO [train.py:996] (2/4) Epoch 6, batch 23300, loss[loss=0.2762, simple_loss=0.3915, pruned_loss=0.08043, over 21839.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3221, pruned_loss=0.09265, over 4291271.38 frames. ], batch size: 371, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:29:10,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-21 20:29:12,270 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.269e+02 2.961e+02 3.509e+02 4.048e+02 6.618e+02, threshold=7.018e+02, percent-clipped=1.0 2023-06-21 20:29:35,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1054758.0, ans=0.035 2023-06-21 20:30:40,395 INFO [train.py:996] (2/4) Epoch 6, batch 23350, loss[loss=0.2173, simple_loss=0.3185, pruned_loss=0.05799, over 20827.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3267, pruned_loss=0.0917, over 4290394.50 frames. ], batch size: 607, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:30:51,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1054938.0, ans=0.125 2023-06-21 20:30:53,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1054938.0, ans=0.125 2023-06-21 20:32:13,213 INFO [train.py:996] (2/4) Epoch 6, batch 23400, loss[loss=0.2315, simple_loss=0.2921, pruned_loss=0.08539, over 21348.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3193, pruned_loss=0.08719, over 4294367.05 frames. 
], batch size: 176, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:32:18,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.059e+02 2.966e+02 3.517e+02 4.346e+02 6.933e+02, threshold=7.034e+02, percent-clipped=0.0 2023-06-21 20:32:43,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1055358.0, ans=0.025 2023-06-21 20:33:35,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1055478.0, ans=0.0 2023-06-21 20:33:35,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-21 20:33:41,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1055478.0, ans=0.125 2023-06-21 20:33:47,364 INFO [train.py:996] (2/4) Epoch 6, batch 23450, loss[loss=0.3456, simple_loss=0.3877, pruned_loss=0.1518, over 21441.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3211, pruned_loss=0.09038, over 4298253.72 frames. ], batch size: 471, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:33:50,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1055538.0, ans=0.125 2023-06-21 20:34:00,436 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.01 vs. limit=22.5 2023-06-21 20:34:40,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1055718.0, ans=0.0 2023-06-21 20:34:45,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-21 20:34:57,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=22.5 2023-06-21 20:35:20,274 INFO [train.py:996] (2/4) Epoch 6, batch 23500, loss[loss=0.1873, simple_loss=0.2499, pruned_loss=0.0624, over 19919.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3206, pruned_loss=0.09219, over 4289949.78 frames. ], batch size: 703, lr: 4.98e-03, grad_scale: 16.0 2023-06-21 20:35:27,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 2.940e+02 3.315e+02 3.870e+02 5.953e+02, threshold=6.630e+02, percent-clipped=0.0 2023-06-21 20:35:41,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1055898.0, ans=0.2 2023-06-21 20:36:30,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1056018.0, ans=0.0 2023-06-21 20:36:41,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1056078.0, ans=0.125 2023-06-21 20:36:53,698 INFO [train.py:996] (2/4) Epoch 6, batch 23550, loss[loss=0.2037, simple_loss=0.2615, pruned_loss=0.07292, over 21554.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3143, pruned_loss=0.09151, over 4288996.92 frames. 
], batch size: 212, lr: 4.98e-03, grad_scale: 16.0 2023-06-21 20:37:03,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1056138.0, ans=0.125 2023-06-21 20:37:04,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1056138.0, ans=0.0 2023-06-21 20:37:18,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1056198.0, ans=0.2 2023-06-21 20:37:53,008 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:38:27,630 INFO [train.py:996] (2/4) Epoch 6, batch 23600, loss[loss=0.2473, simple_loss=0.321, pruned_loss=0.0868, over 21552.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3166, pruned_loss=0.0921, over 4292174.59 frames. ], batch size: 389, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:38:34,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 2.807e+02 3.254e+02 4.113e+02 6.430e+02, threshold=6.509e+02, percent-clipped=0.0 2023-06-21 20:40:01,263 INFO [train.py:996] (2/4) Epoch 6, batch 23650, loss[loss=0.2607, simple_loss=0.3306, pruned_loss=0.09543, over 21453.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3168, pruned_loss=0.08954, over 4282974.31 frames. ], batch size: 131, lr: 4.98e-03, grad_scale: 32.0 2023-06-21 20:40:01,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1056738.0, ans=0.1 2023-06-21 20:40:18,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1056738.0, ans=0.0 2023-06-21 20:41:39,602 INFO [train.py:996] (2/4) Epoch 6, batch 23700, loss[loss=0.2338, simple_loss=0.3134, pruned_loss=0.0771, over 19925.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3181, pruned_loss=0.0883, over 4278048.38 frames. ], batch size: 704, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:41:51,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-21 20:41:51,804 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.178e+02 2.889e+02 3.360e+02 4.132e+02 7.517e+02, threshold=6.720e+02, percent-clipped=1.0 2023-06-21 20:41:52,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1057038.0, ans=0.125 2023-06-21 20:42:31,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1057158.0, ans=0.1 2023-06-21 20:42:43,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1057218.0, ans=0.2 2023-06-21 20:43:08,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1057278.0, ans=0.125 2023-06-21 20:43:20,238 INFO [train.py:996] (2/4) Epoch 6, batch 23750, loss[loss=0.2269, simple_loss=0.3186, pruned_loss=0.06761, over 21765.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3225, pruned_loss=0.0894, over 4278109.22 frames. 
], batch size: 298, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:43:56,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1057398.0, ans=0.04949747468305833 2023-06-21 20:44:05,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1057458.0, ans=0.125 2023-06-21 20:44:09,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1057458.0, ans=0.125 2023-06-21 20:44:55,716 INFO [train.py:996] (2/4) Epoch 6, batch 23800, loss[loss=0.244, simple_loss=0.3236, pruned_loss=0.08217, over 21357.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3192, pruned_loss=0.08703, over 4280144.77 frames. ], batch size: 211, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:45:03,401 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.185e+02 2.613e+02 2.976e+02 3.389e+02 5.789e+02, threshold=5.953e+02, percent-clipped=0.0 2023-06-21 20:45:04,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1057638.0, ans=0.1 2023-06-21 20:45:15,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.49 vs. limit=15.0 2023-06-21 20:45:40,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1057758.0, ans=0.1 2023-06-21 20:45:49,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1057818.0, ans=0.05 2023-06-21 20:46:17,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1057878.0, ans=0.125 2023-06-21 20:46:19,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1057878.0, ans=0.1 2023-06-21 20:46:22,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1057878.0, ans=0.2 2023-06-21 20:46:25,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. limit=10.0 2023-06-21 20:46:30,957 INFO [train.py:996] (2/4) Epoch 6, batch 23850, loss[loss=0.2581, simple_loss=0.333, pruned_loss=0.09157, over 21342.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3306, pruned_loss=0.09039, over 4286267.07 frames. ], batch size: 549, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:46:38,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1057938.0, ans=0.125 2023-06-21 20:46:58,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1057998.0, ans=0.0 2023-06-21 20:47:06,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-21 20:47:21,735 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:48:00,280 INFO [train.py:996] (2/4) Epoch 6, batch 23900, loss[loss=0.2189, simple_loss=0.296, pruned_loss=0.07094, over 21489.00 frames. 
], tot_loss[loss=0.2615, simple_loss=0.3374, pruned_loss=0.09278, over 4289634.82 frames. ], batch size: 230, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:48:03,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1058238.0, ans=0.125 2023-06-21 20:48:07,686 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 3.320e+02 3.834e+02 4.673e+02 6.802e+02, threshold=7.669e+02, percent-clipped=5.0 2023-06-21 20:48:30,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1058358.0, ans=0.125 2023-06-21 20:49:08,516 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-21 20:49:18,503 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 20:49:33,461 INFO [train.py:996] (2/4) Epoch 6, batch 23950, loss[loss=0.2178, simple_loss=0.2813, pruned_loss=0.07716, over 21609.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3308, pruned_loss=0.09276, over 4283093.28 frames. ], batch size: 247, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:50:54,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1058778.0, ans=0.1 2023-06-21 20:51:08,212 INFO [train.py:996] (2/4) Epoch 6, batch 24000, loss[loss=0.2507, simple_loss=0.3209, pruned_loss=0.09022, over 21728.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3302, pruned_loss=0.09501, over 4281116.92 frames. ], batch size: 298, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:51:08,212 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 20:51:24,748 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2687, simple_loss=0.3663, pruned_loss=0.08552, over 1796401.00 frames. 2023-06-21 20:51:24,749 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 20:51:32,337 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.640e+02 3.190e+02 3.718e+02 4.654e+02 6.990e+02, threshold=7.435e+02, percent-clipped=0.0 2023-06-21 20:51:55,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1058898.0, ans=0.2 2023-06-21 20:52:32,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1059018.0, ans=0.0 2023-06-21 20:52:38,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-21 20:52:58,909 INFO [train.py:996] (2/4) Epoch 6, batch 24050, loss[loss=0.2654, simple_loss=0.3448, pruned_loss=0.09304, over 21865.00 frames. ], tot_loss[loss=0.262, simple_loss=0.332, pruned_loss=0.09597, over 4279658.90 frames. 
], batch size: 371, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:53:00,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1059138.0, ans=0.125 2023-06-21 20:53:02,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1059138.0, ans=0.125 2023-06-21 20:53:18,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-21 20:53:47,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1059258.0, ans=0.0 2023-06-21 20:53:51,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1059258.0, ans=0.025 2023-06-21 20:54:07,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=15.64 vs. limit=22.5 2023-06-21 20:54:12,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-21 20:54:27,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1059378.0, ans=0.125 2023-06-21 20:54:33,342 INFO [train.py:996] (2/4) Epoch 6, batch 24100, loss[loss=0.2396, simple_loss=0.2967, pruned_loss=0.09126, over 20153.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3326, pruned_loss=0.09442, over 4277091.39 frames. ], batch size: 703, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 20:54:39,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1059438.0, ans=0.0 2023-06-21 20:54:40,891 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.112e+02 2.753e+02 3.093e+02 3.531e+02 5.265e+02, threshold=6.186e+02, percent-clipped=0.0 2023-06-21 20:55:05,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1059498.0, ans=0.125 2023-06-21 20:55:57,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1059678.0, ans=0.2 2023-06-21 20:55:57,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1059678.0, ans=0.125 2023-06-21 20:56:07,329 INFO [train.py:996] (2/4) Epoch 6, batch 24150, loss[loss=0.2276, simple_loss=0.2941, pruned_loss=0.08056, over 21880.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3317, pruned_loss=0.09568, over 4279652.85 frames. ], batch size: 298, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 20:56:33,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1059738.0, ans=0.125 2023-06-21 20:56:51,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1059858.0, ans=0.125 2023-06-21 20:57:51,182 INFO [train.py:996] (2/4) Epoch 6, batch 24200, loss[loss=0.2579, simple_loss=0.3315, pruned_loss=0.0921, over 21642.00 frames. ], tot_loss[loss=0.2654, simple_loss=0.3353, pruned_loss=0.09775, over 4289204.20 frames. 
], batch size: 263, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 20:57:56,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1060038.0, ans=0.125 2023-06-21 20:57:59,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1060038.0, ans=0.125 2023-06-21 20:58:05,321 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.424e+02 3.135e+02 3.604e+02 4.507e+02 8.443e+02, threshold=7.208e+02, percent-clipped=5.0 2023-06-21 20:58:14,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1060098.0, ans=0.2 2023-06-21 20:58:28,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1060098.0, ans=0.0 2023-06-21 20:59:30,805 INFO [train.py:996] (2/4) Epoch 6, batch 24250, loss[loss=0.2115, simple_loss=0.3, pruned_loss=0.06148, over 21629.00 frames. ], tot_loss[loss=0.2555, simple_loss=0.3306, pruned_loss=0.09019, over 4285952.58 frames. ], batch size: 230, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 20:59:40,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1060338.0, ans=0.0 2023-06-21 20:59:56,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1060398.0, ans=0.0 2023-06-21 21:00:17,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1060458.0, ans=0.125 2023-06-21 21:01:01,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5 2023-06-21 21:01:03,896 INFO [train.py:996] (2/4) Epoch 6, batch 24300, loss[loss=0.2107, simple_loss=0.3033, pruned_loss=0.05911, over 21341.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3227, pruned_loss=0.08427, over 4284599.54 frames. ], batch size: 548, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 21:01:06,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1060638.0, ans=0.02 2023-06-21 21:01:12,932 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.875e+02 2.484e+02 3.071e+02 3.742e+02 5.232e+02, threshold=6.142e+02, percent-clipped=0.0 2023-06-21 21:02:37,563 INFO [train.py:996] (2/4) Epoch 6, batch 24350, loss[loss=0.2569, simple_loss=0.323, pruned_loss=0.0954, over 21678.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3208, pruned_loss=0.08503, over 4288193.39 frames. 
], batch size: 263, lr: 4.97e-03, grad_scale: 16.0 2023-06-21 21:02:51,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1060938.0, ans=0.125 2023-06-21 21:02:53,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1060938.0, ans=0.125 2023-06-21 21:03:00,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1060998.0, ans=0.125 2023-06-21 21:03:03,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1060998.0, ans=0.0 2023-06-21 21:03:18,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1061058.0, ans=0.125 2023-06-21 21:03:20,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1061058.0, ans=15.0 2023-06-21 21:03:22,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=12.0 2023-06-21 21:03:34,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1061118.0, ans=0.125 2023-06-21 21:04:16,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=12.0 2023-06-21 21:04:16,764 INFO [train.py:996] (2/4) Epoch 6, batch 24400, loss[loss=0.2375, simple_loss=0.3038, pruned_loss=0.08563, over 20795.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3259, pruned_loss=0.08896, over 4284295.49 frames. ], batch size: 608, lr: 4.97e-03, grad_scale: 32.0 2023-06-21 21:04:17,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1061238.0, ans=0.125 2023-06-21 21:04:21,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1061238.0, ans=0.125 2023-06-21 21:04:25,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.402e+02 3.143e+02 3.570e+02 4.226e+02 5.954e+02, threshold=7.140e+02, percent-clipped=0.0 2023-06-21 21:04:30,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1061298.0, ans=0.125 2023-06-21 21:04:34,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1061298.0, ans=0.125 2023-06-21 21:04:55,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1061358.0, ans=0.2 2023-06-21 21:05:18,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1061418.0, ans=0.0 2023-06-21 21:05:23,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.56 vs. 
limit=15.0 2023-06-21 21:05:30,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1061478.0, ans=0.0 2023-06-21 21:05:44,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1061478.0, ans=0.125 2023-06-21 21:05:51,477 INFO [train.py:996] (2/4) Epoch 6, batch 24450, loss[loss=0.2474, simple_loss=0.3374, pruned_loss=0.07873, over 21695.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3272, pruned_loss=0.08968, over 4284459.82 frames. ], batch size: 298, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:06:50,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1061718.0, ans=0.125 2023-06-21 21:06:53,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1061718.0, ans=0.1 2023-06-21 21:07:05,234 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:07:24,541 INFO [train.py:996] (2/4) Epoch 6, batch 24500, loss[loss=0.2745, simple_loss=0.3369, pruned_loss=0.106, over 21549.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3264, pruned_loss=0.08938, over 4280147.53 frames. ], batch size: 548, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:07:33,654 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.214e+02 2.842e+02 3.184e+02 3.780e+02 5.341e+02, threshold=6.369e+02, percent-clipped=0.0 2023-06-21 21:07:36,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1061838.0, ans=0.125 2023-06-21 21:08:18,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1061958.0, ans=0.2 2023-06-21 21:08:43,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1062078.0, ans=0.0 2023-06-21 21:08:53,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1062078.0, ans=0.0 2023-06-21 21:08:58,974 INFO [train.py:996] (2/4) Epoch 6, batch 24550, loss[loss=0.2727, simple_loss=0.3458, pruned_loss=0.09977, over 21201.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.329, pruned_loss=0.09161, over 4286355.59 frames. ], batch size: 143, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:09:49,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1062258.0, ans=0.125 2023-06-21 21:10:27,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1062378.0, ans=0.2 2023-06-21 21:10:32,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1062438.0, ans=0.125 2023-06-21 21:10:33,717 INFO [train.py:996] (2/4) Epoch 6, batch 24600, loss[loss=0.2272, simple_loss=0.2895, pruned_loss=0.08242, over 21689.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3245, pruned_loss=0.09161, over 4283472.11 frames. ], batch size: 333, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:10:34,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.29 vs. 
limit=10.0 2023-06-21 21:10:44,186 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 2.960e+02 3.461e+02 4.086e+02 6.859e+02, threshold=6.922e+02, percent-clipped=1.0 2023-06-21 21:11:17,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.48 vs. limit=15.0 2023-06-21 21:11:33,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1062618.0, ans=0.5 2023-06-21 21:12:08,364 INFO [train.py:996] (2/4) Epoch 6, batch 24650, loss[loss=0.2142, simple_loss=0.2777, pruned_loss=0.07534, over 21349.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3175, pruned_loss=0.0899, over 4283973.09 frames. ], batch size: 131, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:12:18,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1062738.0, ans=0.0 2023-06-21 21:13:01,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1062858.0, ans=0.125 2023-06-21 21:13:41,568 INFO [train.py:996] (2/4) Epoch 6, batch 24700, loss[loss=0.2376, simple_loss=0.2978, pruned_loss=0.08868, over 21558.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3152, pruned_loss=0.08755, over 4275489.38 frames. ], batch size: 263, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:13:44,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1063038.0, ans=15.0 2023-06-21 21:13:44,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-21 21:13:48,643 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.42 vs. limit=10.0 2023-06-21 21:13:51,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.340e+02 2.793e+02 3.149e+02 3.525e+02 6.939e+02, threshold=6.298e+02, percent-clipped=1.0 2023-06-21 21:14:28,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1063158.0, ans=22.5 2023-06-21 21:14:36,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1063158.0, ans=0.2 2023-06-21 21:14:48,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1063218.0, ans=0.0 2023-06-21 21:14:54,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1063218.0, ans=0.0 2023-06-21 21:15:15,566 INFO [train.py:996] (2/4) Epoch 6, batch 24750, loss[loss=0.2067, simple_loss=0.264, pruned_loss=0.07466, over 21189.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3105, pruned_loss=0.08437, over 4255402.61 frames. 
], batch size: 176, lr: 4.96e-03, grad_scale: 16.0 2023-06-21 21:15:16,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1063338.0, ans=0.125 2023-06-21 21:15:19,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1063338.0, ans=0.07 2023-06-21 21:16:33,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.65 vs. limit=22.5 2023-06-21 21:16:34,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1063578.0, ans=0.125 2023-06-21 21:16:39,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1063578.0, ans=0.1 2023-06-21 21:16:49,357 INFO [train.py:996] (2/4) Epoch 6, batch 24800, loss[loss=0.2671, simple_loss=0.3266, pruned_loss=0.1038, over 21917.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3044, pruned_loss=0.08419, over 4260402.62 frames. ], batch size: 333, lr: 4.96e-03, grad_scale: 32.0 2023-06-21 21:16:51,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1063638.0, ans=0.125 2023-06-21 21:17:07,678 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.211e+02 2.811e+02 3.326e+02 3.870e+02 1.010e+03, threshold=6.653e+02, percent-clipped=1.0 2023-06-21 21:17:30,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1063698.0, ans=0.0 2023-06-21 21:18:22,757 INFO [train.py:996] (2/4) Epoch 6, batch 24850, loss[loss=0.2052, simple_loss=0.2735, pruned_loss=0.06845, over 21630.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3052, pruned_loss=0.08596, over 4273606.28 frames. ], batch size: 230, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:18:29,849 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-21 21:18:33,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1063938.0, ans=0.125 2023-06-21 21:18:44,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1063998.0, ans=0.2 2023-06-21 21:19:02,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=9.33 vs. limit=15.0 2023-06-21 21:19:57,067 INFO [train.py:996] (2/4) Epoch 6, batch 24900, loss[loss=0.2447, simple_loss=0.322, pruned_loss=0.08371, over 21748.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3075, pruned_loss=0.0868, over 4277422.75 frames. ], batch size: 332, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:20:15,236 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.413e+02 3.136e+02 3.665e+02 4.988e+02 9.346e+02, threshold=7.330e+02, percent-clipped=11.0 2023-06-21 21:21:38,221 INFO [train.py:996] (2/4) Epoch 6, batch 24950, loss[loss=0.2862, simple_loss=0.3605, pruned_loss=0.1059, over 21819.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3157, pruned_loss=0.09135, over 4276414.45 frames. 
], batch size: 118, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:22:40,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1064718.0, ans=0.07 2023-06-21 21:22:48,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1064718.0, ans=0.1 2023-06-21 21:23:06,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0 2023-06-21 21:23:10,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-21 21:23:18,865 INFO [train.py:996] (2/4) Epoch 6, batch 25000, loss[loss=0.2244, simple_loss=0.3136, pruned_loss=0.06764, over 20725.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3215, pruned_loss=0.09267, over 4282471.42 frames. ], batch size: 607, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:23:26,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1064838.0, ans=0.2 2023-06-21 21:23:27,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1064838.0, ans=0.0 2023-06-21 21:23:36,910 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.352e+02 2.934e+02 3.469e+02 4.480e+02 7.234e+02, threshold=6.939e+02, percent-clipped=0.0 2023-06-21 21:23:48,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.98 vs. limit=15.0 2023-06-21 21:24:25,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.73 vs. limit=10.0 2023-06-21 21:24:37,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1065078.0, ans=0.2 2023-06-21 21:24:52,474 INFO [train.py:996] (2/4) Epoch 6, batch 25050, loss[loss=0.2325, simple_loss=0.2938, pruned_loss=0.08562, over 22017.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3135, pruned_loss=0.09034, over 4278720.44 frames. ], batch size: 103, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:25:21,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1065198.0, ans=0.95 2023-06-21 21:26:27,059 INFO [train.py:996] (2/4) Epoch 6, batch 25100, loss[loss=0.2386, simple_loss=0.3317, pruned_loss=0.07273, over 21676.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3083, pruned_loss=0.0892, over 4269433.49 frames. ], batch size: 332, lr: 4.96e-03, grad_scale: 8.0 2023-06-21 21:26:45,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.294e+02 2.865e+02 3.430e+02 4.483e+02 9.616e+02, threshold=6.861e+02, percent-clipped=4.0 2023-06-21 21:27:28,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1065618.0, ans=0.1 2023-06-21 21:28:01,872 INFO [train.py:996] (2/4) Epoch 6, batch 25150, loss[loss=0.259, simple_loss=0.339, pruned_loss=0.08948, over 21826.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3111, pruned_loss=0.0867, over 4264764.33 frames. 
], batch size: 351, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:28:14,617 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-21 21:28:49,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1065858.0, ans=0.1 2023-06-21 21:29:18,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1065978.0, ans=0.0 2023-06-21 21:29:32,227 INFO [train.py:996] (2/4) Epoch 6, batch 25200, loss[loss=0.2012, simple_loss=0.2828, pruned_loss=0.0598, over 21386.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3113, pruned_loss=0.08526, over 4268126.61 frames. ], batch size: 131, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:29:55,467 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.130e+02 2.627e+02 3.080e+02 3.902e+02 5.113e+02, threshold=6.160e+02, percent-clipped=0.0 2023-06-21 21:29:59,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1066098.0, ans=0.0 2023-06-21 21:30:09,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1066098.0, ans=0.1 2023-06-21 21:30:10,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-21 21:30:36,728 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:30:40,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1066218.0, ans=0.125 2023-06-21 21:31:06,420 INFO [train.py:996] (2/4) Epoch 6, batch 25250, loss[loss=0.2312, simple_loss=0.2902, pruned_loss=0.08607, over 21227.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3084, pruned_loss=0.08339, over 4260123.79 frames. ], batch size: 548, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:31:17,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1066338.0, ans=0.125 2023-06-21 21:32:07,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1066518.0, ans=0.125 2023-06-21 21:32:46,514 INFO [train.py:996] (2/4) Epoch 6, batch 25300, loss[loss=0.2304, simple_loss=0.3078, pruned_loss=0.07645, over 21723.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3064, pruned_loss=0.08321, over 4249015.67 frames. ], batch size: 298, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:33:04,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1066638.0, ans=0.0 2023-06-21 21:33:05,217 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.215e+02 2.867e+02 3.250e+02 3.935e+02 6.834e+02, threshold=6.501e+02, percent-clipped=3.0 2023-06-21 21:34:02,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1066878.0, ans=0.125 2023-06-21 21:34:21,341 INFO [train.py:996] (2/4) Epoch 6, batch 25350, loss[loss=0.1945, simple_loss=0.2776, pruned_loss=0.05572, over 21751.00 frames. 
], tot_loss[loss=0.2379, simple_loss=0.3092, pruned_loss=0.08328, over 4239294.53 frames. ], batch size: 282, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:34:28,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1066938.0, ans=0.125 2023-06-21 21:34:39,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1066998.0, ans=0.04949747468305833 2023-06-21 21:35:06,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-21 21:35:11,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1067058.0, ans=0.125 2023-06-21 21:35:49,711 INFO [train.py:996] (2/4) Epoch 6, batch 25400, loss[loss=0.2501, simple_loss=0.314, pruned_loss=0.09309, over 21745.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3047, pruned_loss=0.08187, over 4244811.95 frames. ], batch size: 112, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:35:51,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1067238.0, ans=0.125 2023-06-21 21:36:13,082 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.194e+02 2.660e+02 3.051e+02 3.605e+02 5.899e+02, threshold=6.102e+02, percent-clipped=0.0 2023-06-21 21:37:20,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1067478.0, ans=0.125 2023-06-21 21:37:30,719 INFO [train.py:996] (2/4) Epoch 6, batch 25450, loss[loss=0.2213, simple_loss=0.3189, pruned_loss=0.06181, over 21791.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3063, pruned_loss=0.08314, over 4243739.90 frames. ], batch size: 282, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:38:12,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1067658.0, ans=0.2 2023-06-21 21:38:46,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1067778.0, ans=0.125 2023-06-21 21:38:47,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1067778.0, ans=0.125 2023-06-21 21:39:10,619 INFO [train.py:996] (2/4) Epoch 6, batch 25500, loss[loss=0.2079, simple_loss=0.3001, pruned_loss=0.05784, over 21730.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3066, pruned_loss=0.08087, over 4243254.07 frames. ], batch size: 332, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:39:20,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1067838.0, ans=0.1 2023-06-21 21:39:25,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.037e+02 2.857e+02 3.431e+02 4.303e+02 7.136e+02, threshold=6.862e+02, percent-clipped=5.0 2023-06-21 21:40:45,935 INFO [train.py:996] (2/4) Epoch 6, batch 25550, loss[loss=0.2096, simple_loss=0.3293, pruned_loss=0.04497, over 21261.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3115, pruned_loss=0.08031, over 4240670.79 frames. 
], batch size: 548, lr: 4.95e-03, grad_scale: 8.0 2023-06-21 21:40:50,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068138.0, ans=0.1 2023-06-21 21:40:50,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1068138.0, ans=0.125 2023-06-21 21:41:18,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-21 21:41:24,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1068258.0, ans=0.0 2023-06-21 21:41:32,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1068258.0, ans=0.0 2023-06-21 21:41:33,937 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:41:40,282 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:42:21,197 INFO [train.py:996] (2/4) Epoch 6, batch 25600, loss[loss=0.2939, simple_loss=0.4166, pruned_loss=0.08558, over 19752.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3186, pruned_loss=0.08137, over 4247353.39 frames. ], batch size: 702, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:42:41,374 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.229e+02 2.868e+02 3.276e+02 3.835e+02 9.464e+02, threshold=6.552e+02, percent-clipped=3.0 2023-06-21 21:42:52,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068498.0, ans=0.1 2023-06-21 21:43:03,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1068558.0, ans=0.05 2023-06-21 21:43:56,052 INFO [train.py:996] (2/4) Epoch 6, batch 25650, loss[loss=0.226, simple_loss=0.2955, pruned_loss=0.07822, over 21773.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3202, pruned_loss=0.08447, over 4250677.40 frames. ], batch size: 124, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:44:21,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1068798.0, ans=0.0 2023-06-21 21:44:31,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1068858.0, ans=0.2 2023-06-21 21:44:39,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1068858.0, ans=0.125 2023-06-21 21:44:44,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. 
limit=6.0 2023-06-21 21:44:56,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1068918.0, ans=0.1 2023-06-21 21:45:17,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1068978.0, ans=0.125 2023-06-21 21:45:24,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1068978.0, ans=0.125 2023-06-21 21:45:28,619 INFO [train.py:996] (2/4) Epoch 6, batch 25700, loss[loss=0.2697, simple_loss=0.3414, pruned_loss=0.09904, over 21528.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3167, pruned_loss=0.08604, over 4251378.62 frames. ], batch size: 471, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:45:30,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1069038.0, ans=0.2 2023-06-21 21:45:48,635 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 2.859e+02 3.225e+02 3.794e+02 7.100e+02, threshold=6.450e+02, percent-clipped=1.0 2023-06-21 21:46:27,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1069218.0, ans=0.125 2023-06-21 21:46:29,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1069218.0, ans=0.125 2023-06-21 21:46:51,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069278.0, ans=0.1 2023-06-21 21:47:05,082 INFO [train.py:996] (2/4) Epoch 6, batch 25750, loss[loss=0.3185, simple_loss=0.3894, pruned_loss=0.1239, over 21702.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3212, pruned_loss=0.08913, over 4249206.45 frames. ], batch size: 351, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:47:56,253 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.951e-02 2023-06-21 21:48:00,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1069458.0, ans=0.1 2023-06-21 21:48:17,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1069518.0, ans=0.125 2023-06-21 21:48:50,362 INFO [train.py:996] (2/4) Epoch 6, batch 25800, loss[loss=0.256, simple_loss=0.3232, pruned_loss=0.09447, over 20744.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3312, pruned_loss=0.09311, over 4254127.87 frames. ], batch size: 607, lr: 4.95e-03, grad_scale: 16.0 2023-06-21 21:49:10,565 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 3.168e+02 3.870e+02 4.969e+02 1.145e+03, threshold=7.739e+02, percent-clipped=13.0 2023-06-21 21:50:05,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-21 21:50:25,966 INFO [train.py:996] (2/4) Epoch 6, batch 25850, loss[loss=0.2099, simple_loss=0.2867, pruned_loss=0.06652, over 21426.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3336, pruned_loss=0.09258, over 4260263.84 frames. 
], batch size: 211, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:50:44,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1069938.0, ans=0.0 2023-06-21 21:51:12,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1070058.0, ans=0.0 2023-06-21 21:51:29,452 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:51:30,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1070118.0, ans=0.1 2023-06-21 21:51:34,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1070118.0, ans=0.0 2023-06-21 21:51:43,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1070118.0, ans=0.0 2023-06-21 21:51:50,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1070178.0, ans=0.0 2023-06-21 21:51:51,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1070178.0, ans=0.125 2023-06-21 21:52:10,940 INFO [train.py:996] (2/4) Epoch 6, batch 25900, loss[loss=0.3508, simple_loss=0.4412, pruned_loss=0.1302, over 21305.00 frames. ], tot_loss[loss=0.2621, simple_loss=0.337, pruned_loss=0.09365, over 4261615.12 frames. ], batch size: 548, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:52:15,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1070238.0, ans=0.2 2023-06-21 21:52:25,940 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.151e+02 3.005e+02 3.553e+02 4.246e+02 7.646e+02, threshold=7.106e+02, percent-clipped=0.0 2023-06-21 21:52:31,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1070298.0, ans=0.125 2023-06-21 21:52:34,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1070298.0, ans=0.125 2023-06-21 21:52:38,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1070298.0, ans=0.125 2023-06-21 21:52:46,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1070358.0, ans=0.07 2023-06-21 21:52:48,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1070358.0, ans=0.2 2023-06-21 21:53:18,581 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 21:53:27,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1070478.0, ans=0.125 2023-06-21 21:53:31,098 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-21 21:53:45,640 INFO [train.py:996] (2/4) Epoch 6, batch 25950, loss[loss=0.2648, simple_loss=0.3418, pruned_loss=0.09388, over 21757.00 frames. ], tot_loss[loss=0.2667, simple_loss=0.3416, pruned_loss=0.0959, over 4270300.45 frames. 
], batch size: 332, lr: 4.94e-03, grad_scale: 16.0 2023-06-21 21:53:49,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1070538.0, ans=0.025 2023-06-21 21:54:18,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1070598.0, ans=0.0 2023-06-21 21:54:20,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-21 21:54:22,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1070658.0, ans=0.125 2023-06-21 21:54:37,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1070658.0, ans=0.1 2023-06-21 21:55:09,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1070778.0, ans=0.0 2023-06-21 21:55:21,664 INFO [train.py:996] (2/4) Epoch 6, batch 26000, loss[loss=0.2853, simple_loss=0.3659, pruned_loss=0.1023, over 21486.00 frames. ], tot_loss[loss=0.2641, simple_loss=0.3411, pruned_loss=0.0936, over 4268240.57 frames. ], batch size: 131, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:55:41,609 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.013e+02 3.120e+02 3.589e+02 4.615e+02 8.181e+02, threshold=7.178e+02, percent-clipped=1.0 2023-06-21 21:56:09,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1070958.0, ans=0.125 2023-06-21 21:56:55,913 INFO [train.py:996] (2/4) Epoch 6, batch 26050, loss[loss=0.2382, simple_loss=0.3, pruned_loss=0.08817, over 21712.00 frames. ], tot_loss[loss=0.2661, simple_loss=0.3418, pruned_loss=0.09523, over 4273589.80 frames. ], batch size: 230, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:57:02,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-21 21:57:11,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1071138.0, ans=0.2 2023-06-21 21:57:22,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1071198.0, ans=0.05 2023-06-21 21:57:22,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-21 21:58:11,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-21 21:58:29,349 INFO [train.py:996] (2/4) Epoch 6, batch 26100, loss[loss=0.247, simple_loss=0.3103, pruned_loss=0.09186, over 21913.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.3344, pruned_loss=0.09474, over 4277688.63 frames. 
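The tot_loss[...] statistics are not cumulative over the whole epoch: the frame count they are reported over stays near 4.2M and drifts up and down from batch to batch, which points to a decaying window over recent batches rather than a running total. Below is a sketch of one way such a windowed, frame-weighted average could be maintained; the decay factor and bookkeeping are assumptions for illustration, not the tracker actually used by train.py.

```python
# Sketch: a frame-weighted running average with exponential forgetting, one way to
# produce "tot_loss ... over ~4.2e6 frames" figures that stay roughly constant in
# size instead of growing with the epoch. The decay value is illustrative only.
class RunningLoss:
    def __init__(self, decay: float = 0.995):
        self.decay = decay
        self.loss_sum = 0.0   # decayed sum of (loss * frames)
        self.frames = 0.0     # decayed sum of frames

    def update(self, batch_loss: float, batch_frames: float) -> None:
        self.loss_sum = self.decay * self.loss_sum + batch_loss * batch_frames
        self.frames = self.decay * self.frames + batch_frames

    @property
    def value(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)

tracker = RunningLoss()
for _ in range(5000):
    tracker.update(batch_loss=0.25, batch_frames=21800.0)
# frames saturate near 21800 / (1 - 0.995) ~ 4.4e6, the same order as the logged ~4.2e6
print(tracker.frames, tracker.value)
```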
], batch size: 414, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 21:58:46,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1071438.0, ans=0.04949747468305833 2023-06-21 21:58:49,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.433e+02 3.066e+02 3.551e+02 4.321e+02 9.246e+02, threshold=7.101e+02, percent-clipped=1.0 2023-06-21 21:59:18,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-21 21:59:55,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1071678.0, ans=0.2 2023-06-21 21:59:58,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-21 22:00:03,484 INFO [train.py:996] (2/4) Epoch 6, batch 26150, loss[loss=0.3467, simple_loss=0.3903, pruned_loss=0.1515, over 21528.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.3321, pruned_loss=0.09548, over 4283835.99 frames. ], batch size: 510, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:01:43,203 INFO [train.py:996] (2/4) Epoch 6, batch 26200, loss[loss=0.2767, simple_loss=0.3849, pruned_loss=0.08419, over 21219.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.3324, pruned_loss=0.09262, over 4281815.78 frames. ], batch size: 548, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:01:58,602 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.206e+02 2.878e+02 3.123e+02 3.619e+02 5.924e+02, threshold=6.246e+02, percent-clipped=0.0 2023-06-21 22:02:28,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1072158.0, ans=0.0 2023-06-21 22:02:37,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1072158.0, ans=0.125 2023-06-21 22:02:53,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1072218.0, ans=0.5 2023-06-21 22:03:17,230 INFO [train.py:996] (2/4) Epoch 6, batch 26250, loss[loss=0.2513, simple_loss=0.3204, pruned_loss=0.09112, over 21684.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3336, pruned_loss=0.09079, over 4278868.70 frames. ], batch size: 263, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:03:43,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.31 vs. limit=12.0 2023-06-21 22:04:03,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1072458.0, ans=0.1 2023-06-21 22:04:50,863 INFO [train.py:996] (2/4) Epoch 6, batch 26300, loss[loss=0.2797, simple_loss=0.3408, pruned_loss=0.1093, over 21529.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.331, pruned_loss=0.09164, over 4285810.60 frames. 
], batch size: 131, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:05:10,180 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.129e+02 2.912e+02 3.361e+02 4.041e+02 6.857e+02, threshold=6.722e+02, percent-clipped=1.0 2023-06-21 22:05:25,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1072698.0, ans=0.2 2023-06-21 22:05:45,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1072758.0, ans=0.0 2023-06-21 22:05:52,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1072818.0, ans=0.0 2023-06-21 22:06:09,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-21 22:06:13,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1072878.0, ans=0.2 2023-06-21 22:06:14,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1072878.0, ans=0.125 2023-06-21 22:06:24,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1072878.0, ans=0.125 2023-06-21 22:06:26,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1072878.0, ans=0.125 2023-06-21 22:06:29,502 INFO [train.py:996] (2/4) Epoch 6, batch 26350, loss[loss=0.2157, simple_loss=0.2783, pruned_loss=0.07653, over 21193.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3299, pruned_loss=0.09234, over 4293580.10 frames. ], batch size: 608, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:06:49,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1072998.0, ans=0.0 2023-06-21 22:07:02,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1072998.0, ans=0.2 2023-06-21 22:07:31,980 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:07:41,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1073178.0, ans=0.2 2023-06-21 22:07:57,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-21 22:08:02,659 INFO [train.py:996] (2/4) Epoch 6, batch 26400, loss[loss=0.2535, simple_loss=0.3055, pruned_loss=0.1007, over 21241.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3235, pruned_loss=0.09222, over 4297269.33 frames. ], batch size: 143, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:08:19,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.63 vs. 
limit=15.0 2023-06-21 22:08:22,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.009e+02 3.283e+02 3.744e+02 6.986e+02, threshold=6.566e+02, percent-clipped=1.0 2023-06-21 22:08:51,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1073358.0, ans=0.125 2023-06-21 22:09:11,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=22.90 vs. limit=22.5 2023-06-21 22:09:29,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1073478.0, ans=0.0 2023-06-21 22:09:43,568 INFO [train.py:996] (2/4) Epoch 6, batch 26450, loss[loss=0.217, simple_loss=0.2882, pruned_loss=0.07295, over 21300.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.322, pruned_loss=0.0914, over 4290332.10 frames. ], batch size: 144, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:10:01,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1073538.0, ans=0.125 2023-06-21 22:11:23,583 INFO [train.py:996] (2/4) Epoch 6, batch 26500, loss[loss=0.206, simple_loss=0.2591, pruned_loss=0.07643, over 21348.00 frames. ], tot_loss[loss=0.2528, simple_loss=0.3243, pruned_loss=0.09067, over 4284688.65 frames. ], batch size: 131, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:11:38,658 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.495e+02 3.232e+02 3.914e+02 4.900e+02 8.574e+02, threshold=7.829e+02, percent-clipped=7.0 2023-06-21 22:12:03,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1073958.0, ans=0.0 2023-06-21 22:12:15,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1073958.0, ans=0.2 2023-06-21 22:12:52,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1074078.0, ans=0.1 2023-06-21 22:12:59,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.66 vs. limit=15.0 2023-06-21 22:12:59,952 INFO [train.py:996] (2/4) Epoch 6, batch 26550, loss[loss=0.2221, simple_loss=0.3168, pruned_loss=0.06373, over 21719.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3241, pruned_loss=0.08763, over 4269261.60 frames. 
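The optim.py lines above report a five-number summary of recent gradient norms followed by the clipping threshold and the percentage of batches clipped; in every entry the threshold equals Clipping_scale times the logged median (e.g. 2.0 · 3.283e+02 = 6.566e+02 and 2.0 · 3.914e+02 ≈ 7.829e+02 nearby). A rough sketch of that bookkeeping follows; the window size and summary method are assumptions, only the threshold = clipping_scale · median relationship is read directly off the logged numbers.

```python
# Sketch of how the "grad-norm quartiles ... threshold ... percent-clipped" entries
# could be produced: a five-number summary (min, 25%, 50%, 75%, max) over a window of
# recent per-batch gradient norms, with the clipping threshold set to clipping_scale
# times the median. Window size and summary method are assumed.
from collections import deque
import random

def five_number_summary(values):
    xs = sorted(values)
    n = len(xs)
    at = lambda q: xs[min(n - 1, round(q * (n - 1)))]
    return xs[0], at(0.25), at(0.5), at(0.75), xs[-1]

def clipping_report(recent_norms, clipping_scale=2.0):
    lo, q1, med, q3, hi = five_number_summary(recent_norms)
    threshold = clipping_scale * med
    pct = 100.0 * sum(g > threshold for g in recent_norms) / len(recent_norms)
    return (lo, q1, med, q3, hi), threshold, pct

# Synthetic norms on the same order of magnitude as the logged values (~2e2 to 8e2).
norms = deque((random.lognormvariate(5.8, 0.3) for _ in range(200)), maxlen=200)
print(clipping_report(norms))
```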
], batch size: 351, lr: 4.94e-03, grad_scale: 32.0 2023-06-21 22:13:17,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1074138.0, ans=0.125 2023-06-21 22:13:29,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1074198.0, ans=0.125 2023-06-21 22:14:10,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1074318.0, ans=0.1 2023-06-21 22:14:18,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1074378.0, ans=0.0 2023-06-21 22:14:21,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1074378.0, ans=0.125 2023-06-21 22:14:24,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.24 vs. limit=10.0 2023-06-21 22:14:34,293 INFO [train.py:996] (2/4) Epoch 6, batch 26600, loss[loss=0.2154, simple_loss=0.2913, pruned_loss=0.06974, over 21576.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3229, pruned_loss=0.08421, over 4273842.63 frames. ], batch size: 263, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:15:00,586 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.868e+02 3.429e+02 4.174e+02 7.700e+02, threshold=6.858e+02, percent-clipped=0.0 2023-06-21 22:15:07,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1074498.0, ans=0.2 2023-06-21 22:15:07,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-21 22:16:01,078 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:16:13,051 INFO [train.py:996] (2/4) Epoch 6, batch 26650, loss[loss=0.2095, simple_loss=0.2721, pruned_loss=0.0734, over 21553.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3175, pruned_loss=0.08377, over 4263615.24 frames. ], batch size: 263, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:17:28,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1074978.0, ans=0.125 2023-06-21 22:17:50,865 INFO [train.py:996] (2/4) Epoch 6, batch 26700, loss[loss=0.1589, simple_loss=0.2419, pruned_loss=0.03798, over 21678.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3097, pruned_loss=0.08076, over 4267386.88 frames. ], batch size: 263, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:18:07,347 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.978e+02 2.839e+02 3.537e+02 4.249e+02 6.809e+02, threshold=7.074e+02, percent-clipped=0.0 2023-06-21 22:18:24,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1075158.0, ans=0.0 2023-06-21 22:18:46,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1075218.0, ans=0.0 2023-06-21 22:19:24,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.81 vs. 
limit=15.0 2023-06-21 22:19:25,052 INFO [train.py:996] (2/4) Epoch 6, batch 26750, loss[loss=0.2941, simple_loss=0.3678, pruned_loss=0.1103, over 21452.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3106, pruned_loss=0.08, over 4274517.04 frames. ], batch size: 131, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:19:25,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1075338.0, ans=0.125 2023-06-21 22:20:03,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1075458.0, ans=0.0 2023-06-21 22:20:18,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1075518.0, ans=0.015 2023-06-21 22:20:26,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1075518.0, ans=0.125 2023-06-21 22:20:57,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1075578.0, ans=0.2 2023-06-21 22:21:00,185 INFO [train.py:996] (2/4) Epoch 6, batch 26800, loss[loss=0.2942, simple_loss=0.3545, pruned_loss=0.1169, over 21478.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3173, pruned_loss=0.08332, over 4272918.37 frames. ], batch size: 194, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:21:09,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1075638.0, ans=0.0 2023-06-21 22:21:25,843 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.025e+02 2.803e+02 3.255e+02 3.983e+02 6.627e+02, threshold=6.510e+02, percent-clipped=0.0 2023-06-21 22:21:29,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1075698.0, ans=0.0 2023-06-21 22:21:34,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.42 vs. limit=10.0 2023-06-21 22:22:18,326 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-21 22:22:29,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1075878.0, ans=0.125 2023-06-21 22:22:38,493 INFO [train.py:996] (2/4) Epoch 6, batch 26850, loss[loss=0.2366, simple_loss=0.3035, pruned_loss=0.08483, over 21462.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3195, pruned_loss=0.08692, over 4279064.55 frames. ], batch size: 131, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:23:39,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=12.0 2023-06-21 22:24:00,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1076178.0, ans=0.125 2023-06-21 22:24:06,565 INFO [train.py:996] (2/4) Epoch 6, batch 26900, loss[loss=0.1861, simple_loss=0.2513, pruned_loss=0.06047, over 21363.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3118, pruned_loss=0.08631, over 4283724.00 frames. 
], batch size: 211, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:24:16,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1076238.0, ans=0.125 2023-06-21 22:24:32,544 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.314e+02 2.927e+02 3.403e+02 4.314e+02 6.686e+02, threshold=6.806e+02, percent-clipped=1.0 2023-06-21 22:24:54,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1076358.0, ans=0.125 2023-06-21 22:25:33,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1076478.0, ans=0.125 2023-06-21 22:25:38,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=15.0 2023-06-21 22:25:40,908 INFO [train.py:996] (2/4) Epoch 6, batch 26950, loss[loss=0.239, simple_loss=0.3328, pruned_loss=0.07262, over 21690.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3115, pruned_loss=0.08713, over 4279080.18 frames. ], batch size: 332, lr: 4.93e-03, grad_scale: 32.0 2023-06-21 22:26:11,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1076598.0, ans=0.0 2023-06-21 22:26:47,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1076718.0, ans=0.1 2023-06-21 22:27:20,698 INFO [train.py:996] (2/4) Epoch 6, batch 27000, loss[loss=0.271, simple_loss=0.3629, pruned_loss=0.08961, over 21546.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3117, pruned_loss=0.08456, over 4271030.22 frames. ], batch size: 441, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:27:20,698 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-21 22:27:39,475 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2469, simple_loss=0.3452, pruned_loss=0.07428, over 1796401.00 frames. 2023-06-21 22:27:39,476 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-21 22:27:44,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1076838.0, ans=0.125 2023-06-21 22:27:57,431 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.007e+02 2.871e+02 3.391e+02 3.871e+02 6.119e+02, threshold=6.783e+02, percent-clipped=0.0 2023-06-21 22:28:20,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1076958.0, ans=0.125 2023-06-21 22:28:28,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1077018.0, ans=0.125 2023-06-21 22:28:31,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-21 22:29:00,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1077078.0, ans=0.125 2023-06-21 22:29:06,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1077078.0, ans=0.2 2023-06-21 22:29:09,016 INFO [train.py:996] (2/4) Epoch 6, batch 27050, loss[loss=0.2412, simple_loss=0.3521, pruned_loss=0.06511, over 20725.00 frames. 
], tot_loss[loss=0.2379, simple_loss=0.3133, pruned_loss=0.08123, over 4267560.70 frames. ], batch size: 607, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:29:18,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1077138.0, ans=0.125 2023-06-21 22:29:25,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1077198.0, ans=0.2 2023-06-21 22:29:41,803 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.26 vs. limit=15.0 2023-06-21 22:30:07,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1077318.0, ans=0.1 2023-06-21 22:30:38,686 INFO [train.py:996] (2/4) Epoch 6, batch 27100, loss[loss=0.2269, simple_loss=0.3041, pruned_loss=0.07485, over 16983.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3155, pruned_loss=0.08279, over 4270979.46 frames. ], batch size: 60, lr: 4.93e-03, grad_scale: 8.0 2023-06-21 22:30:39,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1077438.0, ans=0.125 2023-06-21 22:31:08,042 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.819e+02 3.363e+02 4.112e+02 5.749e+02, threshold=6.726e+02, percent-clipped=0.0 2023-06-21 22:31:13,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.69 vs. limit=10.0 2023-06-21 22:31:48,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1077618.0, ans=0.125 2023-06-21 22:32:04,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1077678.0, ans=0.0 2023-06-21 22:32:13,369 INFO [train.py:996] (2/4) Epoch 6, batch 27150, loss[loss=0.2546, simple_loss=0.3281, pruned_loss=0.09054, over 21281.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3264, pruned_loss=0.08611, over 4271736.32 frames. ], batch size: 159, lr: 4.93e-03, grad_scale: 8.0 2023-06-21 22:32:13,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1077738.0, ans=0.125 2023-06-21 22:32:23,489 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-21 22:32:56,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-21 22:33:29,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1077918.0, ans=0.2 2023-06-21 22:33:44,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1077978.0, ans=0.125 2023-06-21 22:33:47,066 INFO [train.py:996] (2/4) Epoch 6, batch 27200, loss[loss=0.2208, simple_loss=0.3105, pruned_loss=0.06557, over 21447.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3338, pruned_loss=0.08904, over 4268785.10 frames. 
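The ScheduledFloat lines record the current value (ans=...) of hyperparameters that are functions of the global batch_count; by this point in training most of them have settled at their final values (0.125, 0.1, 0.2, 0.0, and so on). A minimal sketch of such a schedule is given below, assuming piecewise-linear interpolation between (batch_count, value) breakpoints; the breakpoints are made up for illustration and are not the schedules used in this run.

```python
# Sketch: a float hyperparameter whose value depends on the global batch_count,
# as in the ScheduledFloat log lines. Piecewise-linear interpolation between
# (batch_count, value) breakpoints is assumed; the breakpoints here are illustrative.
import bisect

class ScheduledValue:
    def __init__(self, *points):  # points: (batch_count, value), increasing in batch_count
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def __call__(self, batch_count: float) -> float:
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

skip_rate = ScheduledValue((0.0, 0.5), (4000.0, 0.05), (16000.0, 0.0))
print(skip_rate(1081578.0))  # 0.0 -- schedules like this plateau long before batch 1e6
```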
], batch size: 211, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:33:53,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1078038.0, ans=0.0 2023-06-21 22:34:15,786 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.516e+02 3.235e+02 3.777e+02 4.284e+02 9.441e+02, threshold=7.555e+02, percent-clipped=8.0 2023-06-21 22:34:16,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1078098.0, ans=0.125 2023-06-21 22:34:48,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-21 22:35:08,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-21 22:35:30,910 INFO [train.py:996] (2/4) Epoch 6, batch 27250, loss[loss=0.2517, simple_loss=0.3238, pruned_loss=0.08979, over 22006.00 frames. ], tot_loss[loss=0.2614, simple_loss=0.3364, pruned_loss=0.09317, over 4264271.98 frames. ], batch size: 317, lr: 4.93e-03, grad_scale: 16.0 2023-06-21 22:35:47,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=5.02 vs. limit=6.0 2023-06-21 22:37:06,682 INFO [train.py:996] (2/4) Epoch 6, batch 27300, loss[loss=0.2708, simple_loss=0.3545, pruned_loss=0.09357, over 21785.00 frames. ], tot_loss[loss=0.2643, simple_loss=0.3389, pruned_loss=0.09482, over 4265255.62 frames. ], batch size: 124, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:37:36,248 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.091e+02 3.407e+02 3.961e+02 5.625e+02, threshold=6.815e+02, percent-clipped=0.0 2023-06-21 22:37:54,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1078758.0, ans=0.2 2023-06-21 22:38:45,640 INFO [train.py:996] (2/4) Epoch 6, batch 27350, loss[loss=0.2532, simple_loss=0.3376, pruned_loss=0.08434, over 21412.00 frames. ], tot_loss[loss=0.2658, simple_loss=0.3415, pruned_loss=0.09511, over 4271111.11 frames. ], batch size: 211, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:39:08,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1078998.0, ans=0.125 2023-06-21 22:40:17,995 INFO [train.py:996] (2/4) Epoch 6, batch 27400, loss[loss=0.2363, simple_loss=0.3015, pruned_loss=0.08556, over 21846.00 frames. ], tot_loss[loss=0.2625, simple_loss=0.3366, pruned_loss=0.09417, over 4265990.08 frames. ], batch size: 118, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:40:18,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1079238.0, ans=0.1 2023-06-21 22:40:24,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1079238.0, ans=0.0 2023-06-21 22:40:43,756 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 2.934e+02 3.230e+02 3.710e+02 5.363e+02, threshold=6.461e+02, percent-clipped=0.0 2023-06-21 22:41:29,150 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. 
limit=15.0 2023-06-21 22:41:31,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1079478.0, ans=0.125 2023-06-21 22:41:51,780 INFO [train.py:996] (2/4) Epoch 6, batch 27450, loss[loss=0.2503, simple_loss=0.3309, pruned_loss=0.08487, over 21639.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3301, pruned_loss=0.09253, over 4260749.66 frames. ], batch size: 231, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:42:53,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1079718.0, ans=0.0 2023-06-21 22:42:58,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1079718.0, ans=0.5 2023-06-21 22:43:20,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1079778.0, ans=0.0 2023-06-21 22:43:22,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1079778.0, ans=0.2 2023-06-21 22:43:24,740 INFO [train.py:996] (2/4) Epoch 6, batch 27500, loss[loss=0.2549, simple_loss=0.3214, pruned_loss=0.09423, over 21247.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3272, pruned_loss=0.09238, over 4266651.35 frames. ], batch size: 143, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:43:25,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1079838.0, ans=0.0 2023-06-21 22:43:50,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.559e+02 2.999e+02 3.729e+02 4.399e+02 9.645e+02, threshold=7.458e+02, percent-clipped=3.0 2023-06-21 22:44:10,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=12.0 2023-06-21 22:44:27,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1080018.0, ans=0.0 2023-06-21 22:44:58,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1080138.0, ans=0.125 2023-06-21 22:44:59,258 INFO [train.py:996] (2/4) Epoch 6, batch 27550, loss[loss=0.2463, simple_loss=0.3132, pruned_loss=0.0897, over 21592.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.325, pruned_loss=0.09009, over 4267189.37 frames. ], batch size: 414, lr: 4.92e-03, grad_scale: 8.0 2023-06-21 22:45:36,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1080198.0, ans=0.125 2023-06-21 22:46:18,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1080378.0, ans=0.0 2023-06-21 22:46:37,807 INFO [train.py:996] (2/4) Epoch 6, batch 27600, loss[loss=0.2425, simple_loss=0.2993, pruned_loss=0.09285, over 21302.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3185, pruned_loss=0.08886, over 4260269.28 frames. 
], batch size: 471, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:46:38,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1080438.0, ans=0.125 2023-06-21 22:46:48,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1080438.0, ans=0.0 2023-06-21 22:46:50,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1080438.0, ans=0.125 2023-06-21 22:46:57,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1080498.0, ans=0.125 2023-06-21 22:46:58,587 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.246e+02 2.844e+02 3.346e+02 3.964e+02 7.072e+02, threshold=6.692e+02, percent-clipped=0.0 2023-06-21 22:48:06,204 INFO [train.py:996] (2/4) Epoch 6, batch 27650, loss[loss=0.2849, simple_loss=0.3578, pruned_loss=0.1061, over 21693.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3126, pruned_loss=0.08826, over 4256377.46 frames. ], batch size: 414, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:48:26,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.79 vs. limit=15.0 2023-06-21 22:48:39,433 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-21 22:49:08,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1080918.0, ans=0.2 2023-06-21 22:49:44,269 INFO [train.py:996] (2/4) Epoch 6, batch 27700, loss[loss=0.2051, simple_loss=0.2573, pruned_loss=0.07642, over 20205.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3114, pruned_loss=0.08588, over 4260739.32 frames. ], batch size: 703, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:50:05,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.051e+02 2.845e+02 3.268e+02 3.924e+02 7.341e+02, threshold=6.535e+02, percent-clipped=1.0 2023-06-21 22:50:30,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1081158.0, ans=0.125 2023-06-21 22:50:31,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1081158.0, ans=0.07 2023-06-21 22:51:10,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1081278.0, ans=0.125 2023-06-21 22:51:18,675 INFO [train.py:996] (2/4) Epoch 6, batch 27750, loss[loss=0.2639, simple_loss=0.3631, pruned_loss=0.08237, over 21237.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3149, pruned_loss=0.08613, over 4259895.19 frames. ], batch size: 548, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:51:23,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=16.16 vs. limit=15.0 2023-06-21 22:52:01,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.26 vs. 
limit=15.0 2023-06-21 22:52:34,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1081578.0, ans=0.125 2023-06-21 22:52:51,608 INFO [train.py:996] (2/4) Epoch 6, batch 27800, loss[loss=0.2095, simple_loss=0.2792, pruned_loss=0.06993, over 21453.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3143, pruned_loss=0.0866, over 4268348.54 frames. ], batch size: 211, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:52:53,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1081638.0, ans=0.125 2023-06-21 22:53:02,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.12 vs. limit=15.0 2023-06-21 22:53:12,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 2.907e+02 3.249e+02 3.877e+02 6.679e+02, threshold=6.497e+02, percent-clipped=1.0 2023-06-21 22:53:36,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1081758.0, ans=0.2 2023-06-21 22:54:14,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.74 vs. limit=12.0 2023-06-21 22:54:25,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-06-21 22:54:25,871 INFO [train.py:996] (2/4) Epoch 6, batch 27850, loss[loss=0.2659, simple_loss=0.3551, pruned_loss=0.08839, over 21757.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3148, pruned_loss=0.08805, over 4281216.66 frames. ], batch size: 298, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:54:59,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-21 22:55:10,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1082058.0, ans=0.125 2023-06-21 22:55:25,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1082118.0, ans=0.125 2023-06-21 22:55:25,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-21 22:55:45,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-21 22:55:54,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1082178.0, ans=0.125 2023-06-21 22:56:01,481 INFO [train.py:996] (2/4) Epoch 6, batch 27900, loss[loss=0.2392, simple_loss=0.3034, pruned_loss=0.08754, over 21233.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3205, pruned_loss=0.08766, over 4281922.23 frames. 
], batch size: 608, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:56:15,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1082238.0, ans=0.0 2023-06-21 22:56:27,608 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 2.965e+02 3.401e+02 4.272e+02 8.717e+02, threshold=6.802e+02, percent-clipped=4.0 2023-06-21 22:56:36,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.79 vs. limit=12.0 2023-06-21 22:57:06,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-21 22:57:21,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1082478.0, ans=0.0 2023-06-21 22:57:30,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1082478.0, ans=0.125 2023-06-21 22:57:42,212 INFO [train.py:996] (2/4) Epoch 6, batch 27950, loss[loss=0.2851, simple_loss=0.3655, pruned_loss=0.1024, over 21905.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3213, pruned_loss=0.08452, over 4285779.43 frames. ], batch size: 372, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:57:50,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1082538.0, ans=0.0 2023-06-21 22:58:51,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-21 22:59:02,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1082778.0, ans=0.0 2023-06-21 22:59:15,428 INFO [train.py:996] (2/4) Epoch 6, batch 28000, loss[loss=0.229, simple_loss=0.2972, pruned_loss=0.08038, over 19974.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3192, pruned_loss=0.08187, over 4286025.47 frames. ], batch size: 702, lr: 4.92e-03, grad_scale: 16.0 2023-06-21 22:59:39,644 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.03 vs. limit=6.0 2023-06-21 22:59:43,149 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.149e+02 2.927e+02 3.364e+02 4.265e+02 7.771e+02, threshold=6.727e+02, percent-clipped=2.0 2023-06-21 22:59:49,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1082898.0, ans=0.125 2023-06-21 23:00:09,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1082958.0, ans=0.2 2023-06-21 23:00:46,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1083078.0, ans=0.0 2023-06-21 23:00:50,682 INFO [train.py:996] (2/4) Epoch 6, batch 28050, loss[loss=0.22, simple_loss=0.2871, pruned_loss=0.07651, over 21616.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3165, pruned_loss=0.08331, over 4285313.36 frames. 
], batch size: 230, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:01:07,944 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:01:20,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1083198.0, ans=0.125 2023-06-21 23:01:29,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1083198.0, ans=0.025 2023-06-21 23:01:36,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1083258.0, ans=0.125 2023-06-21 23:01:50,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1083318.0, ans=0.09899494936611666 2023-06-21 23:01:58,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1083318.0, ans=0.0 2023-06-21 23:02:12,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1083378.0, ans=0.0 2023-06-21 23:02:19,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1083378.0, ans=0.0 2023-06-21 23:02:28,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-21 23:02:29,345 INFO [train.py:996] (2/4) Epoch 6, batch 28100, loss[loss=0.2286, simple_loss=0.2837, pruned_loss=0.08673, over 21515.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.317, pruned_loss=0.08396, over 4285142.79 frames. ], batch size: 414, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:03:00,985 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.119e+02 3.918e+02 4.692e+02 8.833e+02, threshold=7.836e+02, percent-clipped=5.0 2023-06-21 23:03:02,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1083498.0, ans=0.125 2023-06-21 23:03:29,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1083618.0, ans=0.1 2023-06-21 23:04:01,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1083738.0, ans=0.125 2023-06-21 23:04:02,319 INFO [train.py:996] (2/4) Epoch 6, batch 28150, loss[loss=0.2374, simple_loss=0.2942, pruned_loss=0.09032, over 14941.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3117, pruned_loss=0.08393, over 4279778.64 frames. ], batch size: 62, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:04:38,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1083798.0, ans=0.2 2023-06-21 23:04:47,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=22.5 2023-06-21 23:04:56,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1083918.0, ans=0.2 2023-06-21 23:05:06,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.48 vs. 
limit=15.0 2023-06-21 23:05:17,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1083978.0, ans=0.0 2023-06-21 23:05:40,588 INFO [train.py:996] (2/4) Epoch 6, batch 28200, loss[loss=0.2634, simple_loss=0.3281, pruned_loss=0.09933, over 21703.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3097, pruned_loss=0.08605, over 4284168.46 frames. ], batch size: 351, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:06:07,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.278e+02 3.112e+02 3.798e+02 4.464e+02 8.953e+02, threshold=7.596e+02, percent-clipped=1.0 2023-06-21 23:06:15,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1084158.0, ans=0.0 2023-06-21 23:06:27,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1084158.0, ans=0.125 2023-06-21 23:06:45,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.78 vs. limit=15.0 2023-06-21 23:07:14,387 INFO [train.py:996] (2/4) Epoch 6, batch 28250, loss[loss=0.2451, simple_loss=0.3005, pruned_loss=0.09488, over 21604.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3123, pruned_loss=0.08854, over 4272778.60 frames. ], batch size: 415, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:07:16,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=12.0 2023-06-21 23:07:18,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-21 23:08:20,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1084518.0, ans=0.125 2023-06-21 23:08:41,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1084578.0, ans=0.05 2023-06-21 23:08:54,450 INFO [train.py:996] (2/4) Epoch 6, batch 28300, loss[loss=0.1896, simple_loss=0.2783, pruned_loss=0.0505, over 21758.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3111, pruned_loss=0.08714, over 4265447.22 frames. 
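The scaling.py Whitening lines compare a per-module statistic against a scheduled limit (metric=... vs. limit=...). One plausible reading, assumed here rather than taken from scaling.py, is a measure of how far the group's feature covariance is from a multiple of the identity: the ratio sketched below equals 1.0 exactly for whitened features and grows as channels become correlated or unequal in scale.

```python
# Sketch of one way a "whitening" metric like the logged ones could be defined: for
# feature vectors x (frames x channels), compare the covariance to a multiple of the
# identity. This is an assumed reading of the logged metric, not the scaling.py code.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]          # (channels, channels) covariance
    d = cov.shape[0]
    # d * sum(eigenvalues^2) / (sum(eigenvalues))^2: 1.0 iff cov is proportional to I
    return (d * (cov * cov).sum() / cov.trace() ** 2).item()

x = torch.randn(10000, 256)                          # nearly white features
print(whitening_metric(x))                           # close to 1.0
print(whitening_metric(x @ torch.randn(256, 256)))   # correlated features -> much larger
```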
], batch size: 282, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:09:17,328 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.819e+02 3.236e+02 3.708e+02 8.201e+02, threshold=6.472e+02, percent-clipped=2.0 2023-06-21 23:09:25,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1084758.0, ans=0.0 2023-06-21 23:09:26,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1084758.0, ans=0.0 2023-06-21 23:09:50,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1084818.0, ans=0.2 2023-06-21 23:09:58,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1084818.0, ans=0.2 2023-06-21 23:10:06,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1084818.0, ans=0.125 2023-06-21 23:10:19,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1084878.0, ans=0.125 2023-06-21 23:10:28,259 INFO [train.py:996] (2/4) Epoch 6, batch 28350, loss[loss=0.1879, simple_loss=0.2595, pruned_loss=0.05814, over 21543.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3077, pruned_loss=0.08213, over 4256326.96 frames. ], batch size: 230, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:10:33,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1084938.0, ans=0.125 2023-06-21 23:10:48,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1084998.0, ans=0.125 2023-06-21 23:11:00,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1085058.0, ans=0.125 2023-06-21 23:11:02,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1085058.0, ans=0.05 2023-06-21 23:11:57,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1085178.0, ans=0.1 2023-06-21 23:12:00,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1085178.0, ans=0.125 2023-06-21 23:12:03,502 INFO [train.py:996] (2/4) Epoch 6, batch 28400, loss[loss=0.2125, simple_loss=0.2777, pruned_loss=0.0737, over 21464.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3043, pruned_loss=0.08075, over 4255145.55 frames. ], batch size: 441, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:12:26,112 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.897e+02 2.685e+02 3.251e+02 3.858e+02 5.974e+02, threshold=6.502e+02, percent-clipped=0.0 2023-06-21 23:13:34,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1085478.0, ans=0.125 2023-06-21 23:13:37,458 INFO [train.py:996] (2/4) Epoch 6, batch 28450, loss[loss=0.2597, simple_loss=0.3189, pruned_loss=0.1002, over 21642.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3084, pruned_loss=0.08475, over 4251926.67 frames. 
], batch size: 263, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:14:01,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1085598.0, ans=0.1 2023-06-21 23:14:22,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1085658.0, ans=0.125 2023-06-21 23:14:32,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=12.0 2023-06-21 23:14:51,621 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-21 23:15:10,661 INFO [train.py:996] (2/4) Epoch 6, batch 28500, loss[loss=0.2627, simple_loss=0.331, pruned_loss=0.09721, over 21430.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3127, pruned_loss=0.08802, over 4263742.23 frames. ], batch size: 194, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:15:23,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1085838.0, ans=0.125 2023-06-21 23:15:28,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1085898.0, ans=0.2 2023-06-21 23:15:37,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1085898.0, ans=0.125 2023-06-21 23:15:38,220 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.148e+02 3.464e+02 4.022e+02 7.400e+02, threshold=6.927e+02, percent-clipped=1.0 2023-06-21 23:15:48,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1085898.0, ans=0.125 2023-06-21 23:16:29,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1086078.0, ans=0.0 2023-06-21 23:16:45,731 INFO [train.py:996] (2/4) Epoch 6, batch 28550, loss[loss=0.3083, simple_loss=0.3883, pruned_loss=0.1142, over 21738.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3191, pruned_loss=0.09055, over 4268416.82 frames. ], batch size: 441, lr: 4.91e-03, grad_scale: 32.0 2023-06-21 23:16:46,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.90 vs. limit=12.0 2023-06-21 23:16:51,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1086138.0, ans=15.0 2023-06-21 23:16:57,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.34 vs. limit=15.0 2023-06-21 23:17:17,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.97 vs. 
limit=22.5 2023-06-21 23:17:31,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1086258.0, ans=0.1 2023-06-21 23:17:54,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1086318.0, ans=0.09899494936611666 2023-06-21 23:17:55,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1086318.0, ans=0.0 2023-06-21 23:18:19,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1086438.0, ans=0.025 2023-06-21 23:18:20,994 INFO [train.py:996] (2/4) Epoch 6, batch 28600, loss[loss=0.2474, simple_loss=0.3217, pruned_loss=0.08658, over 21654.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3258, pruned_loss=0.09304, over 4272496.74 frames. ], batch size: 230, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:18:32,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=15.0 2023-06-21 23:18:58,678 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.626e+02 3.164e+02 3.571e+02 4.573e+02 8.343e+02, threshold=7.141e+02, percent-clipped=3.0 2023-06-21 23:19:05,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-21 23:19:20,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1086558.0, ans=0.0 2023-06-21 23:19:59,186 INFO [train.py:996] (2/4) Epoch 6, batch 28650, loss[loss=0.2217, simple_loss=0.2823, pruned_loss=0.08057, over 21579.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3208, pruned_loss=0.09211, over 4271774.53 frames. ], batch size: 415, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:20:18,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1086798.0, ans=0.5 2023-06-21 23:20:41,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1086798.0, ans=0.125 2023-06-21 23:20:46,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1086858.0, ans=0.0 2023-06-21 23:21:12,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1086978.0, ans=0.125 2023-06-21 23:21:14,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1086978.0, ans=0.125 2023-06-21 23:21:38,617 INFO [train.py:996] (2/4) Epoch 6, batch 28700, loss[loss=0.275, simple_loss=0.3479, pruned_loss=0.1011, over 21897.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3212, pruned_loss=0.09359, over 4265932.93 frames. 
], batch size: 118, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:21:39,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1087038.0, ans=0.0 2023-06-21 23:21:40,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1087038.0, ans=0.0 2023-06-21 23:22:07,880 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.373e+02 3.119e+02 3.496e+02 4.060e+02 9.079e+02, threshold=6.992e+02, percent-clipped=1.0 2023-06-21 23:22:27,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1087158.0, ans=0.125 2023-06-21 23:22:30,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-21 23:23:09,078 INFO [train.py:996] (2/4) Epoch 6, batch 28750, loss[loss=0.2319, simple_loss=0.2922, pruned_loss=0.08577, over 21613.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3204, pruned_loss=0.09347, over 4273816.48 frames. ], batch size: 548, lr: 4.91e-03, grad_scale: 16.0 2023-06-21 23:23:59,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1087458.0, ans=0.1 2023-06-21 23:24:09,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1087518.0, ans=0.2 2023-06-21 23:24:43,702 INFO [train.py:996] (2/4) Epoch 6, batch 28800, loss[loss=0.2589, simple_loss=0.3278, pruned_loss=0.09496, over 21902.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3252, pruned_loss=0.09423, over 4273968.74 frames. ], batch size: 316, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:24:57,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1087638.0, ans=0.035 2023-06-21 23:25:05,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1087638.0, ans=0.125 2023-06-21 23:25:16,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.095e+02 2.957e+02 3.291e+02 3.824e+02 6.486e+02, threshold=6.582e+02, percent-clipped=0.0 2023-06-21 23:26:21,870 INFO [train.py:996] (2/4) Epoch 6, batch 28850, loss[loss=0.2622, simple_loss=0.33, pruned_loss=0.09726, over 21817.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3277, pruned_loss=0.09554, over 4274975.72 frames. 
], batch size: 441, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:26:46,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1087998.0, ans=0.0 2023-06-21 23:27:03,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1088058.0, ans=0.0 2023-06-21 23:27:03,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1088058.0, ans=0.0 2023-06-21 23:27:28,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1088118.0, ans=0.0 2023-06-21 23:27:29,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1088118.0, ans=0.05 2023-06-21 23:28:01,563 INFO [train.py:996] (2/4) Epoch 6, batch 28900, loss[loss=0.2806, simple_loss=0.3387, pruned_loss=0.1112, over 21613.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.3309, pruned_loss=0.09801, over 4281635.36 frames. ], batch size: 263, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:28:26,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.480e+02 3.260e+02 3.561e+02 4.329e+02 7.781e+02, threshold=7.122e+02, percent-clipped=1.0 2023-06-21 23:28:52,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-21 23:29:38,522 INFO [train.py:996] (2/4) Epoch 6, batch 28950, loss[loss=0.1863, simple_loss=0.2525, pruned_loss=0.0601, over 21071.00 frames. ], tot_loss[loss=0.2616, simple_loss=0.3297, pruned_loss=0.09675, over 4275150.89 frames. ], batch size: 143, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:30:05,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1088598.0, ans=0.2 2023-06-21 23:30:36,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1088718.0, ans=0.125 2023-06-21 23:31:09,156 INFO [train.py:996] (2/4) Epoch 6, batch 29000, loss[loss=0.3268, simple_loss=0.3799, pruned_loss=0.1368, over 21352.00 frames. ], tot_loss[loss=0.263, simple_loss=0.3334, pruned_loss=0.09632, over 4277118.85 frames. ], batch size: 507, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:31:13,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1088838.0, ans=0.125 2023-06-21 23:31:15,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-21 23:31:43,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.438e+02 3.241e+02 3.719e+02 4.877e+02 7.775e+02, threshold=7.438e+02, percent-clipped=3.0 2023-06-21 23:32:30,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-21 23:32:42,127 INFO [train.py:996] (2/4) Epoch 6, batch 29050, loss[loss=0.279, simple_loss=0.3404, pruned_loss=0.1088, over 22006.00 frames. ], tot_loss[loss=0.2615, simple_loss=0.332, pruned_loss=0.09548, over 4283151.85 frames. 
], batch size: 416, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:32:59,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1089138.0, ans=0.0 2023-06-21 23:33:23,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1089258.0, ans=0.1 2023-06-21 23:33:27,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-21 23:33:50,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1089318.0, ans=0.0 2023-06-21 23:34:11,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1089378.0, ans=0.0 2023-06-21 23:34:12,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1089378.0, ans=0.125 2023-06-21 23:34:15,074 INFO [train.py:996] (2/4) Epoch 6, batch 29100, loss[loss=0.215, simple_loss=0.2784, pruned_loss=0.07583, over 21561.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3233, pruned_loss=0.09324, over 4287578.89 frames. ], batch size: 391, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:34:27,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1089438.0, ans=0.125 2023-06-21 23:34:38,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1089498.0, ans=0.0 2023-06-21 23:34:49,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 2.936e+02 3.266e+02 3.955e+02 6.605e+02, threshold=6.533e+02, percent-clipped=0.0 2023-06-21 23:35:48,090 INFO [train.py:996] (2/4) Epoch 6, batch 29150, loss[loss=0.1929, simple_loss=0.2461, pruned_loss=0.0698, over 20678.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3205, pruned_loss=0.0913, over 4286765.94 frames. ], batch size: 608, lr: 4.90e-03, grad_scale: 16.0 2023-06-21 23:36:52,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1089918.0, ans=0.2 2023-06-21 23:37:12,324 INFO [train.py:996] (2/4) Epoch 6, batch 29200, loss[loss=0.2196, simple_loss=0.2797, pruned_loss=0.07975, over 21580.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3164, pruned_loss=0.09018, over 4288653.38 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:37:47,286 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.378e+02 2.885e+02 3.375e+02 4.210e+02 7.193e+02, threshold=6.750e+02, percent-clipped=2.0 2023-06-21 23:38:24,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1090218.0, ans=0.125 2023-06-21 23:38:24,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.73 vs. limit=12.0 2023-06-21 23:38:55,719 INFO [train.py:996] (2/4) Epoch 6, batch 29250, loss[loss=0.2719, simple_loss=0.3616, pruned_loss=0.09114, over 21577.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3153, pruned_loss=0.08752, over 4278067.21 frames. 
], batch size: 441, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:39:00,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1090338.0, ans=0.0 2023-06-21 23:39:08,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1090338.0, ans=0.1 2023-06-21 23:40:29,842 INFO [train.py:996] (2/4) Epoch 6, batch 29300, loss[loss=0.2474, simple_loss=0.2966, pruned_loss=0.09911, over 21148.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3162, pruned_loss=0.08614, over 4276802.15 frames. ], batch size: 143, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:40:37,233 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.32 vs. limit=15.0 2023-06-21 23:40:45,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1090698.0, ans=0.125 2023-06-21 23:41:00,630 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 2.989e+02 3.718e+02 4.652e+02 8.892e+02, threshold=7.436e+02, percent-clipped=6.0 2023-06-21 23:41:22,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1090758.0, ans=0.125 2023-06-21 23:41:24,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1090818.0, ans=0.125 2023-06-21 23:41:40,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1090878.0, ans=0.1 2023-06-21 23:41:57,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1090878.0, ans=0.125 2023-06-21 23:42:00,613 INFO [train.py:996] (2/4) Epoch 6, batch 29350, loss[loss=0.2107, simple_loss=0.2826, pruned_loss=0.06937, over 21786.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.313, pruned_loss=0.08582, over 4276012.85 frames. ], batch size: 372, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:42:26,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1090998.0, ans=0.2 2023-06-21 23:42:37,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1090998.0, ans=0.125 2023-06-21 23:42:57,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.89 vs. limit=10.0 2023-06-21 23:43:32,543 INFO [train.py:996] (2/4) Epoch 6, batch 29400, loss[loss=0.1863, simple_loss=0.2586, pruned_loss=0.05703, over 21766.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3119, pruned_loss=0.083, over 4275773.93 frames. ], batch size: 282, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:43:36,540 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.41 vs. 
limit=6.0 2023-06-21 23:43:53,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1091298.0, ans=0.1 2023-06-21 23:44:03,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.322e+02 2.776e+02 3.211e+02 3.938e+02 7.454e+02, threshold=6.422e+02, percent-clipped=1.0 2023-06-21 23:44:07,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1091358.0, ans=0.125 2023-06-21 23:44:09,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-21 23:44:39,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1091418.0, ans=0.1 2023-06-21 23:44:58,835 INFO [train.py:996] (2/4) Epoch 6, batch 29450, loss[loss=0.3082, simple_loss=0.3757, pruned_loss=0.1204, over 21618.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3105, pruned_loss=0.08275, over 4268399.76 frames. ], batch size: 389, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:45:11,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1091538.0, ans=0.125 2023-06-21 23:45:14,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1091538.0, ans=0.0 2023-06-21 23:45:47,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1091718.0, ans=0.125 2023-06-21 23:46:08,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1091718.0, ans=0.0 2023-06-21 23:46:14,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.84 vs. limit=15.0 2023-06-21 23:46:26,836 INFO [train.py:996] (2/4) Epoch 6, batch 29500, loss[loss=0.2581, simple_loss=0.3192, pruned_loss=0.09847, over 21855.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3152, pruned_loss=0.08658, over 4277122.31 frames. ], batch size: 298, lr: 4.90e-03, grad_scale: 32.0 2023-06-21 23:47:01,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.458e+02 2.999e+02 3.395e+02 3.971e+02 6.244e+02, threshold=6.790e+02, percent-clipped=0.0 2023-06-21 23:48:05,511 INFO [train.py:996] (2/4) Epoch 6, batch 29550, loss[loss=0.2178, simple_loss=0.2839, pruned_loss=0.07587, over 21763.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3156, pruned_loss=0.08863, over 4279873.75 frames. 
], batch size: 247, lr: 4.89e-03, grad_scale: 32.0 2023-06-21 23:48:56,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1092318.0, ans=0.125 2023-06-21 23:49:23,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1092378.0, ans=0.125 2023-06-21 23:49:26,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1092378.0, ans=0.125 2023-06-21 23:49:37,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1092378.0, ans=0.125 2023-06-21 23:49:44,844 INFO [train.py:996] (2/4) Epoch 6, batch 29600, loss[loss=0.3545, simple_loss=0.4287, pruned_loss=0.1401, over 21289.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.323, pruned_loss=0.09165, over 4282679.72 frames. ], batch size: 548, lr: 4.89e-03, grad_scale: 32.0 2023-06-21 23:49:57,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1092438.0, ans=0.0 2023-06-21 23:50:11,744 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.415e+02 3.056e+02 3.521e+02 4.458e+02 7.696e+02, threshold=7.042e+02, percent-clipped=3.0 2023-06-21 23:50:19,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1092558.0, ans=0.1 2023-06-21 23:50:28,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1092558.0, ans=0.125 2023-06-21 23:50:55,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.35 vs. limit=15.0 2023-06-21 23:51:06,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1092678.0, ans=0.0 2023-06-21 23:51:17,479 INFO [train.py:996] (2/4) Epoch 6, batch 29650, loss[loss=0.3063, simple_loss=0.3563, pruned_loss=0.1282, over 21578.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3211, pruned_loss=0.08803, over 4282397.38 frames. ], batch size: 471, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:52:08,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1092858.0, ans=0.125 2023-06-21 23:52:09,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-21 23:52:22,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1092918.0, ans=0.125 2023-06-21 23:52:50,739 INFO [train.py:996] (2/4) Epoch 6, batch 29700, loss[loss=0.2515, simple_loss=0.3263, pruned_loss=0.08835, over 21757.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3204, pruned_loss=0.08795, over 4284506.62 frames. 
], batch size: 112, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:53:19,202 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.085e+02 2.737e+02 3.024e+02 3.717e+02 5.941e+02, threshold=6.048e+02, percent-clipped=0.0 2023-06-21 23:53:24,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1093158.0, ans=0.125 2023-06-21 23:53:52,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1093218.0, ans=0.125 2023-06-21 23:54:24,078 INFO [train.py:996] (2/4) Epoch 6, batch 29750, loss[loss=0.2661, simple_loss=0.3505, pruned_loss=0.09084, over 21034.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3252, pruned_loss=0.08768, over 4283860.41 frames. ], batch size: 607, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:54:53,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1093458.0, ans=0.125 2023-06-21 23:54:59,565 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=15.0 2023-06-21 23:55:06,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1093458.0, ans=0.125 2023-06-21 23:55:09,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1093458.0, ans=0.0 2023-06-21 23:55:49,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1093578.0, ans=10.0 2023-06-21 23:55:52,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1093578.0, ans=0.125 2023-06-21 23:55:56,888 INFO [train.py:996] (2/4) Epoch 6, batch 29800, loss[loss=0.2247, simple_loss=0.3016, pruned_loss=0.07386, over 21879.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3255, pruned_loss=0.08749, over 4288949.17 frames. ], batch size: 332, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:55:57,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1093638.0, ans=0.5 2023-06-21 23:56:03,739 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.97 vs. limit=6.0 2023-06-21 23:56:18,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1093698.0, ans=0.125 2023-06-21 23:56:21,970 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.57 vs. limit=15.0 2023-06-21 23:56:25,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.253e+02 2.711e+02 3.022e+02 3.723e+02 5.120e+02, threshold=6.044e+02, percent-clipped=0.0 2023-06-21 23:57:12,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1093878.0, ans=0.0 2023-06-21 23:57:21,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=15.0 2023-06-21 23:57:29,986 INFO [train.py:996] (2/4) Epoch 6, batch 29850, loss[loss=0.2404, simple_loss=0.3558, pruned_loss=0.06249, over 19750.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3216, pruned_loss=0.08546, over 4282280.22 frames. ], batch size: 703, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:57:33,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1093938.0, ans=0.0 2023-06-21 23:57:44,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1093998.0, ans=0.1 2023-06-21 23:58:05,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-21 23:58:26,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1094118.0, ans=0.1 2023-06-21 23:58:30,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1094118.0, ans=0.125 2023-06-21 23:59:02,888 INFO [train.py:996] (2/4) Epoch 6, batch 29900, loss[loss=0.2682, simple_loss=0.3264, pruned_loss=0.105, over 21041.00 frames. ], tot_loss[loss=0.247, simple_loss=0.32, pruned_loss=0.08697, over 4285608.97 frames. ], batch size: 608, lr: 4.89e-03, grad_scale: 8.0 2023-06-21 23:59:07,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1094238.0, ans=0.0 2023-06-21 23:59:25,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1094298.0, ans=0.125 2023-06-21 23:59:26,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-21 23:59:36,082 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.940e+02 3.118e+02 4.055e+02 5.716e+02 1.068e+03, threshold=8.110e+02, percent-clipped=21.0 2023-06-22 00:00:13,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1094418.0, ans=0.125 2023-06-22 00:00:37,268 INFO [train.py:996] (2/4) Epoch 6, batch 29950, loss[loss=0.2539, simple_loss=0.3251, pruned_loss=0.09139, over 21376.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3241, pruned_loss=0.09109, over 4280289.45 frames. ], batch size: 549, lr: 4.89e-03, grad_scale: 8.0 2023-06-22 00:00:44,592 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.70 vs. limit=15.0 2023-06-22 00:00:51,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1094598.0, ans=0.05 2023-06-22 00:01:02,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1094598.0, ans=0.125 2023-06-22 00:01:44,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1094718.0, ans=0.125 2023-06-22 00:01:50,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.52 vs. 
limit=15.0 2023-06-22 00:02:00,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1094778.0, ans=0.1 2023-06-22 00:02:11,810 INFO [train.py:996] (2/4) Epoch 6, batch 30000, loss[loss=0.2377, simple_loss=0.336, pruned_loss=0.06975, over 21812.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3261, pruned_loss=0.09163, over 4283015.24 frames. ], batch size: 371, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:02:11,811 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 00:02:30,096 INFO [train.py:1028] (2/4) Epoch 6, validation: loss=0.2467, simple_loss=0.3478, pruned_loss=0.07276, over 1796401.00 frames. 2023-06-22 00:02:30,096 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 00:03:14,300 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.137e+02 2.660e+02 3.036e+02 3.460e+02 6.733e+02, threshold=6.073e+02, percent-clipped=0.0 2023-06-22 00:03:40,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1095018.0, ans=0.09899494936611666 2023-06-22 00:03:41,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1095018.0, ans=0.2 2023-06-22 00:03:42,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-22 00:03:43,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-06-22 00:04:17,402 INFO [train.py:996] (2/4) Epoch 6, batch 30050, loss[loss=0.2429, simple_loss=0.3255, pruned_loss=0.08014, over 21424.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3284, pruned_loss=0.08749, over 4281811.39 frames. ], batch size: 194, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:04:39,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1095198.0, ans=0.125 2023-06-22 00:05:15,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1095318.0, ans=0.125 2023-06-22 00:05:50,552 INFO [train.py:996] (2/4) Epoch 6, batch 30100, loss[loss=0.2217, simple_loss=0.2782, pruned_loss=0.08261, over 21175.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3247, pruned_loss=0.08711, over 4269813.20 frames. ], batch size: 159, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:05:57,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.35 vs. 
limit=10.0 2023-06-22 00:06:24,501 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.037e+02 3.799e+02 4.739e+02 8.498e+02, threshold=7.598e+02, percent-clipped=11.0 2023-06-22 00:06:46,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1095618.0, ans=0.125 2023-06-22 00:07:04,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1095678.0, ans=0.0 2023-06-22 00:07:06,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1095678.0, ans=0.2 2023-06-22 00:07:25,730 INFO [train.py:996] (2/4) Epoch 6, batch 30150, loss[loss=0.2876, simple_loss=0.3486, pruned_loss=0.1133, over 21433.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3215, pruned_loss=0.08843, over 4264213.43 frames. ], batch size: 471, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:08:50,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1095978.0, ans=0.1 2023-06-22 00:09:07,436 INFO [train.py:996] (2/4) Epoch 6, batch 30200, loss[loss=0.3096, simple_loss=0.3863, pruned_loss=0.1165, over 21427.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3235, pruned_loss=0.08743, over 4268173.38 frames. ], batch size: 507, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:09:46,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.987e+02 3.001e+02 3.567e+02 4.107e+02 7.558e+02, threshold=7.134e+02, percent-clipped=0.0 2023-06-22 00:10:08,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1096218.0, ans=0.0 2023-06-22 00:10:43,115 INFO [train.py:996] (2/4) Epoch 6, batch 30250, loss[loss=0.3177, simple_loss=0.4124, pruned_loss=0.1115, over 21784.00 frames. ], tot_loss[loss=0.2564, simple_loss=0.3328, pruned_loss=0.08997, over 4268280.67 frames. ], batch size: 332, lr: 4.89e-03, grad_scale: 16.0 2023-06-22 00:11:22,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1096458.0, ans=0.5 2023-06-22 00:11:45,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1096518.0, ans=0.07 2023-06-22 00:12:17,395 INFO [train.py:996] (2/4) Epoch 6, batch 30300, loss[loss=0.2162, simple_loss=0.2779, pruned_loss=0.0773, over 21234.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3307, pruned_loss=0.09034, over 4273839.82 frames. 
], batch size: 549, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:12:17,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1096638.0, ans=0.2 2023-06-22 00:12:24,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1096638.0, ans=0.0 2023-06-22 00:12:31,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1096638.0, ans=0.125 2023-06-22 00:13:00,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.155e+02 3.767e+02 4.351e+02 8.059e+02, threshold=7.534e+02, percent-clipped=2.0 2023-06-22 00:13:01,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1096758.0, ans=0.125 2023-06-22 00:13:09,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-22 00:13:12,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1096758.0, ans=0.1 2023-06-22 00:13:19,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-06-22 00:13:31,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1096818.0, ans=0.0 2023-06-22 00:13:46,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1096878.0, ans=0.1 2023-06-22 00:14:03,143 INFO [train.py:996] (2/4) Epoch 6, batch 30350, loss[loss=0.2707, simple_loss=0.3228, pruned_loss=0.1093, over 20027.00 frames. ], tot_loss[loss=0.2597, simple_loss=0.3334, pruned_loss=0.09297, over 4270694.92 frames. ], batch size: 702, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:14:05,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1096938.0, ans=0.125 2023-06-22 00:15:05,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1097178.0, ans=0.2 2023-06-22 00:15:06,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1097178.0, ans=0.0 2023-06-22 00:15:21,233 INFO [train.py:996] (2/4) Epoch 6, batch 30400, loss[loss=0.2406, simple_loss=0.2907, pruned_loss=0.09528, over 20199.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3254, pruned_loss=0.09055, over 4247358.63 frames. ], batch size: 703, lr: 4.88e-03, grad_scale: 32.0 2023-06-22 00:15:28,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1097238.0, ans=0.125 2023-06-22 00:15:36,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.73 vs. 
limit=12.0 2023-06-22 00:15:39,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1097298.0, ans=0.07 2023-06-22 00:15:50,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.186e+02 3.413e+02 4.097e+02 5.278e+02 1.616e+03, threshold=8.194e+02, percent-clipped=3.0 2023-06-22 00:15:57,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1097358.0, ans=0.0 2023-06-22 00:16:10,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.39 vs. limit=15.0 2023-06-22 00:16:39,257 INFO [train.py:996] (2/4) Epoch 6, batch 30450, loss[loss=0.3425, simple_loss=0.4482, pruned_loss=0.1184, over 19770.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3268, pruned_loss=0.08993, over 4190999.28 frames. ], batch size: 702, lr: 4.88e-03, grad_scale: 16.0 2023-06-22 00:17:00,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1097598.0, ans=0.0 2023-06-22 00:17:16,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1097658.0, ans=0.0 2023-06-22 00:17:30,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1097718.0, ans=0.1 2023-06-22 00:17:31,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1097718.0, ans=0.0 2023-06-22 00:19:20,712 INFO [train.py:996] (2/4) Epoch 7, batch 0, loss[loss=0.2686, simple_loss=0.3335, pruned_loss=0.1018, over 21852.00 frames. ], tot_loss[loss=0.2686, simple_loss=0.3335, pruned_loss=0.1018, over 21852.00 frames. ], batch size: 107, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:19:20,713 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 00:19:38,899 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2422, simple_loss=0.3486, pruned_loss=0.06787, over 1796401.00 frames. 2023-06-22 00:19:38,899 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 00:20:26,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 4.648e+02 5.934e+02 9.527e+02 2.892e+03, threshold=1.187e+03, percent-clipped=31.0 2023-06-22 00:21:02,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1098042.0, ans=0.125 2023-06-22 00:21:07,561 INFO [train.py:996] (2/4) Epoch 7, batch 50, loss[loss=0.2735, simple_loss=0.3905, pruned_loss=0.07829, over 21724.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3291, pruned_loss=0.08923, over 960586.14 frames. ], batch size: 247, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:21:40,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1098162.0, ans=0.125 2023-06-22 00:21:48,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1098222.0, ans=0.125 2023-06-22 00:22:43,748 INFO [train.py:996] (2/4) Epoch 7, batch 100, loss[loss=0.2691, simple_loss=0.3662, pruned_loss=0.086, over 21390.00 frames. ], tot_loss[loss=0.2632, simple_loss=0.3441, pruned_loss=0.09113, over 1683309.62 frames. 
], batch size: 194, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:23:34,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.53 vs. limit=15.0 2023-06-22 00:23:37,732 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.096e+02 2.827e+02 3.336e+02 3.937e+02 6.913e+02, threshold=6.673e+02, percent-clipped=0.0 2023-06-22 00:24:14,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1098642.0, ans=0.125 2023-06-22 00:24:19,912 INFO [train.py:996] (2/4) Epoch 7, batch 150, loss[loss=0.234, simple_loss=0.3046, pruned_loss=0.08168, over 21217.00 frames. ], tot_loss[loss=0.2635, simple_loss=0.344, pruned_loss=0.09146, over 2248453.96 frames. ], batch size: 159, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:24:32,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1098702.0, ans=0.0 2023-06-22 00:24:35,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1098702.0, ans=0.125 2023-06-22 00:25:12,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1098822.0, ans=0.0 2023-06-22 00:25:34,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1098882.0, ans=0.0 2023-06-22 00:25:40,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.96 vs. limit=15.0 2023-06-22 00:25:51,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1098942.0, ans=0.125 2023-06-22 00:25:57,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1099002.0, ans=0.125 2023-06-22 00:25:58,212 INFO [train.py:996] (2/4) Epoch 7, batch 200, loss[loss=0.2628, simple_loss=0.351, pruned_loss=0.08726, over 21728.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3412, pruned_loss=0.0905, over 2688711.23 frames. ], batch size: 351, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:26:36,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1099062.0, ans=0.0 2023-06-22 00:26:56,358 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.932e+02 3.409e+02 3.929e+02 8.481e+02, threshold=6.818e+02, percent-clipped=3.0 2023-06-22 00:26:56,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1099122.0, ans=0.0 2023-06-22 00:27:36,477 INFO [train.py:996] (2/4) Epoch 7, batch 250, loss[loss=0.2204, simple_loss=0.2844, pruned_loss=0.07825, over 21779.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3361, pruned_loss=0.08751, over 3041440.76 frames. ], batch size: 124, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:28:12,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.34 vs. 
limit=22.5 2023-06-22 00:28:33,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1099422.0, ans=0.2 2023-06-22 00:29:00,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-06-22 00:29:14,501 INFO [train.py:996] (2/4) Epoch 7, batch 300, loss[loss=0.2369, simple_loss=0.3015, pruned_loss=0.08616, over 21329.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3292, pruned_loss=0.08709, over 3304776.92 frames. ], batch size: 176, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:29:30,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1099602.0, ans=0.1 2023-06-22 00:29:35,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1099662.0, ans=0.125 2023-06-22 00:30:00,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1099722.0, ans=0.1 2023-06-22 00:30:11,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.248e+02 2.968e+02 3.407e+02 3.987e+02 5.179e+02, threshold=6.813e+02, percent-clipped=0.0 2023-06-22 00:30:48,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1099842.0, ans=0.125 2023-06-22 00:30:52,959 INFO [train.py:996] (2/4) Epoch 7, batch 350, loss[loss=0.2036, simple_loss=0.2628, pruned_loss=0.07218, over 21557.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3225, pruned_loss=0.08701, over 3525738.83 frames. ], batch size: 247, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:31:07,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=15.0 2023-06-22 00:31:20,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1099962.0, ans=0.1 2023-06-22 00:31:44,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.85 vs. limit=8.0 2023-06-22 00:31:59,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1100082.0, ans=0.125 2023-06-22 00:32:02,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1100082.0, ans=0.1 2023-06-22 00:32:17,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1100142.0, ans=0.125 2023-06-22 00:32:36,677 INFO [train.py:996] (2/4) Epoch 7, batch 400, loss[loss=0.2081, simple_loss=0.2751, pruned_loss=0.07056, over 21586.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3169, pruned_loss=0.08615, over 3681381.96 frames. ], batch size: 332, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:33:07,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.67 vs. 
limit=15.0 2023-06-22 00:33:13,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1100262.0, ans=0.1 2023-06-22 00:33:29,206 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.277e+02 3.211e+02 3.756e+02 4.853e+02 8.203e+02, threshold=7.513e+02, percent-clipped=4.0 2023-06-22 00:33:34,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1100382.0, ans=0.125 2023-06-22 00:33:41,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1100382.0, ans=0.125 2023-06-22 00:33:43,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-22 00:33:55,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1100442.0, ans=0.0 2023-06-22 00:34:15,908 INFO [train.py:996] (2/4) Epoch 7, batch 450, loss[loss=0.284, simple_loss=0.3536, pruned_loss=0.1072, over 21893.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3172, pruned_loss=0.08573, over 3810179.23 frames. ], batch size: 316, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:34:19,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1100502.0, ans=0.1 2023-06-22 00:34:35,127 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:34:38,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1100562.0, ans=0.125 2023-06-22 00:35:16,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1100682.0, ans=0.0 2023-06-22 00:35:30,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1100682.0, ans=0.125 2023-06-22 00:35:35,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1100742.0, ans=0.0 2023-06-22 00:35:53,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-22 00:36:00,500 INFO [train.py:996] (2/4) Epoch 7, batch 500, loss[loss=0.2937, simple_loss=0.3834, pruned_loss=0.102, over 21850.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3173, pruned_loss=0.08515, over 3903195.49 frames. ], batch size: 371, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:36:14,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=22.5 2023-06-22 00:36:36,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1100922.0, ans=0.125 2023-06-22 00:36:40,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. 
limit=6.0 2023-06-22 00:36:50,807 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.217e+02 2.984e+02 3.762e+02 4.525e+02 7.787e+02, threshold=7.525e+02, percent-clipped=1.0 2023-06-22 00:36:51,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1100922.0, ans=0.125 2023-06-22 00:36:57,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1100982.0, ans=0.1 2023-06-22 00:37:12,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.36 vs. limit=10.0 2023-06-22 00:37:16,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1101042.0, ans=0.125 2023-06-22 00:37:18,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1101042.0, ans=0.125 2023-06-22 00:37:42,355 INFO [train.py:996] (2/4) Epoch 7, batch 550, loss[loss=0.1912, simple_loss=0.2824, pruned_loss=0.04999, over 21159.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3142, pruned_loss=0.08343, over 3977635.97 frames. ], batch size: 548, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:39:20,770 INFO [train.py:996] (2/4) Epoch 7, batch 600, loss[loss=0.2585, simple_loss=0.3503, pruned_loss=0.08339, over 21649.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.319, pruned_loss=0.08292, over 4048362.96 frames. ], batch size: 263, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:39:35,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1101462.0, ans=10.0 2023-06-22 00:39:55,135 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:39:59,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1101522.0, ans=0.125 2023-06-22 00:40:10,557 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.213e+02 2.944e+02 3.479e+02 4.173e+02 5.834e+02, threshold=6.959e+02, percent-clipped=0.0 2023-06-22 00:40:10,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1101522.0, ans=0.125 2023-06-22 00:40:33,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1101642.0, ans=0.0 2023-06-22 00:40:33,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1101642.0, ans=0.2 2023-06-22 00:41:00,139 INFO [train.py:996] (2/4) Epoch 7, batch 650, loss[loss=0.2643, simple_loss=0.3216, pruned_loss=0.1035, over 21859.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3212, pruned_loss=0.08357, over 4094622.59 frames. 
], batch size: 414, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:41:03,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1101702.0, ans=0.0 2023-06-22 00:41:36,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1101822.0, ans=0.95 2023-06-22 00:41:47,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1101822.0, ans=22.5 2023-06-22 00:42:05,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1101882.0, ans=0.1 2023-06-22 00:42:19,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1101942.0, ans=0.0 2023-06-22 00:42:19,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1101942.0, ans=0.2 2023-06-22 00:42:25,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1101942.0, ans=10.0 2023-06-22 00:42:32,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1101942.0, ans=0.2 2023-06-22 00:42:38,116 INFO [train.py:996] (2/4) Epoch 7, batch 700, loss[loss=0.2504, simple_loss=0.3185, pruned_loss=0.09122, over 21917.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3228, pruned_loss=0.08527, over 4144258.41 frames. ], batch size: 351, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:43:03,800 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.68 vs. limit=6.0 2023-06-22 00:43:27,976 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.306e+02 4.325e+02 5.540e+02 9.236e+02, threshold=8.651e+02, percent-clipped=10.0 2023-06-22 00:43:28,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1102122.0, ans=0.0 2023-06-22 00:43:42,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1102182.0, ans=0.125 2023-06-22 00:44:16,055 INFO [train.py:996] (2/4) Epoch 7, batch 750, loss[loss=0.2474, simple_loss=0.3056, pruned_loss=0.09457, over 15024.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3237, pruned_loss=0.08651, over 4168004.92 frames. ], batch size: 60, lr: 4.48e-03, grad_scale: 16.0 2023-06-22 00:44:53,543 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-22 00:44:53,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-22 00:45:08,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.70 vs. limit=10.0 2023-06-22 00:45:53,746 INFO [train.py:996] (2/4) Epoch 7, batch 800, loss[loss=0.2781, simple_loss=0.3672, pruned_loss=0.0945, over 21716.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.321, pruned_loss=0.08678, over 4194130.90 frames. 
], batch size: 298, lr: 4.48e-03, grad_scale: 32.0 2023-06-22 00:46:03,990 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-22 00:46:14,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1102662.0, ans=0.125 2023-06-22 00:46:17,766 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 00:46:20,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1102662.0, ans=0.035 2023-06-22 00:46:42,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.348e+02 3.353e+02 3.931e+02 5.238e+02 1.056e+03, threshold=7.862e+02, percent-clipped=1.0 2023-06-22 00:46:44,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1102782.0, ans=0.125 2023-06-22 00:46:58,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1102782.0, ans=0.125 2023-06-22 00:47:25,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1102842.0, ans=0.0 2023-06-22 00:47:31,043 INFO [train.py:996] (2/4) Epoch 7, batch 850, loss[loss=0.2544, simple_loss=0.3138, pruned_loss=0.09752, over 21945.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3198, pruned_loss=0.08701, over 4213600.49 frames. ], batch size: 316, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:47:47,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1102902.0, ans=0.125 2023-06-22 00:47:49,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1102902.0, ans=0.0 2023-06-22 00:47:52,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1102962.0, ans=0.0 2023-06-22 00:48:07,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1103022.0, ans=0.125 2023-06-22 00:49:04,175 INFO [train.py:996] (2/4) Epoch 7, batch 900, loss[loss=0.2605, simple_loss=0.3073, pruned_loss=0.1068, over 21204.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3142, pruned_loss=0.08568, over 4232382.69 frames. ], batch size: 176, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:49:53,740 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.119e+02 2.788e+02 3.273e+02 3.960e+02 6.263e+02, threshold=6.546e+02, percent-clipped=0.0 2023-06-22 00:50:48,001 INFO [train.py:996] (2/4) Epoch 7, batch 950, loss[loss=0.3458, simple_loss=0.3898, pruned_loss=0.1509, over 21449.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3134, pruned_loss=0.08619, over 4243814.23 frames. 
], batch size: 507, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:51:09,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1103562.0, ans=0.09899494936611666 2023-06-22 00:52:09,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1103742.0, ans=0.2 2023-06-22 00:52:26,737 INFO [train.py:996] (2/4) Epoch 7, batch 1000, loss[loss=0.228, simple_loss=0.2983, pruned_loss=0.07884, over 21755.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3138, pruned_loss=0.08626, over 4256250.15 frames. ], batch size: 247, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:52:42,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1103862.0, ans=0.125 2023-06-22 00:52:45,053 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-22 00:53:01,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1103922.0, ans=0.0 2023-06-22 00:53:25,785 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 2.971e+02 3.488e+02 4.258e+02 7.403e+02, threshold=6.977e+02, percent-clipped=1.0 2023-06-22 00:53:40,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1103982.0, ans=0.1 2023-06-22 00:53:48,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104042.0, ans=0.1 2023-06-22 00:54:07,777 INFO [train.py:996] (2/4) Epoch 7, batch 1050, loss[loss=0.2491, simple_loss=0.3241, pruned_loss=0.08709, over 21822.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3103, pruned_loss=0.08468, over 4262121.85 frames. ], batch size: 247, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:54:14,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1104102.0, ans=0.125 2023-06-22 00:54:15,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104102.0, ans=0.1 2023-06-22 00:54:34,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1104162.0, ans=0.125 2023-06-22 00:55:47,353 INFO [train.py:996] (2/4) Epoch 7, batch 1100, loss[loss=0.2387, simple_loss=0.3056, pruned_loss=0.08587, over 21415.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3107, pruned_loss=0.08497, over 4263227.49 frames. 
], batch size: 194, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:55:49,323 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.346e-02 2023-06-22 00:56:13,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1104462.0, ans=0.1 2023-06-22 00:56:40,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1104522.0, ans=0.1 2023-06-22 00:56:48,281 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 2.865e+02 3.379e+02 3.962e+02 8.205e+02, threshold=6.758e+02, percent-clipped=2.0 2023-06-22 00:56:50,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1104582.0, ans=0.125 2023-06-22 00:57:27,375 INFO [train.py:996] (2/4) Epoch 7, batch 1150, loss[loss=0.2577, simple_loss=0.332, pruned_loss=0.09166, over 21299.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.312, pruned_loss=0.08449, over 4265358.12 frames. ], batch size: 131, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 00:57:45,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1104702.0, ans=0.125 2023-06-22 00:58:08,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1104762.0, ans=0.125 2023-06-22 00:58:13,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1104822.0, ans=0.0 2023-06-22 00:58:33,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1104882.0, ans=0.125 2023-06-22 00:59:05,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1104942.0, ans=0.0 2023-06-22 00:59:08,543 INFO [train.py:996] (2/4) Epoch 7, batch 1200, loss[loss=0.323, simple_loss=0.3965, pruned_loss=0.1248, over 21927.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3154, pruned_loss=0.08557, over 4276936.17 frames. ], batch size: 372, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 00:59:13,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1105002.0, ans=0.0 2023-06-22 00:59:45,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1105062.0, ans=0.125 2023-06-22 00:59:51,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1105062.0, ans=0.0 2023-06-22 01:00:08,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.38 vs. limit=22.5 2023-06-22 01:00:10,589 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 3.150e+02 3.657e+02 4.212e+02 7.667e+02, threshold=7.313e+02, percent-clipped=2.0 2023-06-22 01:00:41,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1105242.0, ans=0.1 2023-06-22 01:00:48,905 INFO [train.py:996] (2/4) Epoch 7, batch 1250, loss[loss=0.2902, simple_loss=0.3495, pruned_loss=0.1154, over 21645.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3182, pruned_loss=0.08657, over 4280849.49 frames. 
], batch size: 507, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:01:10,831 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=15.0 2023-06-22 01:02:23,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-22 01:02:28,456 INFO [train.py:996] (2/4) Epoch 7, batch 1300, loss[loss=0.41, simple_loss=0.4608, pruned_loss=0.1796, over 21524.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3205, pruned_loss=0.08665, over 4275239.40 frames. ], batch size: 507, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:03:03,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1105662.0, ans=0.125 2023-06-22 01:03:17,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1105722.0, ans=0.2 2023-06-22 01:03:36,717 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.032e+02 3.671e+02 4.552e+02 8.321e+02, threshold=7.341e+02, percent-clipped=3.0 2023-06-22 01:04:13,367 INFO [train.py:996] (2/4) Epoch 7, batch 1350, loss[loss=0.2189, simple_loss=0.2939, pruned_loss=0.07192, over 21775.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3199, pruned_loss=0.08613, over 4280061.93 frames. ], batch size: 102, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:04:17,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1105902.0, ans=0.125 2023-06-22 01:04:39,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-22 01:05:05,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-22 01:05:19,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1106082.0, ans=0.125 2023-06-22 01:05:35,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1106142.0, ans=0.0 2023-06-22 01:05:42,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-22 01:05:51,561 INFO [train.py:996] (2/4) Epoch 7, batch 1400, loss[loss=0.1974, simple_loss=0.2697, pruned_loss=0.06256, over 21207.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.318, pruned_loss=0.08626, over 4280940.72 frames. 
], batch size: 607, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:06:23,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1106262.0, ans=0.2 2023-06-22 01:06:40,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1106322.0, ans=0.125 2023-06-22 01:06:54,466 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.431e+02 3.168e+02 3.476e+02 4.011e+02 7.450e+02, threshold=6.951e+02, percent-clipped=1.0 2023-06-22 01:06:59,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1106382.0, ans=0.125 2023-06-22 01:07:05,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1106382.0, ans=0.125 2023-06-22 01:07:26,209 INFO [train.py:996] (2/4) Epoch 7, batch 1450, loss[loss=0.2714, simple_loss=0.3432, pruned_loss=0.09978, over 21799.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3185, pruned_loss=0.08629, over 4286135.63 frames. ], batch size: 124, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:07:34,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1106502.0, ans=0.0 2023-06-22 01:07:47,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1106502.0, ans=0.125 2023-06-22 01:07:49,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1106502.0, ans=0.1 2023-06-22 01:07:55,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1106562.0, ans=0.0 2023-06-22 01:08:37,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1106682.0, ans=0.125 2023-06-22 01:08:47,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1106682.0, ans=0.125 2023-06-22 01:09:00,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1106742.0, ans=0.125 2023-06-22 01:09:11,597 INFO [train.py:996] (2/4) Epoch 7, batch 1500, loss[loss=0.2231, simple_loss=0.2851, pruned_loss=0.08057, over 21441.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3211, pruned_loss=0.08821, over 4289463.55 frames. ], batch size: 389, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:09:23,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1106802.0, ans=0.125 2023-06-22 01:10:15,557 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.335e+02 2.952e+02 3.357e+02 3.782e+02 8.287e+02, threshold=6.713e+02, percent-clipped=2.0 2023-06-22 01:11:03,388 INFO [train.py:996] (2/4) Epoch 7, batch 1550, loss[loss=0.2928, simple_loss=0.3578, pruned_loss=0.1139, over 20731.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3201, pruned_loss=0.08774, over 4285832.68 frames. 
], batch size: 607, lr: 4.47e-03, grad_scale: 16.0 2023-06-22 01:11:41,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1107222.0, ans=0.05 2023-06-22 01:11:45,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1107222.0, ans=0.0 2023-06-22 01:11:50,900 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:12:36,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1107342.0, ans=0.1 2023-06-22 01:12:43,971 INFO [train.py:996] (2/4) Epoch 7, batch 1600, loss[loss=0.3026, simple_loss=0.3788, pruned_loss=0.1132, over 21634.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3201, pruned_loss=0.08767, over 4281585.16 frames. ], batch size: 414, lr: 4.47e-03, grad_scale: 32.0 2023-06-22 01:12:55,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1107402.0, ans=0.1 2023-06-22 01:13:01,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0 2023-06-22 01:13:24,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1107522.0, ans=0.1 2023-06-22 01:13:39,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.225e+02 3.079e+02 3.590e+02 4.621e+02 8.115e+02, threshold=7.180e+02, percent-clipped=4.0 2023-06-22 01:14:17,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1107642.0, ans=0.0 2023-06-22 01:14:22,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1107642.0, ans=0.125 2023-06-22 01:14:25,769 INFO [train.py:996] (2/4) Epoch 7, batch 1650, loss[loss=0.2265, simple_loss=0.3011, pruned_loss=0.07596, over 21676.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3192, pruned_loss=0.08662, over 4277022.04 frames. ], batch size: 332, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:14:26,028 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:14:31,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.99 vs. limit=15.0 2023-06-22 01:14:45,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1107762.0, ans=0.125 2023-06-22 01:16:07,493 INFO [train.py:996] (2/4) Epoch 7, batch 1700, loss[loss=0.2357, simple_loss=0.3054, pruned_loss=0.08302, over 21617.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3221, pruned_loss=0.08744, over 4285846.44 frames. 
], batch size: 263, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:16:31,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1108062.0, ans=0.1 2023-06-22 01:16:42,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1108062.0, ans=0.1 2023-06-22 01:16:42,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=15.0 2023-06-22 01:16:59,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.76 vs. limit=15.0 2023-06-22 01:17:13,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.308e+02 3.011e+02 3.603e+02 4.344e+02 6.909e+02, threshold=7.205e+02, percent-clipped=0.0 2023-06-22 01:17:36,329 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:17:37,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1108242.0, ans=0.125 2023-06-22 01:17:53,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1108302.0, ans=0.125 2023-06-22 01:17:54,161 INFO [train.py:996] (2/4) Epoch 7, batch 1750, loss[loss=0.1571, simple_loss=0.2224, pruned_loss=0.04587, over 21271.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3208, pruned_loss=0.08654, over 4286607.95 frames. ], batch size: 143, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:17:54,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1108302.0, ans=0.05 2023-06-22 01:18:15,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1108362.0, ans=0.0 2023-06-22 01:18:27,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=22.5 2023-06-22 01:18:48,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1108422.0, ans=0.07 2023-06-22 01:19:13,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1108542.0, ans=0.0 2023-06-22 01:19:31,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.09 vs. limit=6.0 2023-06-22 01:19:36,788 INFO [train.py:996] (2/4) Epoch 7, batch 1800, loss[loss=0.2858, simple_loss=0.3572, pruned_loss=0.1072, over 21389.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3207, pruned_loss=0.08484, over 4281238.38 frames. 
], batch size: 549, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:19:42,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1108602.0, ans=0.0 2023-06-22 01:19:45,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1108602.0, ans=0.0 2023-06-22 01:19:54,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1108662.0, ans=0.0 2023-06-22 01:20:29,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1108722.0, ans=0.125 2023-06-22 01:20:36,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.982e+02 2.999e+02 3.807e+02 4.640e+02 8.092e+02, threshold=7.614e+02, percent-clipped=1.0 2023-06-22 01:21:12,514 INFO [train.py:996] (2/4) Epoch 7, batch 1850, loss[loss=0.1818, simple_loss=0.262, pruned_loss=0.05082, over 21431.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.319, pruned_loss=0.08242, over 4281076.15 frames. ], batch size: 211, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:21:14,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1108902.0, ans=0.125 2023-06-22 01:21:14,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1108902.0, ans=0.125 2023-06-22 01:21:27,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1108962.0, ans=0.0 2023-06-22 01:22:01,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1109022.0, ans=0.0 2023-06-22 01:22:09,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.58 vs. limit=15.0 2023-06-22 01:22:15,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1109082.0, ans=0.125 2023-06-22 01:22:38,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1109142.0, ans=0.0 2023-06-22 01:22:38,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1109142.0, ans=0.125 2023-06-22 01:22:51,993 INFO [train.py:996] (2/4) Epoch 7, batch 1900, loss[loss=0.2293, simple_loss=0.2907, pruned_loss=0.08394, over 21372.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3197, pruned_loss=0.08348, over 4277907.29 frames. ], batch size: 194, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:22:56,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1109202.0, ans=15.0 2023-06-22 01:23:17,978 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.09 vs. 
limit=10.0 2023-06-22 01:23:39,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1109322.0, ans=0.07 2023-06-22 01:23:55,369 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.011e+02 3.315e+02 4.225e+02 7.544e+02, threshold=6.631e+02, percent-clipped=0.0 2023-06-22 01:24:26,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.19 vs. limit=22.5 2023-06-22 01:24:31,560 INFO [train.py:996] (2/4) Epoch 7, batch 1950, loss[loss=0.2412, simple_loss=0.2922, pruned_loss=0.09509, over 21833.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3154, pruned_loss=0.0832, over 4280832.99 frames. ], batch size: 98, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:24:48,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1109502.0, ans=0.125 2023-06-22 01:25:09,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1109562.0, ans=0.0 2023-06-22 01:25:16,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1109622.0, ans=0.07 2023-06-22 01:25:20,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1109622.0, ans=0.0 2023-06-22 01:25:26,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.48 vs. limit=15.0 2023-06-22 01:25:35,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1109682.0, ans=0.125 2023-06-22 01:26:08,030 INFO [train.py:996] (2/4) Epoch 7, batch 2000, loss[loss=0.1761, simple_loss=0.252, pruned_loss=0.05012, over 21588.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3113, pruned_loss=0.08122, over 4283008.81 frames. ], batch size: 230, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:26:21,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1109802.0, ans=0.2 2023-06-22 01:26:24,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1109802.0, ans=0.125 2023-06-22 01:26:50,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1109922.0, ans=0.125 2023-06-22 01:26:57,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1109922.0, ans=0.125 2023-06-22 01:27:02,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.68 vs. 
limit=15.0 2023-06-22 01:27:03,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1109922.0, ans=0.05 2023-06-22 01:27:08,093 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.319e+02 3.011e+02 3.534e+02 4.204e+02 7.079e+02, threshold=7.069e+02, percent-clipped=1.0 2023-06-22 01:27:10,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1109982.0, ans=0.1 2023-06-22 01:27:38,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1110042.0, ans=0.0 2023-06-22 01:27:43,210 INFO [train.py:996] (2/4) Epoch 7, batch 2050, loss[loss=0.1963, simple_loss=0.2655, pruned_loss=0.06352, over 21511.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3103, pruned_loss=0.08046, over 4278709.47 frames. ], batch size: 212, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:27:47,874 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:29:14,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1110342.0, ans=0.04949747468305833 2023-06-22 01:29:28,338 INFO [train.py:996] (2/4) Epoch 7, batch 2100, loss[loss=0.2442, simple_loss=0.3064, pruned_loss=0.09097, over 21330.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3135, pruned_loss=0.08177, over 4279849.57 frames. ], batch size: 471, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:30:08,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1110522.0, ans=0.0 2023-06-22 01:30:34,417 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.474e+02 3.462e+02 4.025e+02 4.907e+02 9.309e+02, threshold=8.051e+02, percent-clipped=5.0 2023-06-22 01:30:34,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1110582.0, ans=0.2 2023-06-22 01:30:54,546 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-22 01:31:08,523 INFO [train.py:996] (2/4) Epoch 7, batch 2150, loss[loss=0.2286, simple_loss=0.3032, pruned_loss=0.07699, over 21227.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3168, pruned_loss=0.08405, over 4278941.29 frames. ], batch size: 159, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:31:08,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1110702.0, ans=0.125 2023-06-22 01:32:38,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=15.0 2023-06-22 01:32:45,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1110942.0, ans=0.1 2023-06-22 01:32:47,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1111002.0, ans=0.125 2023-06-22 01:32:48,364 INFO [train.py:996] (2/4) Epoch 7, batch 2200, loss[loss=0.2006, simple_loss=0.273, pruned_loss=0.06408, over 21225.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3187, pruned_loss=0.08374, over 4271786.57 frames. 
], batch size: 159, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:33:14,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1111062.0, ans=0.125 2023-06-22 01:33:37,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111122.0, ans=0.1 2023-06-22 01:33:48,368 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.551e+02 3.115e+02 3.820e+02 5.117e+02 8.192e+02, threshold=7.640e+02, percent-clipped=1.0 2023-06-22 01:33:56,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.33 vs. limit=15.0 2023-06-22 01:34:05,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1111242.0, ans=0.125 2023-06-22 01:34:27,178 INFO [train.py:996] (2/4) Epoch 7, batch 2250, loss[loss=0.2151, simple_loss=0.2833, pruned_loss=0.07347, over 21706.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3141, pruned_loss=0.08192, over 4267583.09 frames. ], batch size: 316, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:34:45,093 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. limit=10.0 2023-06-22 01:34:47,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1111362.0, ans=0.125 2023-06-22 01:35:00,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1111362.0, ans=0.2 2023-06-22 01:35:33,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1111482.0, ans=0.2 2023-06-22 01:35:56,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111542.0, ans=0.1 2023-06-22 01:36:02,578 INFO [train.py:996] (2/4) Epoch 7, batch 2300, loss[loss=0.2031, simple_loss=0.269, pruned_loss=0.06855, over 21653.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3111, pruned_loss=0.08166, over 4270487.18 frames. ], batch size: 333, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:36:07,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1111602.0, ans=0.0 2023-06-22 01:36:17,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1111662.0, ans=0.2 2023-06-22 01:36:52,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111722.0, ans=0.1 2023-06-22 01:37:09,391 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.274e+02 3.004e+02 3.529e+02 4.212e+02 9.324e+02, threshold=7.058e+02, percent-clipped=1.0 2023-06-22 01:37:19,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1111782.0, ans=0.1 2023-06-22 01:37:42,524 INFO [train.py:996] (2/4) Epoch 7, batch 2350, loss[loss=0.2376, simple_loss=0.292, pruned_loss=0.09165, over 21724.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.312, pruned_loss=0.08363, over 4261209.03 frames. 
], batch size: 351, lr: 4.46e-03, grad_scale: 16.0 2023-06-22 01:37:52,958 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0 2023-06-22 01:38:53,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1112082.0, ans=0.95 2023-06-22 01:39:17,641 INFO [train.py:996] (2/4) Epoch 7, batch 2400, loss[loss=0.2809, simple_loss=0.3503, pruned_loss=0.1057, over 21602.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3151, pruned_loss=0.08634, over 4265264.59 frames. ], batch size: 415, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:40:10,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.56 vs. limit=6.0 2023-06-22 01:40:25,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.202e+02 3.130e+02 3.615e+02 4.219e+02 6.751e+02, threshold=7.231e+02, percent-clipped=0.0 2023-06-22 01:40:48,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1112442.0, ans=0.0 2023-06-22 01:40:59,118 INFO [train.py:996] (2/4) Epoch 7, batch 2450, loss[loss=0.2129, simple_loss=0.2835, pruned_loss=0.07121, over 21628.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3193, pruned_loss=0.08802, over 4270525.94 frames. ], batch size: 332, lr: 4.46e-03, grad_scale: 32.0 2023-06-22 01:41:01,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1112502.0, ans=0.2 2023-06-22 01:41:20,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1112562.0, ans=0.125 2023-06-22 01:41:33,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1112562.0, ans=0.2 2023-06-22 01:42:09,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1112682.0, ans=0.125 2023-06-22 01:42:40,146 INFO [train.py:996] (2/4) Epoch 7, batch 2500, loss[loss=0.2134, simple_loss=0.2718, pruned_loss=0.07752, over 21547.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3157, pruned_loss=0.08667, over 4264005.60 frames. ], batch size: 442, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:42:51,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1112802.0, ans=0.125 2023-06-22 01:43:30,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.84 vs. limit=22.5 2023-06-22 01:43:48,530 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.242e+02 3.102e+02 3.610e+02 4.513e+02 8.483e+02, threshold=7.220e+02, percent-clipped=3.0 2023-06-22 01:44:21,323 INFO [train.py:996] (2/4) Epoch 7, batch 2550, loss[loss=0.2489, simple_loss=0.3104, pruned_loss=0.09374, over 21313.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3151, pruned_loss=0.0862, over 4265409.45 frames. 
], batch size: 131, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:44:34,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1113102.0, ans=0.0 2023-06-22 01:44:35,206 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-22 01:44:42,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-22 01:45:57,671 INFO [train.py:996] (2/4) Epoch 7, batch 2600, loss[loss=0.2636, simple_loss=0.3284, pruned_loss=0.0994, over 21576.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3174, pruned_loss=0.08796, over 4262682.36 frames. ], batch size: 263, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:46:07,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1113402.0, ans=0.1 2023-06-22 01:46:44,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1113522.0, ans=0.125 2023-06-22 01:47:06,145 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.648e+02 3.288e+02 3.616e+02 4.316e+02 7.089e+02, threshold=7.232e+02, percent-clipped=0.0 2023-06-22 01:47:10,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-22 01:47:39,070 INFO [train.py:996] (2/4) Epoch 7, batch 2650, loss[loss=0.3083, simple_loss=0.3498, pruned_loss=0.1334, over 21695.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.317, pruned_loss=0.0879, over 4267730.88 frames. ], batch size: 508, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:47:42,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1113702.0, ans=0.125 2023-06-22 01:48:01,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1113702.0, ans=0.1 2023-06-22 01:48:08,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1113762.0, ans=0.125 2023-06-22 01:48:15,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1113762.0, ans=0.125 2023-06-22 01:48:26,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1113822.0, ans=0.125 2023-06-22 01:48:28,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1113822.0, ans=0.125 2023-06-22 01:49:04,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1113942.0, ans=0.125 2023-06-22 01:49:09,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.57 vs. 
limit=22.5 2023-06-22 01:49:17,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1113942.0, ans=0.1 2023-06-22 01:49:19,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-22 01:49:19,714 INFO [train.py:996] (2/4) Epoch 7, batch 2700, loss[loss=0.2611, simple_loss=0.3264, pruned_loss=0.0979, over 21871.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.315, pruned_loss=0.0861, over 4275017.09 frames. ], batch size: 124, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:49:21,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1114002.0, ans=0.125 2023-06-22 01:49:54,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1114062.0, ans=0.1 2023-06-22 01:50:28,235 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.479e+02 2.952e+02 3.435e+02 4.194e+02 7.834e+02, threshold=6.870e+02, percent-clipped=2.0 2023-06-22 01:50:56,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1114242.0, ans=0.0 2023-06-22 01:51:00,682 INFO [train.py:996] (2/4) Epoch 7, batch 2750, loss[loss=0.2891, simple_loss=0.3551, pruned_loss=0.1116, over 21727.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3145, pruned_loss=0.08679, over 4280651.21 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:51:01,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.45 vs. limit=6.0 2023-06-22 01:51:13,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1114302.0, ans=0.0 2023-06-22 01:51:47,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1114422.0, ans=0.125 2023-06-22 01:52:12,812 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:52:30,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.24 vs. limit=10.0 2023-06-22 01:52:53,284 INFO [train.py:996] (2/4) Epoch 7, batch 2800, loss[loss=0.2314, simple_loss=0.3087, pruned_loss=0.07705, over 21632.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3175, pruned_loss=0.08768, over 4285921.64 frames. 
], batch size: 230, lr: 4.45e-03, grad_scale: 32.0 2023-06-22 01:52:57,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114602.0, ans=0.1 2023-06-22 01:53:00,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1114602.0, ans=0.125 2023-06-22 01:53:03,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114602.0, ans=0.1 2023-06-22 01:53:14,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1114662.0, ans=0.035 2023-06-22 01:53:16,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1114662.0, ans=0.0 2023-06-22 01:53:32,118 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:53:57,889 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.206e+02 3.798e+02 4.545e+02 8.220e+02, threshold=7.596e+02, percent-clipped=2.0 2023-06-22 01:53:58,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1114782.0, ans=0.125 2023-06-22 01:54:08,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1114842.0, ans=0.1 2023-06-22 01:54:27,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1114842.0, ans=0.1 2023-06-22 01:54:35,959 INFO [train.py:996] (2/4) Epoch 7, batch 2850, loss[loss=0.2526, simple_loss=0.3334, pruned_loss=0.08593, over 21752.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3213, pruned_loss=0.08959, over 4282227.53 frames. ], batch size: 441, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:55:03,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1114962.0, ans=0.0 2023-06-22 01:55:15,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1115022.0, ans=0.125 2023-06-22 01:55:49,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1115142.0, ans=0.125 2023-06-22 01:55:59,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1115142.0, ans=0.125 2023-06-22 01:56:12,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=12.0 2023-06-22 01:56:16,773 INFO [train.py:996] (2/4) Epoch 7, batch 2900, loss[loss=0.2414, simple_loss=0.3092, pruned_loss=0.08681, over 21736.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3207, pruned_loss=0.09036, over 4285772.55 frames. 
], batch size: 389, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:56:18,576 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 01:56:38,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1115262.0, ans=0.0 2023-06-22 01:57:22,698 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.403e+02 3.184e+02 3.787e+02 4.850e+02 9.590e+02, threshold=7.574e+02, percent-clipped=4.0 2023-06-22 01:57:58,390 INFO [train.py:996] (2/4) Epoch 7, batch 2950, loss[loss=0.2716, simple_loss=0.3531, pruned_loss=0.09504, over 21797.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3202, pruned_loss=0.08981, over 4285405.36 frames. ], batch size: 247, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:58:10,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1115502.0, ans=0.125 2023-06-22 01:58:55,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1115622.0, ans=0.0 2023-06-22 01:59:14,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1115682.0, ans=0.125 2023-06-22 01:59:39,987 INFO [train.py:996] (2/4) Epoch 7, batch 3000, loss[loss=0.3588, simple_loss=0.4027, pruned_loss=0.1575, over 21385.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3244, pruned_loss=0.09071, over 4288849.74 frames. ], batch size: 508, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 01:59:39,987 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 01:59:49,367 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.5.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.8397, 3.1956, 3.3560, 3.1504], device='cuda:2') 2023-06-22 01:59:53,782 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.2157, 2.2674, 1.9990, 2.8438, 1.6294, 2.6804, 2.2213, 2.2318], device='cuda:2') 2023-06-22 01:59:54,961 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.5407, 2.9557, 1.6192, 1.8246], device='cuda:2') 2023-06-22 01:59:54,975 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.8349, 2.5913, 1.3342, 1.5337], device='cuda:2') 2023-06-22 01:59:56,488 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2473, simple_loss=0.3435, pruned_loss=0.07556, over 1796401.00 frames. 2023-06-22 01:59:56,489 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 02:00:55,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. 
limit=15.0 2023-06-22 02:00:56,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1115922.0, ans=0.1 2023-06-22 02:01:11,443 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.189e+02 3.683e+02 4.814e+02 8.214e+02, threshold=7.366e+02, percent-clipped=1.0 2023-06-22 02:01:19,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1116042.0, ans=0.2 2023-06-22 02:01:36,604 INFO [train.py:996] (2/4) Epoch 7, batch 3050, loss[loss=0.2041, simple_loss=0.279, pruned_loss=0.06457, over 21451.00 frames. ], tot_loss[loss=0.2502, simple_loss=0.3241, pruned_loss=0.08816, over 4282278.54 frames. ], batch size: 194, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:01:58,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-22 02:02:29,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1116222.0, ans=0.05 2023-06-22 02:02:36,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1116222.0, ans=0.125 2023-06-22 02:03:24,154 INFO [train.py:996] (2/4) Epoch 7, batch 3100, loss[loss=0.2254, simple_loss=0.319, pruned_loss=0.06589, over 21665.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3228, pruned_loss=0.08662, over 4283018.70 frames. ], batch size: 263, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:04:34,768 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.255e+02 2.895e+02 3.297e+02 4.092e+02 7.123e+02, threshold=6.595e+02, percent-clipped=0.0 2023-06-22 02:04:42,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1116582.0, ans=0.125 2023-06-22 02:04:57,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1116642.0, ans=0.1 2023-06-22 02:04:59,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1116642.0, ans=0.1 2023-06-22 02:05:11,839 INFO [train.py:996] (2/4) Epoch 7, batch 3150, loss[loss=0.2983, simple_loss=0.372, pruned_loss=0.1123, over 21787.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3236, pruned_loss=0.08741, over 4273198.68 frames. ], batch size: 124, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:06:00,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1116822.0, ans=0.1 2023-06-22 02:06:52,839 INFO [train.py:996] (2/4) Epoch 7, batch 3200, loss[loss=0.2628, simple_loss=0.3477, pruned_loss=0.08893, over 21724.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3243, pruned_loss=0.08756, over 4271138.36 frames. ], batch size: 389, lr: 4.45e-03, grad_scale: 32.0 2023-06-22 02:07:49,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.92 vs. 
limit=12.0 2023-06-22 02:08:05,123 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.099e+02 2.937e+02 3.458e+02 4.160e+02 8.829e+02, threshold=6.916e+02, percent-clipped=6.0 2023-06-22 02:08:08,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1117182.0, ans=0.125 2023-06-22 02:08:22,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.18 vs. limit=15.0 2023-06-22 02:08:34,737 INFO [train.py:996] (2/4) Epoch 7, batch 3250, loss[loss=0.2297, simple_loss=0.2948, pruned_loss=0.08231, over 21748.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3241, pruned_loss=0.08879, over 4266841.42 frames. ], batch size: 124, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:08:59,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1117362.0, ans=0.0 2023-06-22 02:09:00,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=22.5 2023-06-22 02:09:14,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1117422.0, ans=0.2 2023-06-22 02:09:38,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1117482.0, ans=0.125 2023-06-22 02:09:43,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1117482.0, ans=0.0 2023-06-22 02:10:05,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1117542.0, ans=0.0 2023-06-22 02:10:20,726 INFO [train.py:996] (2/4) Epoch 7, batch 3300, loss[loss=0.2278, simple_loss=0.3254, pruned_loss=0.06509, over 21677.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3184, pruned_loss=0.08799, over 4274310.91 frames. ], batch size: 298, lr: 4.45e-03, grad_scale: 16.0 2023-06-22 02:10:45,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1117662.0, ans=0.125 2023-06-22 02:11:23,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1117782.0, ans=0.2 2023-06-22 02:11:26,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.398e+02 2.992e+02 3.641e+02 4.480e+02 7.487e+02, threshold=7.281e+02, percent-clipped=2.0 2023-06-22 02:11:38,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1117842.0, ans=0.5 2023-06-22 02:11:40,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1117842.0, ans=0.2 2023-06-22 02:12:00,472 INFO [train.py:996] (2/4) Epoch 7, batch 3350, loss[loss=0.219, simple_loss=0.3125, pruned_loss=0.06275, over 21305.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3206, pruned_loss=0.08748, over 4280557.15 frames. 
], batch size: 211, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:12:38,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1118022.0, ans=10.0 2023-06-22 02:12:41,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=22.5 2023-06-22 02:13:02,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1118082.0, ans=0.125 2023-06-22 02:13:04,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1118082.0, ans=0.0 2023-06-22 02:13:46,397 INFO [train.py:996] (2/4) Epoch 7, batch 3400, loss[loss=0.2527, simple_loss=0.3325, pruned_loss=0.08642, over 21341.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3208, pruned_loss=0.0883, over 4287322.10 frames. ], batch size: 176, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:14:34,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-22 02:14:43,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1118382.0, ans=0.125 2023-06-22 02:14:52,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.339e+02 3.058e+02 3.533e+02 4.088e+02 6.686e+02, threshold=7.066e+02, percent-clipped=0.0 2023-06-22 02:15:02,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1118442.0, ans=0.125 2023-06-22 02:15:04,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1118442.0, ans=0.1 2023-06-22 02:15:26,621 INFO [train.py:996] (2/4) Epoch 7, batch 3450, loss[loss=0.2564, simple_loss=0.3466, pruned_loss=0.08315, over 20860.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.316, pruned_loss=0.08731, over 4285522.19 frames. ], batch size: 607, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:16:06,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1118622.0, ans=0.125 2023-06-22 02:16:15,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.76 vs. limit=6.0 2023-06-22 02:16:52,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1118742.0, ans=0.0 2023-06-22 02:17:01,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1118742.0, ans=0.125 2023-06-22 02:17:03,867 INFO [train.py:996] (2/4) Epoch 7, batch 3500, loss[loss=0.3022, simple_loss=0.3739, pruned_loss=0.1152, over 21707.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3273, pruned_loss=0.09122, over 4284543.31 frames. 
], batch size: 298, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:17:07,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1118802.0, ans=0.1 2023-06-22 02:17:22,159 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:17:32,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=15.0 2023-06-22 02:17:34,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1118862.0, ans=0.0 2023-06-22 02:17:48,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=15.0 2023-06-22 02:18:20,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.453e+02 3.920e+02 4.708e+02 8.175e+02, threshold=7.839e+02, percent-clipped=4.0 2023-06-22 02:18:44,611 INFO [train.py:996] (2/4) Epoch 7, batch 3550, loss[loss=0.2378, simple_loss=0.3113, pruned_loss=0.0821, over 21739.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3292, pruned_loss=0.09197, over 4289763.02 frames. ], batch size: 351, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:18:54,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1119102.0, ans=0.125 2023-06-22 02:19:42,871 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=15.0 2023-06-22 02:20:04,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-22 02:20:05,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1119342.0, ans=0.2 2023-06-22 02:20:17,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1119342.0, ans=0.2 2023-06-22 02:20:19,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1119402.0, ans=0.0 2023-06-22 02:20:19,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1119402.0, ans=0.0 2023-06-22 02:20:20,916 INFO [train.py:996] (2/4) Epoch 7, batch 3600, loss[loss=0.2684, simple_loss=0.3251, pruned_loss=0.1058, over 21843.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3237, pruned_loss=0.09164, over 4287534.56 frames. 
], batch size: 118, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:20:27,825 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:21:34,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1119582.0, ans=0.125 2023-06-22 02:21:38,450 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.375e+02 4.106e+02 5.045e+02 9.366e+02, threshold=8.213e+02, percent-clipped=2.0 2023-06-22 02:21:48,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1119642.0, ans=0.125 2023-06-22 02:22:03,419 INFO [train.py:996] (2/4) Epoch 7, batch 3650, loss[loss=0.2573, simple_loss=0.3148, pruned_loss=0.09996, over 21762.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3259, pruned_loss=0.09235, over 4285165.47 frames. ], batch size: 124, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:22:18,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1119702.0, ans=0.025 2023-06-22 02:22:59,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.06 vs. limit=15.0 2023-06-22 02:23:14,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1119882.0, ans=10.0 2023-06-22 02:23:16,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-22 02:23:26,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-22 02:23:43,415 INFO [train.py:996] (2/4) Epoch 7, batch 3700, loss[loss=0.2638, simple_loss=0.3331, pruned_loss=0.09719, over 21869.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3264, pruned_loss=0.09241, over 4286422.88 frames. ], batch size: 118, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:24:22,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1120062.0, ans=0.125 2023-06-22 02:24:35,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=12.0 2023-06-22 02:25:01,657 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.265e+02 3.250e+02 3.889e+02 4.859e+02 8.141e+02, threshold=7.777e+02, percent-clipped=0.0 2023-06-22 02:25:24,492 INFO [train.py:996] (2/4) Epoch 7, batch 3750, loss[loss=0.2075, simple_loss=0.297, pruned_loss=0.05899, over 21019.00 frames. ], tot_loss[loss=0.253, simple_loss=0.3241, pruned_loss=0.09094, over 4290310.86 frames. ], batch size: 608, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:25:39,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.29 vs. limit=12.0 2023-06-22 02:25:54,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1120362.0, ans=0.1 2023-06-22 02:26:15,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.34 vs. 
limit=15.0 2023-06-22 02:26:43,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1120482.0, ans=0.125 2023-06-22 02:27:10,040 INFO [train.py:996] (2/4) Epoch 7, batch 3800, loss[loss=0.3061, simple_loss=0.3711, pruned_loss=0.1205, over 21571.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3226, pruned_loss=0.08965, over 4282703.43 frames. ], batch size: 415, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:27:45,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1120662.0, ans=0.125 2023-06-22 02:28:18,656 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.223e+02 3.848e+02 4.886e+02 9.152e+02, threshold=7.696e+02, percent-clipped=1.0 2023-06-22 02:28:46,244 INFO [train.py:996] (2/4) Epoch 7, batch 3850, loss[loss=0.2097, simple_loss=0.2633, pruned_loss=0.078, over 21149.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3195, pruned_loss=0.0896, over 4275949.78 frames. ], batch size: 143, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:30:01,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1121082.0, ans=0.125 2023-06-22 02:30:25,855 INFO [train.py:996] (2/4) Epoch 7, batch 3900, loss[loss=0.271, simple_loss=0.3335, pruned_loss=0.1042, over 21635.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3146, pruned_loss=0.08914, over 4273635.32 frames. ], batch size: 389, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:30:39,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1121202.0, ans=0.0 2023-06-22 02:31:26,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1121322.0, ans=0.125 2023-06-22 02:31:32,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1121382.0, ans=0.0 2023-06-22 02:31:38,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.017e+02 3.574e+02 4.086e+02 6.704e+02, threshold=7.148e+02, percent-clipped=0.0 2023-06-22 02:31:39,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1121382.0, ans=0.125 2023-06-22 02:31:46,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-22 02:31:50,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1121442.0, ans=0.0 2023-06-22 02:32:06,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1121442.0, ans=0.125 2023-06-22 02:32:12,131 INFO [train.py:996] (2/4) Epoch 7, batch 3950, loss[loss=0.1819, simple_loss=0.2666, pruned_loss=0.04862, over 21342.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3168, pruned_loss=0.08832, over 4275116.90 frames. ], batch size: 131, lr: 4.44e-03, grad_scale: 16.0 2023-06-22 02:32:32,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. 
limit=22.5 2023-06-22 02:33:05,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1121622.0, ans=0.0 2023-06-22 02:33:52,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1121802.0, ans=22.5 2023-06-22 02:33:53,231 INFO [train.py:996] (2/4) Epoch 7, batch 4000, loss[loss=0.2262, simple_loss=0.282, pruned_loss=0.08518, over 21501.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3113, pruned_loss=0.08505, over 4275991.60 frames. ], batch size: 230, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:34:21,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1121862.0, ans=0.1 2023-06-22 02:34:24,254 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:35:00,540 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 2.938e+02 3.344e+02 4.048e+02 7.852e+02, threshold=6.687e+02, percent-clipped=1.0 2023-06-22 02:35:34,232 INFO [train.py:996] (2/4) Epoch 7, batch 4050, loss[loss=0.2493, simple_loss=0.3189, pruned_loss=0.08988, over 21167.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3106, pruned_loss=0.08325, over 4270566.95 frames. ], batch size: 608, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:36:37,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1122282.0, ans=0.125 2023-06-22 02:36:47,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1122282.0, ans=0.125 2023-06-22 02:37:05,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1122342.0, ans=0.1 2023-06-22 02:37:13,265 INFO [train.py:996] (2/4) Epoch 7, batch 4100, loss[loss=0.2139, simple_loss=0.2888, pruned_loss=0.06953, over 21643.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3108, pruned_loss=0.08316, over 4277859.98 frames. ], batch size: 230, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:37:59,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=12.0 2023-06-22 02:38:10,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1122582.0, ans=0.125 2023-06-22 02:38:26,483 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.954e+02 2.663e+02 2.992e+02 3.569e+02 4.943e+02, threshold=5.983e+02, percent-clipped=0.0 2023-06-22 02:38:53,874 INFO [train.py:996] (2/4) Epoch 7, batch 4150, loss[loss=0.2362, simple_loss=0.314, pruned_loss=0.07921, over 21757.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3102, pruned_loss=0.07965, over 4278495.15 frames. ], batch size: 316, lr: 4.44e-03, grad_scale: 32.0 2023-06-22 02:39:23,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1122762.0, ans=0.125 2023-06-22 02:40:06,931 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.18 vs. 
limit=6.0 2023-06-22 02:40:30,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1122942.0, ans=0.1 2023-06-22 02:40:45,124 INFO [train.py:996] (2/4) Epoch 7, batch 4200, loss[loss=0.2496, simple_loss=0.3221, pruned_loss=0.0886, over 21529.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3093, pruned_loss=0.07962, over 4270355.91 frames. ], batch size: 389, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:41:33,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1123122.0, ans=0.07 2023-06-22 02:41:56,340 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.876e+02 3.116e+02 3.775e+02 4.930e+02 8.993e+02, threshold=7.550e+02, percent-clipped=12.0 2023-06-22 02:42:00,094 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:42:12,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1123242.0, ans=0.1 2023-06-22 02:42:27,347 INFO [train.py:996] (2/4) Epoch 7, batch 4250, loss[loss=0.2745, simple_loss=0.3441, pruned_loss=0.1025, over 21489.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3139, pruned_loss=0.08195, over 4264282.59 frames. ], batch size: 194, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:42:32,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1123302.0, ans=0.125 2023-06-22 02:42:47,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1123362.0, ans=0.0 2023-06-22 02:44:06,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-22 02:44:09,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1123602.0, ans=0.0 2023-06-22 02:44:10,796 INFO [train.py:996] (2/4) Epoch 7, batch 4300, loss[loss=0.1978, simple_loss=0.247, pruned_loss=0.07431, over 20760.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3201, pruned_loss=0.0838, over 4266202.73 frames. ], batch size: 609, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:44:11,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1123602.0, ans=0.1 2023-06-22 02:44:51,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1123662.0, ans=0.125 2023-06-22 02:45:22,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1123782.0, ans=0.125 2023-06-22 02:45:31,291 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.449e+02 3.519e+02 4.292e+02 5.383e+02 8.752e+02, threshold=8.584e+02, percent-clipped=3.0 2023-06-22 02:45:52,079 INFO [train.py:996] (2/4) Epoch 7, batch 4350, loss[loss=0.2757, simple_loss=0.345, pruned_loss=0.1032, over 21597.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3189, pruned_loss=0.08344, over 4266484.48 frames. 
], batch size: 414, lr: 4.43e-03, grad_scale: 8.0 2023-06-22 02:46:06,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1123902.0, ans=0.0 2023-06-22 02:46:13,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1123962.0, ans=0.0 2023-06-22 02:46:28,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1123962.0, ans=0.0 2023-06-22 02:46:43,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1124022.0, ans=0.0 2023-06-22 02:47:39,352 INFO [train.py:996] (2/4) Epoch 7, batch 4400, loss[loss=0.209, simple_loss=0.2799, pruned_loss=0.06903, over 21786.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.314, pruned_loss=0.08295, over 4274572.33 frames. ], batch size: 112, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:47:44,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.50 vs. limit=15.0 2023-06-22 02:48:22,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1124262.0, ans=0.125 2023-06-22 02:48:40,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1124322.0, ans=0.0 2023-06-22 02:48:46,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1124382.0, ans=0.125 2023-06-22 02:48:57,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.429e+02 3.105e+02 3.566e+02 4.210e+02 6.733e+02, threshold=7.132e+02, percent-clipped=0.0 2023-06-22 02:49:05,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.09 vs. limit=5.0 2023-06-22 02:49:21,669 INFO [train.py:996] (2/4) Epoch 7, batch 4450, loss[loss=0.239, simple_loss=0.3155, pruned_loss=0.08128, over 21434.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3238, pruned_loss=0.08571, over 4279945.38 frames. ], batch size: 194, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:50:31,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1124682.0, ans=0.125 2023-06-22 02:50:53,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1124742.0, ans=0.2 2023-06-22 02:51:06,301 INFO [train.py:996] (2/4) Epoch 7, batch 4500, loss[loss=0.2561, simple_loss=0.3224, pruned_loss=0.09496, over 21415.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3243, pruned_loss=0.0869, over 4287804.95 frames. 
], batch size: 144, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:51:11,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=1124802.0, ans=12.0 2023-06-22 02:51:49,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1124922.0, ans=15.0 2023-06-22 02:52:01,732 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 02:52:22,681 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.121e+02 3.546e+02 4.306e+02 8.092e+02, threshold=7.092e+02, percent-clipped=3.0 2023-06-22 02:52:47,536 INFO [train.py:996] (2/4) Epoch 7, batch 4550, loss[loss=0.2764, simple_loss=0.3538, pruned_loss=0.09948, over 21742.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3289, pruned_loss=0.08818, over 4285025.59 frames. ], batch size: 332, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:53:13,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1125162.0, ans=0.2 2023-06-22 02:53:46,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1125282.0, ans=0.1 2023-06-22 02:53:53,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1125282.0, ans=0.125 2023-06-22 02:53:59,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1125282.0, ans=0.125 2023-06-22 02:54:33,354 INFO [train.py:996] (2/4) Epoch 7, batch 4600, loss[loss=0.2375, simple_loss=0.3185, pruned_loss=0.07827, over 21856.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3295, pruned_loss=0.08835, over 4279194.81 frames. ], batch size: 124, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:54:56,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1125462.0, ans=0.125 2023-06-22 02:55:20,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-06-22 02:55:36,886 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-22 02:55:43,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.156e+02 3.559e+02 4.324e+02 6.713e+02, threshold=7.117e+02, percent-clipped=0.0 2023-06-22 02:55:54,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1125642.0, ans=0.125 2023-06-22 02:56:05,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1125642.0, ans=0.2 2023-06-22 02:56:05,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1125642.0, ans=0.2 2023-06-22 02:56:13,809 INFO [train.py:996] (2/4) Epoch 7, batch 4650, loss[loss=0.2386, simple_loss=0.3066, pruned_loss=0.08536, over 21575.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3242, pruned_loss=0.0866, over 4280844.85 frames. 
], batch size: 471, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:56:28,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1125702.0, ans=0.0 2023-06-22 02:56:30,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-22 02:56:42,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1125762.0, ans=0.125 2023-06-22 02:56:44,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1125762.0, ans=0.2 2023-06-22 02:56:56,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-22 02:57:09,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1125822.0, ans=10.0 2023-06-22 02:57:12,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1125882.0, ans=0.125 2023-06-22 02:57:30,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1125882.0, ans=0.2 2023-06-22 02:57:49,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1125942.0, ans=0.125 2023-06-22 02:57:50,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1125942.0, ans=0.1 2023-06-22 02:57:51,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0 2023-06-22 02:57:58,272 INFO [train.py:996] (2/4) Epoch 7, batch 4700, loss[loss=0.2392, simple_loss=0.2994, pruned_loss=0.08944, over 21572.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3149, pruned_loss=0.08414, over 4275576.93 frames. ], batch size: 391, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 02:58:30,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1126122.0, ans=0.125 2023-06-22 02:58:32,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-22 02:59:03,083 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.110e+02 2.887e+02 3.239e+02 4.275e+02 6.571e+02, threshold=6.478e+02, percent-clipped=0.0 2023-06-22 02:59:31,460 INFO [train.py:996] (2/4) Epoch 7, batch 4750, loss[loss=0.249, simple_loss=0.3109, pruned_loss=0.09356, over 21335.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3108, pruned_loss=0.0845, over 4279486.96 frames. ], batch size: 143, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:00:23,978 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.82 vs. limit=12.0 2023-06-22 03:00:52,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1126542.0, ans=0.0 2023-06-22 03:01:16,456 INFO [train.py:996] (2/4) Epoch 7, batch 4800, loss[loss=0.2425, simple_loss=0.3418, pruned_loss=0.07166, over 21832.00 frames. 
], tot_loss[loss=0.2416, simple_loss=0.3119, pruned_loss=0.08562, over 4279988.20 frames. ], batch size: 316, lr: 4.43e-03, grad_scale: 32.0 2023-06-22 03:01:52,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1126722.0, ans=0.125 2023-06-22 03:02:14,870 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:02:23,964 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.311e+02 3.114e+02 3.560e+02 4.146e+02 5.866e+02, threshold=7.121e+02, percent-clipped=0.0 2023-06-22 03:02:42,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1126842.0, ans=0.125 2023-06-22 03:02:56,144 INFO [train.py:996] (2/4) Epoch 7, batch 4850, loss[loss=0.2234, simple_loss=0.2905, pruned_loss=0.07815, over 21779.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3125, pruned_loss=0.0857, over 4282428.23 frames. ], batch size: 247, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:03:04,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1126902.0, ans=0.125 2023-06-22 03:04:36,845 INFO [train.py:996] (2/4) Epoch 7, batch 4900, loss[loss=0.2592, simple_loss=0.3527, pruned_loss=0.08281, over 21816.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3164, pruned_loss=0.08636, over 4285363.09 frames. ], batch size: 316, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:05:16,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1127322.0, ans=0.1 2023-06-22 03:05:55,057 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.336e+02 3.055e+02 3.401e+02 3.973e+02 6.495e+02, threshold=6.802e+02, percent-clipped=0.0 2023-06-22 03:06:13,322 INFO [train.py:996] (2/4) Epoch 7, batch 4950, loss[loss=0.2052, simple_loss=0.2954, pruned_loss=0.05751, over 21238.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3205, pruned_loss=0.0844, over 4286578.47 frames. ], batch size: 176, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:06:34,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1127562.0, ans=0.0 2023-06-22 03:07:14,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1127682.0, ans=0.125 2023-06-22 03:07:52,438 INFO [train.py:996] (2/4) Epoch 7, batch 5000, loss[loss=0.2618, simple_loss=0.3299, pruned_loss=0.09687, over 21875.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3188, pruned_loss=0.08145, over 4282842.78 frames. ], batch size: 371, lr: 4.43e-03, grad_scale: 16.0 2023-06-22 03:07:54,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1127802.0, ans=0.0 2023-06-22 03:07:55,825 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:09:08,692 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.072e+02 2.797e+02 3.102e+02 3.690e+02 6.361e+02, threshold=6.203e+02, percent-clipped=0.0 2023-06-22 03:09:30,784 INFO [train.py:996] (2/4) Epoch 7, batch 5050, loss[loss=0.2703, simple_loss=0.3344, pruned_loss=0.1031, over 21778.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3188, pruned_loss=0.0836, over 4283659.35 frames. 
], batch size: 112, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:11:05,874 INFO [train.py:996] (2/4) Epoch 7, batch 5100, loss[loss=0.2385, simple_loss=0.3135, pruned_loss=0.08176, over 21734.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3179, pruned_loss=0.08415, over 4290520.35 frames. ], batch size: 389, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:11:29,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1128462.0, ans=0.125 2023-06-22 03:12:17,459 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.054e+02 3.634e+02 4.109e+02 8.042e+02, threshold=7.267e+02, percent-clipped=4.0 2023-06-22 03:12:22,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1128642.0, ans=0.125 2023-06-22 03:12:39,990 INFO [train.py:996] (2/4) Epoch 7, batch 5150, loss[loss=0.3409, simple_loss=0.3977, pruned_loss=0.142, over 21625.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3172, pruned_loss=0.08574, over 4291428.19 frames. ], batch size: 508, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:13:17,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1128822.0, ans=0.0 2023-06-22 03:14:15,493 INFO [train.py:996] (2/4) Epoch 7, batch 5200, loss[loss=0.2492, simple_loss=0.3437, pruned_loss=0.07736, over 21753.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3198, pruned_loss=0.08596, over 4281621.97 frames. ], batch size: 247, lr: 4.42e-03, grad_scale: 32.0 2023-06-22 03:14:16,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1129002.0, ans=0.0 2023-06-22 03:14:25,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1129002.0, ans=0.0 2023-06-22 03:14:31,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1129062.0, ans=0.125 2023-06-22 03:14:46,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1129122.0, ans=0.2 2023-06-22 03:15:38,261 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.326e+02 3.246e+02 3.920e+02 4.845e+02 8.696e+02, threshold=7.839e+02, percent-clipped=4.0 2023-06-22 03:15:54,264 INFO [train.py:996] (2/4) Epoch 7, batch 5250, loss[loss=0.2536, simple_loss=0.3272, pruned_loss=0.09004, over 21853.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3233, pruned_loss=0.08424, over 4286313.72 frames. ], batch size: 316, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:16:11,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1129362.0, ans=0.125 2023-06-22 03:16:30,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.06 vs. 
limit=15.0 2023-06-22 03:17:06,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1129482.0, ans=0.1 2023-06-22 03:17:21,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1129542.0, ans=0.0 2023-06-22 03:17:32,606 INFO [train.py:996] (2/4) Epoch 7, batch 5300, loss[loss=0.2486, simple_loss=0.3057, pruned_loss=0.09578, over 21786.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.323, pruned_loss=0.0851, over 4282144.87 frames. ], batch size: 247, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:17:46,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-22 03:18:54,578 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.014e+02 3.652e+02 4.152e+02 6.819e+02, threshold=7.305e+02, percent-clipped=0.0 2023-06-22 03:19:02,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1129842.0, ans=0.0 2023-06-22 03:19:06,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1129842.0, ans=0.125 2023-06-22 03:19:09,863 INFO [train.py:996] (2/4) Epoch 7, batch 5350, loss[loss=0.2628, simple_loss=0.3249, pruned_loss=0.1003, over 21519.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3222, pruned_loss=0.08703, over 4281567.05 frames. ], batch size: 131, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:19:48,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1130022.0, ans=0.125 2023-06-22 03:20:33,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1130142.0, ans=0.125 2023-06-22 03:20:46,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.39 vs. limit=15.0 2023-06-22 03:20:50,427 INFO [train.py:996] (2/4) Epoch 7, batch 5400, loss[loss=0.2616, simple_loss=0.3291, pruned_loss=0.09705, over 21890.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3191, pruned_loss=0.08721, over 4286424.61 frames. ], batch size: 124, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:21:58,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-22 03:22:07,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1130382.0, ans=0.1 2023-06-22 03:22:14,200 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.098e+02 2.826e+02 3.234e+02 3.815e+02 6.268e+02, threshold=6.469e+02, percent-clipped=0.0 2023-06-22 03:22:30,501 INFO [train.py:996] (2/4) Epoch 7, batch 5450, loss[loss=0.2243, simple_loss=0.3066, pruned_loss=0.07104, over 21326.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3195, pruned_loss=0.08556, over 4284953.61 frames. 
], batch size: 194, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:22:32,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1130502.0, ans=0.1 2023-06-22 03:23:26,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1130622.0, ans=0.0 2023-06-22 03:23:40,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1130682.0, ans=0.125 2023-06-22 03:24:00,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1130742.0, ans=0.125 2023-06-22 03:24:12,379 INFO [train.py:996] (2/4) Epoch 7, batch 5500, loss[loss=0.2229, simple_loss=0.3011, pruned_loss=0.07229, over 21244.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3241, pruned_loss=0.08257, over 4281690.47 frames. ], batch size: 159, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:24:53,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1130862.0, ans=0.0 2023-06-22 03:25:08,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1130922.0, ans=0.2 2023-06-22 03:25:17,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=22.5 2023-06-22 03:25:31,982 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.040e+02 3.055e+02 3.640e+02 4.334e+02 7.311e+02, threshold=7.280e+02, percent-clipped=2.0 2023-06-22 03:25:32,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1131042.0, ans=0.2 2023-06-22 03:25:35,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1131042.0, ans=0.125 2023-06-22 03:25:57,918 INFO [train.py:996] (2/4) Epoch 7, batch 5550, loss[loss=0.2293, simple_loss=0.295, pruned_loss=0.08185, over 21561.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3213, pruned_loss=0.07928, over 4277338.95 frames. ], batch size: 548, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:26:39,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1131222.0, ans=0.95 2023-06-22 03:27:00,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1131282.0, ans=0.125 2023-06-22 03:27:18,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1131342.0, ans=0.125 2023-06-22 03:27:30,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1131342.0, ans=0.125 2023-06-22 03:27:43,802 INFO [train.py:996] (2/4) Epoch 7, batch 5600, loss[loss=0.3805, simple_loss=0.457, pruned_loss=0.152, over 21438.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3197, pruned_loss=0.0772, over 4275077.99 frames. 
], batch size: 507, lr: 4.42e-03, grad_scale: 32.0 2023-06-22 03:27:50,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1131402.0, ans=0.0 2023-06-22 03:28:31,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1131522.0, ans=0.125 2023-06-22 03:28:33,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1131522.0, ans=0.125 2023-06-22 03:28:44,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1131582.0, ans=0.125 2023-06-22 03:29:03,133 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.029e+02 2.861e+02 3.641e+02 4.597e+02 1.091e+03, threshold=7.283e+02, percent-clipped=6.0 2023-06-22 03:29:03,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1131642.0, ans=0.0 2023-06-22 03:29:06,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1131642.0, ans=0.1 2023-06-22 03:29:22,532 INFO [train.py:996] (2/4) Epoch 7, batch 5650, loss[loss=0.2246, simple_loss=0.302, pruned_loss=0.07358, over 21812.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3216, pruned_loss=0.07902, over 4276909.82 frames. ], batch size: 282, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:29:27,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1131702.0, ans=0.125 2023-06-22 03:29:38,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1131702.0, ans=0.0 2023-06-22 03:30:33,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1131882.0, ans=0.0 2023-06-22 03:31:07,562 INFO [train.py:996] (2/4) Epoch 7, batch 5700, loss[loss=0.2131, simple_loss=0.2889, pruned_loss=0.06868, over 21516.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3205, pruned_loss=0.08013, over 4274461.26 frames. ], batch size: 131, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:31:12,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1132002.0, ans=0.2 2023-06-22 03:31:38,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1132122.0, ans=0.125 2023-06-22 03:31:39,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1132122.0, ans=0.1 2023-06-22 03:31:39,540 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. 
limit=22.5 2023-06-22 03:32:06,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1132182.0, ans=0.2 2023-06-22 03:32:29,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1132182.0, ans=0.125 2023-06-22 03:32:33,701 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.355e+02 3.008e+02 3.484e+02 4.188e+02 7.295e+02, threshold=6.968e+02, percent-clipped=1.0 2023-06-22 03:32:35,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1132242.0, ans=0.0 2023-06-22 03:32:46,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.10 vs. limit=22.5 2023-06-22 03:32:48,508 INFO [train.py:996] (2/4) Epoch 7, batch 5750, loss[loss=0.1957, simple_loss=0.3027, pruned_loss=0.04438, over 21179.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3167, pruned_loss=0.07718, over 4269349.10 frames. ], batch size: 548, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:32:55,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1132302.0, ans=0.125 2023-06-22 03:33:11,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1132362.0, ans=0.09899494936611666 2023-06-22 03:34:14,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.10 vs. limit=15.0 2023-06-22 03:34:25,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1132542.0, ans=0.125 2023-06-22 03:34:28,143 INFO [train.py:996] (2/4) Epoch 7, batch 5800, loss[loss=0.2289, simple_loss=0.3148, pruned_loss=0.07149, over 21600.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3167, pruned_loss=0.07571, over 4276575.32 frames. ], batch size: 230, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:34:28,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1132602.0, ans=0.1 2023-06-22 03:35:33,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.00 vs. limit=6.0 2023-06-22 03:35:44,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-06-22 03:35:47,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1132782.0, ans=0.125 2023-06-22 03:35:55,102 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.821e+02 2.844e+02 3.714e+02 4.784e+02 7.655e+02, threshold=7.428e+02, percent-clipped=1.0 2023-06-22 03:36:10,039 INFO [train.py:996] (2/4) Epoch 7, batch 5850, loss[loss=0.1729, simple_loss=0.2766, pruned_loss=0.03456, over 21803.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.315, pruned_loss=0.07212, over 4280254.83 frames. 
], batch size: 316, lr: 4.42e-03, grad_scale: 16.0 2023-06-22 03:36:24,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1132902.0, ans=0.2 2023-06-22 03:37:08,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-22 03:37:10,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=21.12 vs. limit=15.0 2023-06-22 03:37:49,848 INFO [train.py:996] (2/4) Epoch 7, batch 5900, loss[loss=0.1717, simple_loss=0.2636, pruned_loss=0.03989, over 21759.00 frames. ], tot_loss[loss=0.223, simple_loss=0.3097, pruned_loss=0.06821, over 4281902.09 frames. ], batch size: 298, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:38:44,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-22 03:39:08,927 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.703e+02 2.538e+02 2.951e+02 3.905e+02 7.879e+02, threshold=5.902e+02, percent-clipped=2.0 2023-06-22 03:39:12,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.22 vs. limit=10.0 2023-06-22 03:39:23,028 INFO [train.py:996] (2/4) Epoch 7, batch 5950, loss[loss=0.2542, simple_loss=0.3155, pruned_loss=0.09646, over 21920.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3083, pruned_loss=0.07156, over 4286994.51 frames. ], batch size: 316, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:39:49,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1133562.0, ans=0.125 2023-06-22 03:40:19,872 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=12.11 vs. limit=15.0 2023-06-22 03:41:00,583 INFO [train.py:996] (2/4) Epoch 7, batch 6000, loss[loss=0.2466, simple_loss=0.298, pruned_loss=0.09755, over 21764.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3059, pruned_loss=0.07374, over 4284694.98 frames. ], batch size: 351, lr: 4.41e-03, grad_scale: 32.0 2023-06-22 03:41:00,584 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 03:41:21,114 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2587, simple_loss=0.3532, pruned_loss=0.08209, over 1796401.00 frames. 2023-06-22 03:41:21,115 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 03:42:23,258 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:42:28,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1133982.0, ans=0.0 2023-06-22 03:42:43,734 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.078e+02 3.441e+02 4.226e+02 5.483e+02 1.064e+03, threshold=8.451e+02, percent-clipped=15.0 2023-06-22 03:42:58,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1134042.0, ans=0.0 2023-06-22 03:43:01,815 INFO [train.py:996] (2/4) Epoch 7, batch 6050, loss[loss=0.1891, simple_loss=0.2548, pruned_loss=0.06166, over 21628.00 frames. 
], tot_loss[loss=0.2269, simple_loss=0.302, pruned_loss=0.07591, over 4280910.55 frames. ], batch size: 231, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:43:12,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1134102.0, ans=0.0 2023-06-22 03:43:40,366 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 03:44:02,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1134282.0, ans=0.125 2023-06-22 03:44:22,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1134342.0, ans=0.0 2023-06-22 03:44:24,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1134342.0, ans=0.125 2023-06-22 03:44:39,792 INFO [train.py:996] (2/4) Epoch 7, batch 6100, loss[loss=0.2326, simple_loss=0.3062, pruned_loss=0.07951, over 21311.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2979, pruned_loss=0.07408, over 4280864.27 frames. ], batch size: 176, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:45:19,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1134522.0, ans=0.2 2023-06-22 03:46:01,318 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.844e+02 3.267e+02 3.769e+02 7.598e+02, threshold=6.534e+02, percent-clipped=0.0 2023-06-22 03:46:12,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=12.0 2023-06-22 03:46:24,527 INFO [train.py:996] (2/4) Epoch 7, batch 6150, loss[loss=0.3236, simple_loss=0.3668, pruned_loss=0.1402, over 21799.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3008, pruned_loss=0.07687, over 4284897.36 frames. ], batch size: 507, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:47:03,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1134822.0, ans=0.025 2023-06-22 03:47:15,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1134882.0, ans=0.125 2023-06-22 03:47:52,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1134942.0, ans=0.0 2023-06-22 03:48:01,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1135002.0, ans=0.5 2023-06-22 03:48:02,648 INFO [train.py:996] (2/4) Epoch 7, batch 6200, loss[loss=0.2564, simple_loss=0.3249, pruned_loss=0.0939, over 21649.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3033, pruned_loss=0.0772, over 4282285.07 frames. ], batch size: 230, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:48:38,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. 
limit=15.0 2023-06-22 03:49:18,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1135242.0, ans=0.125 2023-06-22 03:49:28,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.024e+02 2.933e+02 3.461e+02 4.493e+02 7.617e+02, threshold=6.923e+02, percent-clipped=2.0 2023-06-22 03:49:30,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1135242.0, ans=0.125 2023-06-22 03:49:38,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1135242.0, ans=0.0 2023-06-22 03:49:40,999 INFO [train.py:996] (2/4) Epoch 7, batch 6250, loss[loss=0.2318, simple_loss=0.3256, pruned_loss=0.06902, over 21689.00 frames. ], tot_loss[loss=0.234, simple_loss=0.311, pruned_loss=0.07843, over 4281595.97 frames. ], batch size: 247, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:49:50,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1135302.0, ans=0.035 2023-06-22 03:49:59,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2023-06-22 03:50:15,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135362.0, ans=0.1 2023-06-22 03:50:27,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1135422.0, ans=0.05 2023-06-22 03:50:55,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1135482.0, ans=0.0 2023-06-22 03:51:24,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1135602.0, ans=0.1 2023-06-22 03:51:25,835 INFO [train.py:996] (2/4) Epoch 7, batch 6300, loss[loss=0.2501, simple_loss=0.3235, pruned_loss=0.08832, over 21857.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3143, pruned_loss=0.07796, over 4280975.69 frames. ], batch size: 298, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:51:34,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1135602.0, ans=0.125 2023-06-22 03:51:39,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1135602.0, ans=0.2 2023-06-22 03:52:52,249 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.264e+02 3.056e+02 3.560e+02 4.261e+02 7.497e+02, threshold=7.120e+02, percent-clipped=1.0 2023-06-22 03:53:03,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.49 vs. limit=22.5 2023-06-22 03:53:05,182 INFO [train.py:996] (2/4) Epoch 7, batch 6350, loss[loss=0.2801, simple_loss=0.3436, pruned_loss=0.1082, over 21612.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3169, pruned_loss=0.08147, over 4280981.58 frames. 
], batch size: 415, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:53:34,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1135962.0, ans=0.0 2023-06-22 03:53:37,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1136022.0, ans=0.125 2023-06-22 03:54:09,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1136082.0, ans=0.125 2023-06-22 03:54:45,886 INFO [train.py:996] (2/4) Epoch 7, batch 6400, loss[loss=0.2689, simple_loss=0.3341, pruned_loss=0.1018, over 21377.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3218, pruned_loss=0.08518, over 4280430.65 frames. ], batch size: 176, lr: 4.41e-03, grad_scale: 32.0 2023-06-22 03:55:15,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1136262.0, ans=0.05 2023-06-22 03:55:26,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. limit=10.0 2023-06-22 03:55:36,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136322.0, ans=0.1 2023-06-22 03:56:10,186 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.525e+02 3.095e+02 3.612e+02 4.103e+02 7.644e+02, threshold=7.224e+02, percent-clipped=1.0 2023-06-22 03:56:21,478 INFO [train.py:996] (2/4) Epoch 7, batch 6450, loss[loss=0.2882, simple_loss=0.3409, pruned_loss=0.1177, over 21368.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3259, pruned_loss=0.08606, over 4277880.54 frames. ], batch size: 507, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:56:39,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1136562.0, ans=0.125 2023-06-22 03:56:40,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1136562.0, ans=0.0 2023-06-22 03:56:58,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1136562.0, ans=0.0 2023-06-22 03:58:00,818 INFO [train.py:996] (2/4) Epoch 7, batch 6500, loss[loss=0.2346, simple_loss=0.3194, pruned_loss=0.07485, over 21826.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3208, pruned_loss=0.0842, over 4271778.00 frames. ], batch size: 317, lr: 4.41e-03, grad_scale: 16.0 2023-06-22 03:58:07,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1136802.0, ans=0.2 2023-06-22 03:58:12,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1136802.0, ans=0.1 2023-06-22 03:58:55,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1136922.0, ans=0.1 2023-06-22 03:59:03,561 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.93 vs. 
limit=15.0 2023-06-22 03:59:28,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1137042.0, ans=0.1 2023-06-22 03:59:29,993 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.201e+02 3.034e+02 3.497e+02 4.367e+02 8.159e+02, threshold=6.993e+02, percent-clipped=3.0 2023-06-22 03:59:40,142 INFO [train.py:996] (2/4) Epoch 7, batch 6550, loss[loss=0.2727, simple_loss=0.3312, pruned_loss=0.1071, over 21777.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3197, pruned_loss=0.08311, over 4279582.61 frames. ], batch size: 112, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:00:25,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1137222.0, ans=0.2 2023-06-22 04:00:27,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1137222.0, ans=0.125 2023-06-22 04:01:05,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1137342.0, ans=0.0 2023-06-22 04:01:19,383 INFO [train.py:996] (2/4) Epoch 7, batch 6600, loss[loss=0.217, simple_loss=0.2768, pruned_loss=0.07857, over 21729.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3141, pruned_loss=0.0829, over 4259457.18 frames. ], batch size: 371, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:01:20,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1137402.0, ans=0.1 2023-06-22 04:01:29,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1137402.0, ans=0.2 2023-06-22 04:01:41,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1137462.0, ans=0.125 2023-06-22 04:01:44,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1137462.0, ans=0.0 2023-06-22 04:01:57,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1137462.0, ans=0.125 2023-06-22 04:02:49,268 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.035e+02 2.768e+02 3.180e+02 3.748e+02 5.312e+02, threshold=6.360e+02, percent-clipped=0.0 2023-06-22 04:02:59,291 INFO [train.py:996] (2/4) Epoch 7, batch 6650, loss[loss=0.2295, simple_loss=0.2917, pruned_loss=0.08366, over 21810.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3051, pruned_loss=0.07981, over 4261718.46 frames. ], batch size: 352, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:03:09,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1137702.0, ans=0.0 2023-06-22 04:03:10,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-22 04:04:17,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-22 04:04:21,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. 
limit=15.0 2023-06-22 04:04:22,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1137942.0, ans=0.0 2023-06-22 04:04:39,328 INFO [train.py:996] (2/4) Epoch 7, batch 6700, loss[loss=0.2128, simple_loss=0.288, pruned_loss=0.06882, over 21691.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.2993, pruned_loss=0.07962, over 4262107.90 frames. ], batch size: 282, lr: 4.41e-03, grad_scale: 8.0 2023-06-22 04:04:47,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1138002.0, ans=0.125 2023-06-22 04:05:07,360 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:05:32,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1138122.0, ans=0.1 2023-06-22 04:05:34,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1138122.0, ans=0.2 2023-06-22 04:06:07,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1138242.0, ans=0.125 2023-06-22 04:06:08,318 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.985e+02 2.940e+02 3.406e+02 4.084e+02 6.605e+02, threshold=6.813e+02, percent-clipped=1.0 2023-06-22 04:06:17,944 INFO [train.py:996] (2/4) Epoch 7, batch 6750, loss[loss=0.2662, simple_loss=0.3167, pruned_loss=0.1079, over 21405.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2972, pruned_loss=0.08015, over 4269701.54 frames. ], batch size: 194, lr: 4.40e-03, grad_scale: 8.0 2023-06-22 04:06:22,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1138302.0, ans=0.05 2023-06-22 04:06:52,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1138362.0, ans=0.125 2023-06-22 04:07:14,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1138422.0, ans=0.07 2023-06-22 04:07:38,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1138542.0, ans=0.125 2023-06-22 04:07:55,307 INFO [train.py:996] (2/4) Epoch 7, batch 6800, loss[loss=0.2213, simple_loss=0.356, pruned_loss=0.04336, over 19744.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2992, pruned_loss=0.08179, over 4267852.58 frames. 
], batch size: 702, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:08:00,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1138602.0, ans=0.0 2023-06-22 04:08:32,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1138662.0, ans=0.125 2023-06-22 04:09:00,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1138782.0, ans=0.125 2023-06-22 04:09:16,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1138842.0, ans=0.0 2023-06-22 04:09:24,423 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.399e+02 3.025e+02 3.541e+02 4.379e+02 6.653e+02, threshold=7.081e+02, percent-clipped=0.0 2023-06-22 04:09:24,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1138842.0, ans=0.0 2023-06-22 04:09:33,717 INFO [train.py:996] (2/4) Epoch 7, batch 6850, loss[loss=0.2275, simple_loss=0.2848, pruned_loss=0.08511, over 21167.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.2993, pruned_loss=0.08371, over 4276497.72 frames. ], batch size: 176, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:09:52,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1138902.0, ans=0.125 2023-06-22 04:11:14,095 INFO [train.py:996] (2/4) Epoch 7, batch 6900, loss[loss=0.22, simple_loss=0.3252, pruned_loss=0.05736, over 21528.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3027, pruned_loss=0.08332, over 4282355.39 frames. ], batch size: 471, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:11:41,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1139262.0, ans=0.0 2023-06-22 04:12:21,201 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 04:12:29,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.48 vs. limit=15.0 2023-06-22 04:12:41,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.079e+02 2.939e+02 3.532e+02 4.220e+02 8.926e+02, threshold=7.064e+02, percent-clipped=5.0 2023-06-22 04:12:47,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1139442.0, ans=0.125 2023-06-22 04:12:55,490 INFO [train.py:996] (2/4) Epoch 7, batch 6950, loss[loss=0.2563, simple_loss=0.338, pruned_loss=0.08732, over 21469.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3041, pruned_loss=0.08061, over 4278641.24 frames. ], batch size: 131, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:13:54,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-22 04:14:35,492 INFO [train.py:996] (2/4) Epoch 7, batch 7000, loss[loss=0.2441, simple_loss=0.3109, pruned_loss=0.08861, over 20803.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3078, pruned_loss=0.08301, over 4277649.10 frames. ], batch size: 608, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:15:14,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.74 vs. 
limit=12.0 2023-06-22 04:15:22,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1139922.0, ans=0.125 2023-06-22 04:15:27,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1139922.0, ans=0.035 2023-06-22 04:15:45,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1139982.0, ans=0.0 2023-06-22 04:16:01,350 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.353e+02 3.073e+02 3.537e+02 4.508e+02 8.250e+02, threshold=7.073e+02, percent-clipped=4.0 2023-06-22 04:16:10,894 INFO [train.py:996] (2/4) Epoch 7, batch 7050, loss[loss=0.2454, simple_loss=0.3334, pruned_loss=0.07877, over 21411.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3063, pruned_loss=0.08269, over 4278222.83 frames. ], batch size: 507, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:16:54,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1140222.0, ans=0.125 2023-06-22 04:17:07,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1140282.0, ans=0.125 2023-06-22 04:17:10,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.58 vs. limit=15.0 2023-06-22 04:17:29,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1140342.0, ans=0.2 2023-06-22 04:17:52,893 INFO [train.py:996] (2/4) Epoch 7, batch 7100, loss[loss=0.2779, simple_loss=0.3466, pruned_loss=0.1046, over 21395.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3106, pruned_loss=0.0845, over 4272463.73 frames. ], batch size: 471, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:17:54,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1140402.0, ans=0.125 2023-06-22 04:17:56,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1140402.0, ans=0.95 2023-06-22 04:18:15,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-22 04:18:17,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1140462.0, ans=0.125 2023-06-22 04:18:41,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.69 vs. limit=15.0 2023-06-22 04:18:57,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1140582.0, ans=0.125 2023-06-22 04:19:25,196 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.113e+02 2.884e+02 3.183e+02 3.903e+02 6.649e+02, threshold=6.367e+02, percent-clipped=0.0 2023-06-22 04:19:34,807 INFO [train.py:996] (2/4) Epoch 7, batch 7150, loss[loss=0.2655, simple_loss=0.3452, pruned_loss=0.0929, over 21845.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.309, pruned_loss=0.082, over 4272256.09 frames. 
], batch size: 118, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:19:38,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1140702.0, ans=0.125 2023-06-22 04:20:10,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1140822.0, ans=0.0 2023-06-22 04:20:23,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1140882.0, ans=0.2 2023-06-22 04:20:30,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1140882.0, ans=0.125 2023-06-22 04:21:02,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1140942.0, ans=0.125 2023-06-22 04:21:02,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1140942.0, ans=0.0 2023-06-22 04:21:07,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1140942.0, ans=0.125 2023-06-22 04:21:14,894 INFO [train.py:996] (2/4) Epoch 7, batch 7200, loss[loss=0.234, simple_loss=0.3, pruned_loss=0.08393, over 21829.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3126, pruned_loss=0.08447, over 4270523.21 frames. ], batch size: 317, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:21:28,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1141002.0, ans=0.025 2023-06-22 04:21:31,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1141062.0, ans=0.125 2023-06-22 04:21:34,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1141062.0, ans=0.1 2023-06-22 04:21:41,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.16 vs. limit=6.0 2023-06-22 04:21:42,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1141062.0, ans=10.0 2023-06-22 04:22:12,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1141182.0, ans=0.125 2023-06-22 04:22:18,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. limit=22.5 2023-06-22 04:22:44,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 2.920e+02 3.463e+02 4.119e+02 7.524e+02, threshold=6.925e+02, percent-clipped=3.0 2023-06-22 04:22:53,901 INFO [train.py:996] (2/4) Epoch 7, batch 7250, loss[loss=0.1932, simple_loss=0.2642, pruned_loss=0.0611, over 21594.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3078, pruned_loss=0.08453, over 4277809.92 frames. 
], batch size: 298, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:22:57,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1141302.0, ans=10.0 2023-06-22 04:23:18,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-22 04:23:23,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1141422.0, ans=0.0 2023-06-22 04:23:51,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1141482.0, ans=0.125 2023-06-22 04:24:20,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1141542.0, ans=0.0 2023-06-22 04:24:33,091 INFO [train.py:996] (2/4) Epoch 7, batch 7300, loss[loss=0.2228, simple_loss=0.2838, pruned_loss=0.08094, over 21817.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3033, pruned_loss=0.08336, over 4269417.07 frames. ], batch size: 352, lr: 4.40e-03, grad_scale: 32.0 2023-06-22 04:24:37,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1141602.0, ans=0.125 2023-06-22 04:25:20,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-06-22 04:26:04,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.235e+02 2.850e+02 3.371e+02 4.063e+02 7.936e+02, threshold=6.743e+02, percent-clipped=2.0 2023-06-22 04:26:13,046 INFO [train.py:996] (2/4) Epoch 7, batch 7350, loss[loss=0.2515, simple_loss=0.3047, pruned_loss=0.09914, over 21741.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3005, pruned_loss=0.08424, over 4266068.36 frames. ], batch size: 102, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:26:22,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-22 04:26:24,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1141902.0, ans=0.2 2023-06-22 04:26:58,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-22 04:27:07,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1142022.0, ans=0.1 2023-06-22 04:27:09,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1142022.0, ans=0.0 2023-06-22 04:27:16,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1142082.0, ans=0.1 2023-06-22 04:27:30,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1142082.0, ans=0.125 2023-06-22 04:27:49,466 INFO [train.py:996] (2/4) Epoch 7, batch 7400, loss[loss=0.288, simple_loss=0.3485, pruned_loss=0.1138, over 21609.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3059, pruned_loss=0.08609, over 4273802.46 frames. 
], batch size: 389, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:29:21,608 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.301e+02 3.129e+02 3.591e+02 4.476e+02 8.193e+02, threshold=7.182e+02, percent-clipped=3.0 2023-06-22 04:29:28,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1142502.0, ans=0.0 2023-06-22 04:29:29,600 INFO [train.py:996] (2/4) Epoch 7, batch 7450, loss[loss=0.2439, simple_loss=0.3121, pruned_loss=0.08786, over 21486.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3044, pruned_loss=0.08484, over 4278115.11 frames. ], batch size: 389, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:29:47,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1142562.0, ans=0.0 2023-06-22 04:30:06,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1142562.0, ans=0.1 2023-06-22 04:30:28,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1142622.0, ans=0.125 2023-06-22 04:31:10,682 INFO [train.py:996] (2/4) Epoch 7, batch 7500, loss[loss=0.2778, simple_loss=0.3824, pruned_loss=0.08661, over 21754.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3091, pruned_loss=0.08578, over 4284339.09 frames. ], batch size: 332, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:31:22,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1142802.0, ans=0.0 2023-06-22 04:32:07,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1142922.0, ans=0.125 2023-06-22 04:32:11,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1142922.0, ans=0.125 2023-06-22 04:32:13,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1142922.0, ans=0.125 2023-06-22 04:32:43,204 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.064e+02 3.393e+02 4.362e+02 5.679e+02 1.317e+03, threshold=8.723e+02, percent-clipped=9.0 2023-06-22 04:32:51,275 INFO [train.py:996] (2/4) Epoch 7, batch 7550, loss[loss=0.2087, simple_loss=0.3039, pruned_loss=0.05671, over 21712.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3157, pruned_loss=0.08427, over 4287194.40 frames. ], batch size: 298, lr: 4.40e-03, grad_scale: 16.0 2023-06-22 04:33:14,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1143162.0, ans=0.125 2023-06-22 04:33:46,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-22 04:34:21,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1143342.0, ans=0.0 2023-06-22 04:34:25,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1143342.0, ans=0.0 2023-06-22 04:34:30,101 INFO [train.py:996] (2/4) Epoch 7, batch 7600, loss[loss=0.2091, simple_loss=0.2723, pruned_loss=0.07301, over 20207.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3145, pruned_loss=0.0827, over 4288332.65 frames. 
], batch size: 702, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:35:22,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-22 04:35:26,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1143522.0, ans=0.0 2023-06-22 04:35:32,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1143582.0, ans=0.0 2023-06-22 04:35:37,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1143582.0, ans=0.0 2023-06-22 04:35:55,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1143642.0, ans=0.0 2023-06-22 04:35:56,522 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.126e+02 2.783e+02 3.344e+02 4.108e+02 6.348e+02, threshold=6.687e+02, percent-clipped=0.0 2023-06-22 04:36:04,724 INFO [train.py:996] (2/4) Epoch 7, batch 7650, loss[loss=0.2684, simple_loss=0.3282, pruned_loss=0.1043, over 21878.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3138, pruned_loss=0.08487, over 4283913.33 frames. ], batch size: 371, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:37:02,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=22.5 2023-06-22 04:37:06,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1143822.0, ans=0.0 2023-06-22 04:37:44,852 INFO [train.py:996] (2/4) Epoch 7, batch 7700, loss[loss=0.279, simple_loss=0.3561, pruned_loss=0.1009, over 21825.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3179, pruned_loss=0.08739, over 4286555.53 frames. ], batch size: 118, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:37:50,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-22 04:38:07,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1144002.0, ans=0.04949747468305833 2023-06-22 04:38:09,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.16 vs. limit=22.5 2023-06-22 04:38:31,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1144062.0, ans=0.1 2023-06-22 04:38:57,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-22 04:39:20,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.234e+02 3.101e+02 3.626e+02 4.244e+02 7.117e+02, threshold=7.252e+02, percent-clipped=1.0 2023-06-22 04:39:33,812 INFO [train.py:996] (2/4) Epoch 7, batch 7750, loss[loss=0.3, simple_loss=0.3989, pruned_loss=0.1006, over 21770.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3238, pruned_loss=0.0865, over 4281738.88 frames. ], batch size: 332, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:39:53,178 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.34 vs. 
limit=6.0 2023-06-22 04:40:20,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.61 vs. limit=22.5 2023-06-22 04:40:21,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1144422.0, ans=0.1 2023-06-22 04:40:25,140 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-22 04:40:36,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1144482.0, ans=0.125 2023-06-22 04:40:45,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1144482.0, ans=0.125 2023-06-22 04:41:20,040 INFO [train.py:996] (2/4) Epoch 7, batch 7800, loss[loss=0.3268, simple_loss=0.3892, pruned_loss=0.1322, over 21461.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3292, pruned_loss=0.08829, over 4288128.85 frames. ], batch size: 471, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:42:26,181 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.21 vs. limit=15.0 2023-06-22 04:42:33,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1144842.0, ans=0.125 2023-06-22 04:42:37,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.647e+02 3.499e+02 4.174e+02 5.699e+02 9.171e+02, threshold=8.349e+02, percent-clipped=6.0 2023-06-22 04:42:49,093 INFO [train.py:996] (2/4) Epoch 7, batch 7850, loss[loss=0.244, simple_loss=0.3028, pruned_loss=0.09264, over 22001.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3196, pruned_loss=0.0865, over 4279641.95 frames. ], batch size: 103, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:43:05,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1144902.0, ans=0.125 2023-06-22 04:43:10,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1144902.0, ans=0.125 2023-06-22 04:43:17,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1144962.0, ans=0.125 2023-06-22 04:43:31,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1145022.0, ans=0.0 2023-06-22 04:44:03,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1145142.0, ans=0.125 2023-06-22 04:44:20,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1145142.0, ans=0.125 2023-06-22 04:44:20,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1145142.0, ans=0.125 2023-06-22 04:44:41,140 INFO [train.py:996] (2/4) Epoch 7, batch 7900, loss[loss=0.2638, simple_loss=0.3528, pruned_loss=0.08742, over 21600.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3154, pruned_loss=0.08639, over 4267599.96 frames. 
], batch size: 389, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:45:03,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1145262.0, ans=0.125 2023-06-22 04:46:15,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-22 04:46:17,642 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.334e+02 3.136e+02 3.570e+02 4.500e+02 9.857e+02, threshold=7.139e+02, percent-clipped=1.0 2023-06-22 04:46:23,930 INFO [train.py:996] (2/4) Epoch 7, batch 7950, loss[loss=0.327, simple_loss=0.3919, pruned_loss=0.1311, over 21534.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3179, pruned_loss=0.08499, over 4263932.89 frames. ], batch size: 507, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:46:51,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1145562.0, ans=0.125 2023-06-22 04:46:57,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-22 04:47:21,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1145682.0, ans=0.125 2023-06-22 04:47:22,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.21 vs. limit=15.0 2023-06-22 04:47:49,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1145742.0, ans=0.0 2023-06-22 04:47:53,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=1145742.0, ans=0.02 2023-06-22 04:48:04,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1145802.0, ans=0.2 2023-06-22 04:48:05,912 INFO [train.py:996] (2/4) Epoch 7, batch 8000, loss[loss=0.3378, simple_loss=0.4081, pruned_loss=0.1337, over 21481.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3231, pruned_loss=0.08744, over 4265958.75 frames. ], batch size: 471, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:49:43,956 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.592e+02 3.352e+02 3.873e+02 5.175e+02 9.395e+02, threshold=7.746e+02, percent-clipped=4.0 2023-06-22 04:49:50,529 INFO [train.py:996] (2/4) Epoch 7, batch 8050, loss[loss=0.1989, simple_loss=0.2685, pruned_loss=0.06461, over 21410.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3242, pruned_loss=0.08728, over 4260701.87 frames. ], batch size: 194, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 04:49:56,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. limit=10.0 2023-06-22 04:50:19,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1146162.0, ans=0.0 2023-06-22 04:51:23,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-22 04:51:31,142 INFO [train.py:996] (2/4) Epoch 7, batch 8100, loss[loss=0.2075, simple_loss=0.2839, pruned_loss=0.06549, over 21152.00 frames. 
], tot_loss[loss=0.2511, simple_loss=0.3252, pruned_loss=0.08851, over 4260958.92 frames. ], batch size: 607, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:51:34,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1146402.0, ans=0.0 2023-06-22 04:51:53,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1146402.0, ans=0.2 2023-06-22 04:52:28,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1146522.0, ans=0.125 2023-06-22 04:53:00,887 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=15.0 2023-06-22 04:53:20,128 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.528e+02 3.325e+02 3.912e+02 5.287e+02 8.623e+02, threshold=7.823e+02, percent-clipped=4.0 2023-06-22 04:53:22,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1146642.0, ans=0.1 2023-06-22 04:53:29,623 INFO [train.py:996] (2/4) Epoch 7, batch 8150, loss[loss=0.207, simple_loss=0.2879, pruned_loss=0.06303, over 21527.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3294, pruned_loss=0.0889, over 4267120.60 frames. ], batch size: 212, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:54:03,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1146762.0, ans=0.0 2023-06-22 04:54:06,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1146822.0, ans=0.0 2023-06-22 04:54:16,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.40 vs. limit=15.0 2023-06-22 04:54:21,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1146882.0, ans=0.125 2023-06-22 04:55:08,812 INFO [train.py:996] (2/4) Epoch 7, batch 8200, loss[loss=0.2687, simple_loss=0.3182, pruned_loss=0.1096, over 21554.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3237, pruned_loss=0.08676, over 4265604.91 frames. ], batch size: 442, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:56:23,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1147242.0, ans=0.2 2023-06-22 04:56:38,808 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.189e+02 2.942e+02 3.673e+02 4.818e+02 8.671e+02, threshold=7.346e+02, percent-clipped=2.0 2023-06-22 04:56:48,453 INFO [train.py:996] (2/4) Epoch 7, batch 8250, loss[loss=0.2202, simple_loss=0.3133, pruned_loss=0.06358, over 21428.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.322, pruned_loss=0.08588, over 4264204.11 frames. ], batch size: 211, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:57:37,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1147422.0, ans=0.125 2023-06-22 04:58:11,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. 
limit=15.0 2023-06-22 04:58:28,866 INFO [train.py:996] (2/4) Epoch 7, batch 8300, loss[loss=0.2271, simple_loss=0.2989, pruned_loss=0.07765, over 21266.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3214, pruned_loss=0.08392, over 4267898.53 frames. ], batch size: 176, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 04:58:46,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1147602.0, ans=0.125 2023-06-22 04:59:17,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1147722.0, ans=0.125 2023-06-22 04:59:29,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1147782.0, ans=0.125 2023-06-22 04:59:46,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1147782.0, ans=0.125 2023-06-22 05:00:04,377 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.266e+02 2.891e+02 3.462e+02 4.321e+02 7.253e+02, threshold=6.923e+02, percent-clipped=0.0 2023-06-22 05:00:14,271 INFO [train.py:996] (2/4) Epoch 7, batch 8350, loss[loss=0.2064, simple_loss=0.2869, pruned_loss=0.06297, over 21080.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3193, pruned_loss=0.08155, over 4260746.68 frames. ], batch size: 143, lr: 4.39e-03, grad_scale: 16.0 2023-06-22 05:00:19,013 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.43 vs. limit=10.0 2023-06-22 05:00:19,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-22 05:00:29,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1147962.0, ans=0.125 2023-06-22 05:00:53,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1148022.0, ans=0.05 2023-06-22 05:01:49,554 INFO [train.py:996] (2/4) Epoch 7, batch 8400, loss[loss=0.2466, simple_loss=0.3777, pruned_loss=0.05776, over 20798.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.316, pruned_loss=0.0788, over 4250691.22 frames. ], batch size: 607, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 05:01:50,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1148202.0, ans=0.125 2023-06-22 05:02:18,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1148262.0, ans=0.1 2023-06-22 05:03:13,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1148442.0, ans=0.0 2023-06-22 05:03:23,502 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.056e+02 2.841e+02 3.519e+02 4.158e+02 9.923e+02, threshold=7.039e+02, percent-clipped=4.0 2023-06-22 05:03:28,529 INFO [train.py:996] (2/4) Epoch 7, batch 8450, loss[loss=0.2757, simple_loss=0.3843, pruned_loss=0.0835, over 20906.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3152, pruned_loss=0.07887, over 4261527.32 frames. 
], batch size: 607, lr: 4.39e-03, grad_scale: 32.0 2023-06-22 05:03:55,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-22 05:03:58,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=12.0 2023-06-22 05:04:02,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-22 05:04:08,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1148622.0, ans=0.0 2023-06-22 05:04:10,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5 2023-06-22 05:05:07,091 INFO [train.py:996] (2/4) Epoch 7, batch 8500, loss[loss=0.2139, simple_loss=0.2803, pruned_loss=0.07374, over 21721.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3111, pruned_loss=0.08025, over 4258499.05 frames. ], batch size: 351, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:05:19,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.09 vs. limit=15.0 2023-06-22 05:06:16,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1148982.0, ans=0.0 2023-06-22 05:06:35,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1149042.0, ans=0.5 2023-06-22 05:06:43,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.460e+02 3.101e+02 3.750e+02 4.685e+02 7.391e+02, threshold=7.500e+02, percent-clipped=2.0 2023-06-22 05:06:47,858 INFO [train.py:996] (2/4) Epoch 7, batch 8550, loss[loss=0.2964, simple_loss=0.3797, pruned_loss=0.1065, over 21673.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.316, pruned_loss=0.08378, over 4255942.37 frames. ], batch size: 441, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:07:09,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1149162.0, ans=0.125 2023-06-22 05:07:26,565 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=10.63 vs. limit=15.0 2023-06-22 05:08:13,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1149342.0, ans=0.0 2023-06-22 05:08:28,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1149402.0, ans=0.1 2023-06-22 05:08:29,780 INFO [train.py:996] (2/4) Epoch 7, batch 8600, loss[loss=0.2954, simple_loss=0.3859, pruned_loss=0.1024, over 21289.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3217, pruned_loss=0.086, over 4260619.09 frames. 
], batch size: 548, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:08:46,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1149462.0, ans=0.125 2023-06-22 05:09:08,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1149522.0, ans=0.1 2023-06-22 05:10:02,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1149642.0, ans=0.125 2023-06-22 05:10:06,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.931e+02 3.158e+02 3.954e+02 4.821e+02 7.985e+02, threshold=7.909e+02, percent-clipped=1.0 2023-06-22 05:10:11,785 INFO [train.py:996] (2/4) Epoch 7, batch 8650, loss[loss=0.1737, simple_loss=0.2742, pruned_loss=0.0366, over 21776.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3265, pruned_loss=0.08626, over 4265683.86 frames. ], batch size: 282, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:10:31,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1149762.0, ans=0.2 2023-06-22 05:10:48,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1149822.0, ans=0.0 2023-06-22 05:11:07,024 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-22 05:11:29,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1149942.0, ans=0.07 2023-06-22 05:11:46,151 INFO [train.py:996] (2/4) Epoch 7, batch 8700, loss[loss=0.2377, simple_loss=0.2939, pruned_loss=0.09076, over 21455.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3204, pruned_loss=0.08411, over 4257104.19 frames. ], batch size: 389, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:11:48,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.82 vs. limit=22.5 2023-06-22 05:11:50,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.95 vs. limit=6.0 2023-06-22 05:12:11,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1150062.0, ans=0.1 2023-06-22 05:12:34,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1150122.0, ans=0.07 2023-06-22 05:13:07,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1150242.0, ans=0.125 2023-06-22 05:13:21,473 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.066e+02 2.826e+02 3.648e+02 4.622e+02 7.671e+02, threshold=7.296e+02, percent-clipped=0.0 2023-06-22 05:13:24,955 INFO [train.py:996] (2/4) Epoch 7, batch 8750, loss[loss=0.2673, simple_loss=0.326, pruned_loss=0.1043, over 21721.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3161, pruned_loss=0.08463, over 4265445.43 frames. 
], batch size: 389, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:13:25,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1150302.0, ans=0.1 2023-06-22 05:13:45,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1150362.0, ans=0.1 2023-06-22 05:13:58,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1150362.0, ans=0.2 2023-06-22 05:14:42,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1150482.0, ans=0.1 2023-06-22 05:15:06,508 INFO [train.py:996] (2/4) Epoch 7, batch 8800, loss[loss=0.2951, simple_loss=0.3658, pruned_loss=0.1122, over 21826.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3245, pruned_loss=0.08759, over 4269986.57 frames. ], batch size: 124, lr: 4.38e-03, grad_scale: 32.0 2023-06-22 05:15:26,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-22 05:15:40,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1150662.0, ans=0.125 2023-06-22 05:16:09,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1150722.0, ans=0.125 2023-06-22 05:16:14,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1150782.0, ans=0.0 2023-06-22 05:16:23,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1150782.0, ans=0.125 2023-06-22 05:16:45,286 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.245e+02 3.569e+02 4.667e+02 6.070e+02 1.023e+03, threshold=9.335e+02, percent-clipped=11.0 2023-06-22 05:16:45,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1150902.0, ans=0.125 2023-06-22 05:16:45,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1150902.0, ans=0.0 2023-06-22 05:16:47,132 INFO [train.py:996] (2/4) Epoch 7, batch 8850, loss[loss=0.208, simple_loss=0.2997, pruned_loss=0.05814, over 21669.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3296, pruned_loss=0.08928, over 4271031.46 frames. ], batch size: 247, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:17:12,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1150962.0, ans=0.125 2023-06-22 05:17:25,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1150962.0, ans=0.05 2023-06-22 05:17:34,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1150962.0, ans=0.125 2023-06-22 05:18:33,240 INFO [train.py:996] (2/4) Epoch 7, batch 8900, loss[loss=0.2185, simple_loss=0.2957, pruned_loss=0.07065, over 21624.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3231, pruned_loss=0.08809, over 4268445.40 frames. 
], batch size: 247, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:18:57,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.98 vs. limit=22.5 2023-06-22 05:19:11,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1151262.0, ans=0.125 2023-06-22 05:19:12,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-22 05:19:19,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1151322.0, ans=0.125 2023-06-22 05:20:19,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.570e+02 3.272e+02 4.158e+02 4.869e+02 9.581e+02, threshold=8.315e+02, percent-clipped=1.0 2023-06-22 05:20:19,291 INFO [train.py:996] (2/4) Epoch 7, batch 8950, loss[loss=0.2604, simple_loss=0.3325, pruned_loss=0.09416, over 21890.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3231, pruned_loss=0.0872, over 4273672.91 frames. ], batch size: 372, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:20:44,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0 2023-06-22 05:21:08,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-06-22 05:21:35,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1151682.0, ans=0.2 2023-06-22 05:21:53,353 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-22 05:21:58,735 INFO [train.py:996] (2/4) Epoch 7, batch 9000, loss[loss=0.2492, simple_loss=0.3035, pruned_loss=0.09746, over 22002.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.318, pruned_loss=0.08687, over 4275047.74 frames. ], batch size: 103, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:21:58,735 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 05:22:20,477 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2667, simple_loss=0.3612, pruned_loss=0.08614, over 1796401.00 frames. 2023-06-22 05:22:20,478 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 05:22:56,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1151922.0, ans=0.125 2023-06-22 05:23:54,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1152042.0, ans=0.125 2023-06-22 05:23:57,237 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.260e+02 2.838e+02 3.460e+02 4.383e+02 1.064e+03, threshold=6.920e+02, percent-clipped=1.0 2023-06-22 05:23:57,258 INFO [train.py:996] (2/4) Epoch 7, batch 9050, loss[loss=0.2768, simple_loss=0.347, pruned_loss=0.1033, over 21727.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3144, pruned_loss=0.08381, over 4273264.65 frames. 
], batch size: 441, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:24:22,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1152162.0, ans=0.125 2023-06-22 05:24:59,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1152282.0, ans=0.09899494936611666 2023-06-22 05:25:37,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1152402.0, ans=0.125 2023-06-22 05:25:38,178 INFO [train.py:996] (2/4) Epoch 7, batch 9100, loss[loss=0.3217, simple_loss=0.3865, pruned_loss=0.1285, over 21441.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.322, pruned_loss=0.08676, over 4271280.81 frames. ], batch size: 471, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:26:01,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1152462.0, ans=0.125 2023-06-22 05:26:17,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1152522.0, ans=0.0 2023-06-22 05:27:04,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1152642.0, ans=0.2 2023-06-22 05:27:05,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1152642.0, ans=0.0 2023-06-22 05:27:18,305 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.380e+02 3.226e+02 3.929e+02 4.790e+02 9.193e+02, threshold=7.858e+02, percent-clipped=7.0 2023-06-22 05:27:18,326 INFO [train.py:996] (2/4) Epoch 7, batch 9150, loss[loss=0.2581, simple_loss=0.3476, pruned_loss=0.08427, over 21792.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3247, pruned_loss=0.08442, over 4268010.14 frames. ], batch size: 351, lr: 4.38e-03, grad_scale: 8.0 2023-06-22 05:27:31,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1152702.0, ans=0.0 2023-06-22 05:27:35,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1152762.0, ans=0.0 2023-06-22 05:28:37,610 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:28:57,907 INFO [train.py:996] (2/4) Epoch 7, batch 9200, loss[loss=0.2627, simple_loss=0.348, pruned_loss=0.08872, over 21709.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3278, pruned_loss=0.08356, over 4272338.93 frames. ], batch size: 351, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:29:23,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1153062.0, ans=10.0 2023-06-22 05:29:25,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.58 vs. 
limit=22.5 2023-06-22 05:29:34,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1153062.0, ans=0.2 2023-06-22 05:29:48,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1153122.0, ans=0.125 2023-06-22 05:30:18,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1153182.0, ans=0.125 2023-06-22 05:30:32,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1153242.0, ans=0.0 2023-06-22 05:30:38,528 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.323e+02 3.168e+02 3.950e+02 4.666e+02 8.453e+02, threshold=7.900e+02, percent-clipped=2.0 2023-06-22 05:30:38,549 INFO [train.py:996] (2/4) Epoch 7, batch 9250, loss[loss=0.2234, simple_loss=0.2961, pruned_loss=0.07538, over 21758.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3293, pruned_loss=0.08575, over 4266665.93 frames. ], batch size: 282, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:30:53,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1153302.0, ans=0.125 2023-06-22 05:31:21,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.26 vs. limit=22.5 2023-06-22 05:31:25,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-22 05:32:24,616 INFO [train.py:996] (2/4) Epoch 7, batch 9300, loss[loss=0.2448, simple_loss=0.3137, pruned_loss=0.08792, over 19998.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3231, pruned_loss=0.08562, over 4268559.95 frames. ], batch size: 702, lr: 4.38e-03, grad_scale: 16.0 2023-06-22 05:33:02,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.76 vs. limit=10.0 2023-06-22 05:33:03,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1153722.0, ans=0.0 2023-06-22 05:33:59,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-22 05:34:11,589 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.281e+02 3.246e+02 3.753e+02 4.890e+02 7.964e+02, threshold=7.506e+02, percent-clipped=1.0 2023-06-22 05:34:11,610 INFO [train.py:996] (2/4) Epoch 7, batch 9350, loss[loss=0.2322, simple_loss=0.3494, pruned_loss=0.05754, over 20718.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.329, pruned_loss=0.0858, over 4261402.29 frames. ], batch size: 607, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:34:12,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. 
limit=6.0 2023-06-22 05:34:53,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1154022.0, ans=0.95 2023-06-22 05:35:15,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1154082.0, ans=0.04949747468305833 2023-06-22 05:35:41,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-22 05:35:47,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1154142.0, ans=0.0 2023-06-22 05:35:52,119 INFO [train.py:996] (2/4) Epoch 7, batch 9400, loss[loss=0.2399, simple_loss=0.2925, pruned_loss=0.09366, over 21365.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3297, pruned_loss=0.08625, over 4270520.44 frames. ], batch size: 194, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:36:14,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1154262.0, ans=0.125 2023-06-22 05:36:16,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-22 05:36:39,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=22.5 2023-06-22 05:37:32,761 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.390e+02 3.206e+02 3.638e+02 4.538e+02 8.694e+02, threshold=7.276e+02, percent-clipped=2.0 2023-06-22 05:37:32,785 INFO [train.py:996] (2/4) Epoch 7, batch 9450, loss[loss=0.1861, simple_loss=0.2469, pruned_loss=0.06268, over 21354.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3221, pruned_loss=0.08562, over 4274153.88 frames. ], batch size: 551, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:37:57,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1154562.0, ans=0.0 2023-06-22 05:38:25,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1154622.0, ans=0.0 2023-06-22 05:38:47,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=22.5 2023-06-22 05:38:51,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1154742.0, ans=0.125 2023-06-22 05:39:11,629 INFO [train.py:996] (2/4) Epoch 7, batch 9500, loss[loss=0.1972, simple_loss=0.2827, pruned_loss=0.05583, over 21728.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3122, pruned_loss=0.08294, over 4274181.29 frames. 
], batch size: 247, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:39:12,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1154802.0, ans=0.125 2023-06-22 05:40:22,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1154982.0, ans=0.125 2023-06-22 05:40:46,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1155042.0, ans=0.0 2023-06-22 05:40:51,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-22 05:40:52,007 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.257e+02 3.921e+02 4.943e+02 1.018e+03, threshold=7.842e+02, percent-clipped=7.0 2023-06-22 05:40:52,028 INFO [train.py:996] (2/4) Epoch 7, batch 9550, loss[loss=0.2908, simple_loss=0.3483, pruned_loss=0.1166, over 21392.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3193, pruned_loss=0.08525, over 4275254.85 frames. ], batch size: 549, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:41:33,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1155222.0, ans=0.2 2023-06-22 05:41:44,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1155282.0, ans=0.125 2023-06-22 05:42:26,551 INFO [train.py:996] (2/4) Epoch 7, batch 9600, loss[loss=0.255, simple_loss=0.3539, pruned_loss=0.07802, over 20777.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3222, pruned_loss=0.0873, over 4280687.20 frames. ], batch size: 607, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:42:34,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff2.min_abs, batch_count=1155402.0, ans=0.1 2023-06-22 05:43:13,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1155522.0, ans=0.125 2023-06-22 05:43:33,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.61 vs. limit=15.0 2023-06-22 05:44:06,659 INFO [train.py:996] (2/4) Epoch 7, batch 9650, loss[loss=0.2989, simple_loss=0.3682, pruned_loss=0.1148, over 21484.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3221, pruned_loss=0.08785, over 4285764.64 frames. ], batch size: 131, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:44:08,110 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.313e+02 3.176e+02 3.740e+02 4.596e+02 7.915e+02, threshold=7.479e+02, percent-clipped=1.0 2023-06-22 05:44:32,380 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.79 vs. 
limit=15.0 2023-06-22 05:44:57,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1155822.0, ans=0.0 2023-06-22 05:44:58,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1155822.0, ans=0.125 2023-06-22 05:45:31,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1155942.0, ans=0.2 2023-06-22 05:45:51,572 INFO [train.py:996] (2/4) Epoch 7, batch 9700, loss[loss=0.2087, simple_loss=0.2861, pruned_loss=0.06562, over 21360.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3255, pruned_loss=0.0887, over 4288627.19 frames. ], batch size: 159, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:46:04,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-22 05:46:14,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1156062.0, ans=0.125 2023-06-22 05:46:19,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1156062.0, ans=0.125 2023-06-22 05:46:45,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1156122.0, ans=0.125 2023-06-22 05:47:02,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1156182.0, ans=0.2 2023-06-22 05:47:35,169 INFO [train.py:996] (2/4) Epoch 7, batch 9750, loss[loss=0.2368, simple_loss=0.2861, pruned_loss=0.0938, over 21464.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3188, pruned_loss=0.08731, over 4289142.43 frames. ], batch size: 476, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:47:36,493 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.485e+02 3.111e+02 3.618e+02 4.143e+02 7.836e+02, threshold=7.236e+02, percent-clipped=1.0 2023-06-22 05:48:11,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1156422.0, ans=0.0 2023-06-22 05:48:11,329 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 05:48:24,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1156422.0, ans=0.125 2023-06-22 05:48:50,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1156542.0, ans=0.1 2023-06-22 05:48:59,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1156542.0, ans=0.1 2023-06-22 05:49:08,229 INFO [train.py:996] (2/4) Epoch 7, batch 9800, loss[loss=0.2143, simple_loss=0.289, pruned_loss=0.0698, over 20091.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3182, pruned_loss=0.08669, over 4271964.65 frames. ], batch size: 703, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:50:41,991 INFO [train.py:996] (2/4) Epoch 7, batch 9850, loss[loss=0.2444, simple_loss=0.3082, pruned_loss=0.09029, over 21889.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.315, pruned_loss=0.08711, over 4263489.67 frames. 
], batch size: 333, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:50:43,430 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.297e+02 3.145e+02 3.713e+02 4.993e+02 9.640e+02, threshold=7.425e+02, percent-clipped=7.0 2023-06-22 05:52:21,246 INFO [train.py:996] (2/4) Epoch 7, batch 9900, loss[loss=0.2288, simple_loss=0.3046, pruned_loss=0.07651, over 21424.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3107, pruned_loss=0.08669, over 4261184.88 frames. ], batch size: 211, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:52:49,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1157262.0, ans=0.0 2023-06-22 05:53:54,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1157442.0, ans=15.0 2023-06-22 05:54:06,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.35 vs. limit=15.0 2023-06-22 05:54:06,746 INFO [train.py:996] (2/4) Epoch 7, batch 9950, loss[loss=0.2114, simple_loss=0.277, pruned_loss=0.07289, over 21565.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3124, pruned_loss=0.08899, over 4263535.33 frames. ], batch size: 263, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:54:08,112 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.677e+02 3.173e+02 3.693e+02 4.396e+02 6.940e+02, threshold=7.386e+02, percent-clipped=0.0 2023-06-22 05:54:21,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1157502.0, ans=0.2 2023-06-22 05:54:55,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1157622.0, ans=0.09899494936611666 2023-06-22 05:55:02,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1157622.0, ans=0.2 2023-06-22 05:55:24,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1157682.0, ans=0.5 2023-06-22 05:55:45,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1157742.0, ans=0.2 2023-06-22 05:55:54,202 INFO [train.py:996] (2/4) Epoch 7, batch 10000, loss[loss=0.2411, simple_loss=0.3169, pruned_loss=0.08264, over 21367.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3073, pruned_loss=0.08673, over 4270588.90 frames. ], batch size: 549, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:55:58,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1157802.0, ans=0.0 2023-06-22 05:56:12,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1157862.0, ans=0.125 2023-06-22 05:56:19,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.37 vs. limit=10.0 2023-06-22 05:56:34,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=15.0 2023-06-22 05:56:37,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.49 vs. 
limit=22.5 2023-06-22 05:56:53,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=15.0 2023-06-22 05:57:35,225 INFO [train.py:996] (2/4) Epoch 7, batch 10050, loss[loss=0.3423, simple_loss=0.3911, pruned_loss=0.1468, over 21436.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3112, pruned_loss=0.08809, over 4270282.81 frames. ], batch size: 509, lr: 4.37e-03, grad_scale: 32.0 2023-06-22 05:57:36,646 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.289e+02 3.023e+02 3.455e+02 4.258e+02 6.801e+02, threshold=6.910e+02, percent-clipped=0.0 2023-06-22 05:58:26,682 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.19 vs. limit=15.0 2023-06-22 05:58:52,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1158282.0, ans=0.125 2023-06-22 05:59:15,532 INFO [train.py:996] (2/4) Epoch 7, batch 10100, loss[loss=0.2017, simple_loss=0.2855, pruned_loss=0.05898, over 20881.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3082, pruned_loss=0.08539, over 4276358.20 frames. ], batch size: 608, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 05:59:23,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.79 vs. limit=10.0 2023-06-22 05:59:25,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1158402.0, ans=0.0 2023-06-22 05:59:25,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1158402.0, ans=0.2 2023-06-22 06:00:44,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1158642.0, ans=0.0 2023-06-22 06:00:55,639 INFO [train.py:996] (2/4) Epoch 7, batch 10150, loss[loss=0.2759, simple_loss=0.3477, pruned_loss=0.102, over 21805.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.314, pruned_loss=0.08662, over 4265773.47 frames. ], batch size: 282, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 06:00:58,867 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.393e+02 3.244e+02 3.861e+02 4.882e+02 7.298e+02, threshold=7.722e+02, percent-clipped=2.0 2023-06-22 06:02:35,504 INFO [train.py:996] (2/4) Epoch 7, batch 10200, loss[loss=0.2375, simple_loss=0.32, pruned_loss=0.07751, over 21705.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3117, pruned_loss=0.08409, over 4271067.40 frames. ], batch size: 415, lr: 4.37e-03, grad_scale: 16.0 2023-06-22 06:03:03,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1159062.0, ans=0.0 2023-06-22 06:03:13,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.63 vs. 
limit=22.5 2023-06-22 06:03:43,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1159182.0, ans=0.125 2023-06-22 06:04:00,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1159242.0, ans=0.1 2023-06-22 06:04:13,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-22 06:04:14,094 INFO [train.py:996] (2/4) Epoch 7, batch 10250, loss[loss=0.202, simple_loss=0.2933, pruned_loss=0.05537, over 21865.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3067, pruned_loss=0.07868, over 4271441.12 frames. ], batch size: 372, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:04:17,127 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.009e+02 2.655e+02 3.086e+02 4.104e+02 7.872e+02, threshold=6.172e+02, percent-clipped=2.0 2023-06-22 06:04:24,462 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:04:35,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-22 06:05:02,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1159422.0, ans=0.2 2023-06-22 06:05:30,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1159482.0, ans=0.125 2023-06-22 06:05:44,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1159542.0, ans=0.0 2023-06-22 06:06:01,463 INFO [train.py:996] (2/4) Epoch 7, batch 10300, loss[loss=0.2477, simple_loss=0.3198, pruned_loss=0.08778, over 19983.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3114, pruned_loss=0.08106, over 4271462.99 frames. ], batch size: 703, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:06:18,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1159602.0, ans=0.0 2023-06-22 06:06:49,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1159722.0, ans=0.04949747468305833 2023-06-22 06:07:44,242 INFO [train.py:996] (2/4) Epoch 7, batch 10350, loss[loss=0.1759, simple_loss=0.2246, pruned_loss=0.06366, over 21243.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3115, pruned_loss=0.08042, over 4272122.02 frames. ], batch size: 143, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:07:47,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.370e+02 3.368e+02 3.957e+02 4.921e+02 8.307e+02, threshold=7.914e+02, percent-clipped=7.0 2023-06-22 06:08:04,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1159962.0, ans=0.0 2023-06-22 06:08:12,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1159962.0, ans=0.07 2023-06-22 06:08:23,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.18 vs. 
limit=8.0 2023-06-22 06:08:46,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1160082.0, ans=0.0 2023-06-22 06:08:50,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1160082.0, ans=0.125 2023-06-22 06:09:31,075 INFO [train.py:996] (2/4) Epoch 7, batch 10400, loss[loss=0.1943, simple_loss=0.2455, pruned_loss=0.07151, over 21296.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3051, pruned_loss=0.07914, over 4262524.90 frames. ], batch size: 176, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:09:36,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1160202.0, ans=0.125 2023-06-22 06:11:13,825 INFO [train.py:996] (2/4) Epoch 7, batch 10450, loss[loss=0.2379, simple_loss=0.3412, pruned_loss=0.06731, over 20826.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3108, pruned_loss=0.08229, over 4265958.18 frames. ], batch size: 608, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:11:16,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.616e+02 3.263e+02 3.725e+02 4.769e+02 8.321e+02, threshold=7.450e+02, percent-clipped=2.0 2023-06-22 06:11:25,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1160502.0, ans=0.125 2023-06-22 06:12:08,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1160622.0, ans=0.125 2023-06-22 06:12:45,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1160742.0, ans=0.0 2023-06-22 06:12:58,240 INFO [train.py:996] (2/4) Epoch 7, batch 10500, loss[loss=0.2025, simple_loss=0.2694, pruned_loss=0.06774, over 21658.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.311, pruned_loss=0.08118, over 4259321.88 frames. ], batch size: 298, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:13:47,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1160922.0, ans=0.1 2023-06-22 06:14:27,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1161042.0, ans=0.125 2023-06-22 06:14:28,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1161042.0, ans=0.125 2023-06-22 06:14:37,815 INFO [train.py:996] (2/4) Epoch 7, batch 10550, loss[loss=0.2598, simple_loss=0.2974, pruned_loss=0.1111, over 21307.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3055, pruned_loss=0.08095, over 4268869.80 frames. ], batch size: 507, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:14:40,895 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.300e+02 2.934e+02 3.554e+02 4.294e+02 7.411e+02, threshold=7.109e+02, percent-clipped=0.0 2023-06-22 06:14:46,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1161102.0, ans=0.1 2023-06-22 06:15:30,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1161222.0, ans=0.04949747468305833 2023-06-22 06:16:19,493 INFO [train.py:996] (2/4) Epoch 7, batch 10600, loss[loss=0.1926, simple_loss=0.2571, pruned_loss=0.0641, over 15371.00 frames. 
], tot_loss[loss=0.2288, simple_loss=0.2998, pruned_loss=0.07888, over 4264183.69 frames. ], batch size: 62, lr: 4.36e-03, grad_scale: 32.0 2023-06-22 06:16:35,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.53 vs. limit=15.0 2023-06-22 06:17:23,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-22 06:17:50,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1161642.0, ans=0.5 2023-06-22 06:18:06,264 INFO [train.py:996] (2/4) Epoch 7, batch 10650, loss[loss=0.213, simple_loss=0.335, pruned_loss=0.04546, over 19979.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3048, pruned_loss=0.0784, over 4270905.26 frames. ], batch size: 703, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:18:11,058 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.321e+02 3.046e+02 3.763e+02 4.720e+02 8.386e+02, threshold=7.526e+02, percent-clipped=4.0 2023-06-22 06:18:21,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1161762.0, ans=0.2 2023-06-22 06:18:40,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1161762.0, ans=0.0 2023-06-22 06:18:58,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1161822.0, ans=0.0 2023-06-22 06:19:46,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1162002.0, ans=0.0 2023-06-22 06:19:47,763 INFO [train.py:996] (2/4) Epoch 7, batch 10700, loss[loss=0.2756, simple_loss=0.3484, pruned_loss=0.1015, over 21404.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3043, pruned_loss=0.0782, over 4261453.07 frames. ], batch size: 131, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:20:13,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1162062.0, ans=0.125 2023-06-22 06:20:31,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1162122.0, ans=0.1 2023-06-22 06:20:31,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1162122.0, ans=0.025 2023-06-22 06:21:06,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1162182.0, ans=0.125 2023-06-22 06:21:17,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.63 vs. limit=10.0 2023-06-22 06:21:30,722 INFO [train.py:996] (2/4) Epoch 7, batch 10750, loss[loss=0.2701, simple_loss=0.3626, pruned_loss=0.08884, over 21762.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3155, pruned_loss=0.08309, over 4264479.10 frames. ], batch size: 332, lr: 4.36e-03, grad_scale: 8.0 2023-06-22 06:21:33,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.83 vs. 
limit=22.5 2023-06-22 06:21:42,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.984e+02 3.648e+02 4.416e+02 6.142e+02 1.061e+03, threshold=8.833e+02, percent-clipped=11.0 2023-06-22 06:21:43,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1162302.0, ans=0.1 2023-06-22 06:23:17,162 INFO [train.py:996] (2/4) Epoch 7, batch 10800, loss[loss=0.241, simple_loss=0.3178, pruned_loss=0.08213, over 21561.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3189, pruned_loss=0.08341, over 4263437.75 frames. ], batch size: 230, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:23:17,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1162602.0, ans=0.125 2023-06-22 06:23:29,366 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.95 vs. limit=15.0 2023-06-22 06:23:52,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1162722.0, ans=0.02 2023-06-22 06:23:53,172 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-22 06:23:56,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. limit=6.0 2023-06-22 06:24:03,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1162722.0, ans=0.0 2023-06-22 06:24:56,591 INFO [train.py:996] (2/4) Epoch 7, batch 10850, loss[loss=0.2325, simple_loss=0.2911, pruned_loss=0.08695, over 21136.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3191, pruned_loss=0.08386, over 4258629.31 frames. ], batch size: 143, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:25:07,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.223e+02 3.182e+02 4.104e+02 5.003e+02 8.249e+02, threshold=8.208e+02, percent-clipped=0.0 2023-06-22 06:25:09,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1162902.0, ans=0.125 2023-06-22 06:25:26,031 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:25:40,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-22 06:26:06,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.44 vs. limit=15.0 2023-06-22 06:26:12,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-22 06:26:28,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-22 06:26:41,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. 
limit=15.0 2023-06-22 06:26:41,642 INFO [train.py:996] (2/4) Epoch 7, batch 10900, loss[loss=0.2152, simple_loss=0.337, pruned_loss=0.04669, over 21198.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3119, pruned_loss=0.08212, over 4258897.95 frames. ], batch size: 548, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:26:45,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1163202.0, ans=0.04949747468305833 2023-06-22 06:27:01,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1163262.0, ans=0.125 2023-06-22 06:27:23,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1163322.0, ans=0.0 2023-06-22 06:27:25,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1163322.0, ans=0.125 2023-06-22 06:27:39,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1163382.0, ans=0.125 2023-06-22 06:27:58,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.26 vs. limit=15.0 2023-06-22 06:28:16,271 INFO [train.py:996] (2/4) Epoch 7, batch 10950, loss[loss=0.217, simple_loss=0.2774, pruned_loss=0.07829, over 21229.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3083, pruned_loss=0.08087, over 4266493.47 frames. ], batch size: 144, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:28:27,456 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.461e+02 3.282e+02 3.918e+02 4.735e+02 6.803e+02, threshold=7.835e+02, percent-clipped=0.0 2023-06-22 06:29:03,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1163622.0, ans=0.125 2023-06-22 06:29:50,120 INFO [train.py:996] (2/4) Epoch 7, batch 11000, loss[loss=0.3193, simple_loss=0.3567, pruned_loss=0.1409, over 21772.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3072, pruned_loss=0.08148, over 4256917.40 frames. ], batch size: 508, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:29:52,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1163802.0, ans=0.0 2023-06-22 06:30:00,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1163802.0, ans=0.125 2023-06-22 06:30:41,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1163922.0, ans=0.125 2023-06-22 06:31:30,317 INFO [train.py:996] (2/4) Epoch 7, batch 11050, loss[loss=0.2244, simple_loss=0.277, pruned_loss=0.08587, over 21283.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3051, pruned_loss=0.08331, over 4259821.72 frames. ], batch size: 548, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:31:40,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.344e+02 3.144e+02 3.658e+02 4.347e+02 7.948e+02, threshold=7.316e+02, percent-clipped=1.0 2023-06-22 06:31:53,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1164162.0, ans=0.1 2023-06-22 06:32:19,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. 
limit=15.0 2023-06-22 06:32:32,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1164282.0, ans=0.2 2023-06-22 06:33:02,551 INFO [train.py:996] (2/4) Epoch 7, batch 11100, loss[loss=0.2022, simple_loss=0.2712, pruned_loss=0.06663, over 21800.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3046, pruned_loss=0.08377, over 4245108.14 frames. ], batch size: 112, lr: 4.36e-03, grad_scale: 16.0 2023-06-22 06:33:58,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1164522.0, ans=0.0 2023-06-22 06:34:39,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1164642.0, ans=0.125 2023-06-22 06:34:42,426 INFO [train.py:996] (2/4) Epoch 7, batch 11150, loss[loss=0.2277, simple_loss=0.3155, pruned_loss=0.06994, over 21585.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.303, pruned_loss=0.0832, over 4239850.02 frames. ], batch size: 414, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:34:48,747 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.401e+02 2.924e+02 3.298e+02 3.958e+02 6.309e+02, threshold=6.596e+02, percent-clipped=0.0 2023-06-22 06:34:49,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1164702.0, ans=0.1 2023-06-22 06:35:04,205 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-22 06:35:44,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1164882.0, ans=0.125 2023-06-22 06:36:16,922 INFO [train.py:996] (2/4) Epoch 7, batch 11200, loss[loss=0.2404, simple_loss=0.2954, pruned_loss=0.09265, over 21217.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3004, pruned_loss=0.08253, over 4227833.76 frames. ], batch size: 471, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:36:35,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1165002.0, ans=0.2 2023-06-22 06:36:56,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1165062.0, ans=0.0 2023-06-22 06:37:12,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1165122.0, ans=0.125 2023-06-22 06:37:24,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1165182.0, ans=0.09899494936611666 2023-06-22 06:37:52,318 INFO [train.py:996] (2/4) Epoch 7, batch 11250, loss[loss=0.2167, simple_loss=0.307, pruned_loss=0.06317, over 21565.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.2992, pruned_loss=0.08213, over 4241254.95 frames. 
], batch size: 195, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:37:58,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.219e+02 2.901e+02 3.332e+02 3.824e+02 5.999e+02, threshold=6.664e+02, percent-clipped=0.0 2023-06-22 06:38:12,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1165362.0, ans=0.2 2023-06-22 06:38:34,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1165422.0, ans=0.125 2023-06-22 06:39:04,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1165482.0, ans=0.0 2023-06-22 06:39:06,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1165482.0, ans=0.125 2023-06-22 06:39:22,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1165542.0, ans=0.125 2023-06-22 06:39:28,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1165542.0, ans=0.125 2023-06-22 06:39:31,697 INFO [train.py:996] (2/4) Epoch 7, batch 11300, loss[loss=0.2159, simple_loss=0.2997, pruned_loss=0.06605, over 21846.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3009, pruned_loss=0.08164, over 4253107.84 frames. ], batch size: 98, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:39:37,195 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:39:48,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1165602.0, ans=0.125 2023-06-22 06:40:02,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1165662.0, ans=0.125 2023-06-22 06:40:21,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1165722.0, ans=0.0 2023-06-22 06:40:44,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1165782.0, ans=0.125 2023-06-22 06:40:58,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1165842.0, ans=0.1 2023-06-22 06:41:12,117 INFO [train.py:996] (2/4) Epoch 7, batch 11350, loss[loss=0.2545, simple_loss=0.3404, pruned_loss=0.08432, over 21619.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3039, pruned_loss=0.08108, over 4262801.24 frames. ], batch size: 389, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:41:23,493 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.328e+02 2.937e+02 3.595e+02 4.319e+02 9.423e+02, threshold=7.190e+02, percent-clipped=3.0 2023-06-22 06:41:43,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1165962.0, ans=0.015 2023-06-22 06:42:30,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1166082.0, ans=0.0 2023-06-22 06:42:59,005 INFO [train.py:996] (2/4) Epoch 7, batch 11400, loss[loss=0.2482, simple_loss=0.3253, pruned_loss=0.08553, over 21831.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3102, pruned_loss=0.08457, over 4264320.96 frames. 
], batch size: 282, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:43:05,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1166202.0, ans=0.125 2023-06-22 06:43:29,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1166262.0, ans=0.2 2023-06-22 06:43:29,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1166262.0, ans=0.0 2023-06-22 06:43:31,378 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:44:11,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1166382.0, ans=0.0 2023-06-22 06:44:19,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1166442.0, ans=0.1 2023-06-22 06:44:27,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1166442.0, ans=0.0 2023-06-22 06:44:29,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=15.0 2023-06-22 06:44:38,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1166502.0, ans=0.2 2023-06-22 06:44:40,298 INFO [train.py:996] (2/4) Epoch 7, batch 11450, loss[loss=0.2528, simple_loss=0.354, pruned_loss=0.07574, over 21230.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3107, pruned_loss=0.08347, over 4269839.06 frames. ], batch size: 549, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:44:52,081 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.327e+02 3.083e+02 3.885e+02 5.108e+02 7.985e+02, threshold=7.771e+02, percent-clipped=2.0 2023-06-22 06:45:09,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1166562.0, ans=0.0 2023-06-22 06:46:17,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1166742.0, ans=0.05 2023-06-22 06:46:21,781 INFO [train.py:996] (2/4) Epoch 7, batch 11500, loss[loss=0.2022, simple_loss=0.2587, pruned_loss=0.07285, over 20784.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3139, pruned_loss=0.0846, over 4263279.80 frames. ], batch size: 608, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:48:14,271 INFO [train.py:996] (2/4) Epoch 7, batch 11550, loss[loss=0.2804, simple_loss=0.3853, pruned_loss=0.08778, over 21249.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3211, pruned_loss=0.08545, over 4263582.86 frames. 
], batch size: 548, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:48:21,197 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.417e+02 3.086e+02 3.744e+02 4.289e+02 8.491e+02, threshold=7.488e+02, percent-clipped=1.0 2023-06-22 06:48:50,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1167162.0, ans=0.2 2023-06-22 06:49:03,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1167222.0, ans=0.1 2023-06-22 06:49:45,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1167342.0, ans=0.125 2023-06-22 06:49:47,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-22 06:49:51,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1167342.0, ans=0.125 2023-06-22 06:49:56,236 INFO [train.py:996] (2/4) Epoch 7, batch 11600, loss[loss=0.29, simple_loss=0.3952, pruned_loss=0.09243, over 21636.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3355, pruned_loss=0.08748, over 4262483.80 frames. ], batch size: 441, lr: 4.35e-03, grad_scale: 32.0 2023-06-22 06:49:59,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1167402.0, ans=0.125 2023-06-22 06:50:49,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1167522.0, ans=0.0 2023-06-22 06:51:01,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.62 vs. limit=15.0 2023-06-22 06:51:25,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1167642.0, ans=0.125 2023-06-22 06:51:37,741 INFO [train.py:996] (2/4) Epoch 7, batch 11650, loss[loss=0.3967, simple_loss=0.4555, pruned_loss=0.169, over 21437.00 frames. ], tot_loss[loss=0.2587, simple_loss=0.3412, pruned_loss=0.08815, over 4261841.87 frames. ], batch size: 507, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:51:52,648 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.564e+02 3.531e+02 4.483e+02 5.705e+02 9.764e+02, threshold=8.966e+02, percent-clipped=9.0 2023-06-22 06:52:17,189 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 06:52:31,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1167822.0, ans=0.125 2023-06-22 06:52:54,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1167882.0, ans=0.125 2023-06-22 06:53:18,230 INFO [train.py:996] (2/4) Epoch 7, batch 11700, loss[loss=0.2115, simple_loss=0.2866, pruned_loss=0.06816, over 21725.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3335, pruned_loss=0.08821, over 4270263.77 frames. ], batch size: 112, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:53:40,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.91 vs. 
limit=15.0 2023-06-22 06:53:54,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1168062.0, ans=0.0 2023-06-22 06:54:20,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1168182.0, ans=0.125 2023-06-22 06:54:41,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1168242.0, ans=0.09899494936611666 2023-06-22 06:54:56,767 INFO [train.py:996] (2/4) Epoch 7, batch 11750, loss[loss=0.25, simple_loss=0.3123, pruned_loss=0.0939, over 21665.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3234, pruned_loss=0.08685, over 4275565.78 frames. ], batch size: 298, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:55:11,697 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.565e+02 3.104e+02 3.664e+02 4.523e+02 8.929e+02, threshold=7.328e+02, percent-clipped=0.0 2023-06-22 06:55:34,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1168362.0, ans=0.015 2023-06-22 06:55:51,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1168422.0, ans=0.0 2023-06-22 06:56:11,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1168482.0, ans=0.1 2023-06-22 06:56:36,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1168542.0, ans=0.125 2023-06-22 06:56:44,766 INFO [train.py:996] (2/4) Epoch 7, batch 11800, loss[loss=0.2821, simple_loss=0.3453, pruned_loss=0.1095, over 21757.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3233, pruned_loss=0.08905, over 4279636.26 frames. ], batch size: 332, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:57:51,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1168782.0, ans=0.125 2023-06-22 06:57:54,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1168782.0, ans=0.125 2023-06-22 06:58:09,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1168842.0, ans=0.0 2023-06-22 06:58:25,529 INFO [train.py:996] (2/4) Epoch 7, batch 11850, loss[loss=0.2224, simple_loss=0.2828, pruned_loss=0.08101, over 16128.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3261, pruned_loss=0.0885, over 4269454.35 frames. ], batch size: 60, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 06:58:39,970 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.224e+02 3.745e+02 4.482e+02 9.714e+02, threshold=7.491e+02, percent-clipped=2.0 2023-06-22 06:58:49,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1168962.0, ans=0.125 2023-06-22 06:58:54,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.16 vs. 
limit=12.0 2023-06-22 06:59:17,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1169022.0, ans=0.125 2023-06-22 06:59:30,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1169082.0, ans=0.125 2023-06-22 07:00:02,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1169142.0, ans=0.2 2023-06-22 07:00:12,034 INFO [train.py:996] (2/4) Epoch 7, batch 11900, loss[loss=0.2227, simple_loss=0.3131, pruned_loss=0.06611, over 21581.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3268, pruned_loss=0.08598, over 4275013.19 frames. ], batch size: 389, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 07:00:28,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1169262.0, ans=0.0 2023-06-22 07:00:32,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1169262.0, ans=0.125 2023-06-22 07:00:45,570 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:01:27,810 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-22 07:01:35,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1169442.0, ans=0.1 2023-06-22 07:01:47,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1169502.0, ans=0.2 2023-06-22 07:01:48,746 INFO [train.py:996] (2/4) Epoch 7, batch 11950, loss[loss=0.2252, simple_loss=0.321, pruned_loss=0.06473, over 21637.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3279, pruned_loss=0.08281, over 4273088.30 frames. ], batch size: 441, lr: 4.35e-03, grad_scale: 16.0 2023-06-22 07:01:58,126 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.161e+02 3.043e+02 3.599e+02 4.818e+02 9.282e+02, threshold=7.198e+02, percent-clipped=3.0 2023-06-22 07:02:00,695 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=22.5 2023-06-22 07:03:05,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1169742.0, ans=0.125 2023-06-22 07:03:27,370 INFO [train.py:996] (2/4) Epoch 7, batch 12000, loss[loss=0.24, simple_loss=0.2955, pruned_loss=0.09221, over 21853.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3194, pruned_loss=0.08043, over 4271656.65 frames. ], batch size: 107, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:03:27,370 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 07:03:43,845 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2652, simple_loss=0.3601, pruned_loss=0.08515, over 1796401.00 frames. 
2023-06-22 07:03:43,846 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 07:03:44,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1169802.0, ans=0.125 2023-06-22 07:04:09,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1169862.0, ans=0.125 2023-06-22 07:04:11,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1169862.0, ans=0.125 2023-06-22 07:04:21,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-22 07:04:36,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1169922.0, ans=0.125 2023-06-22 07:04:48,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1169982.0, ans=0.125 2023-06-22 07:04:57,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1169982.0, ans=0.125 2023-06-22 07:05:07,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1170042.0, ans=0.0 2023-06-22 07:05:10,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1170042.0, ans=0.1 2023-06-22 07:05:21,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-22 07:05:23,288 INFO [train.py:996] (2/4) Epoch 7, batch 12050, loss[loss=0.2569, simple_loss=0.3612, pruned_loss=0.07628, over 19784.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3156, pruned_loss=0.08237, over 4269430.25 frames. ], batch size: 702, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:05:23,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1170102.0, ans=0.125 2023-06-22 07:05:33,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1170102.0, ans=0.125 2023-06-22 07:05:37,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.333e+02 3.086e+02 3.580e+02 4.845e+02 1.189e+03, threshold=7.160e+02, percent-clipped=3.0 2023-06-22 07:05:53,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-22 07:05:57,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1170162.0, ans=0.125 2023-06-22 07:06:01,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1170162.0, ans=0.0 2023-06-22 07:06:22,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1170222.0, ans=0.125 2023-06-22 07:07:09,345 INFO [train.py:996] (2/4) Epoch 7, batch 12100, loss[loss=0.3084, simple_loss=0.3752, pruned_loss=0.1208, over 21617.00 frames. 
], tot_loss[loss=0.2463, simple_loss=0.3197, pruned_loss=0.08644, over 4276496.54 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:07:31,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1170462.0, ans=0.1 2023-06-22 07:08:57,245 INFO [train.py:996] (2/4) Epoch 7, batch 12150, loss[loss=0.2369, simple_loss=0.3175, pruned_loss=0.07814, over 21542.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3242, pruned_loss=0.08635, over 4272128.82 frames. ], batch size: 230, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:09:06,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1170702.0, ans=0.1 2023-06-22 07:09:07,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.430e+02 3.401e+02 4.092e+02 5.164e+02 8.690e+02, threshold=8.185e+02, percent-clipped=4.0 2023-06-22 07:10:04,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1170882.0, ans=0.1 2023-06-22 07:10:12,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-22 07:10:35,913 INFO [train.py:996] (2/4) Epoch 7, batch 12200, loss[loss=0.2157, simple_loss=0.2824, pruned_loss=0.07456, over 21293.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3234, pruned_loss=0.08482, over 4266218.73 frames. ], batch size: 131, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:10:41,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.45 vs. limit=12.0 2023-06-22 07:11:25,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1171122.0, ans=0.0 2023-06-22 07:11:28,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1171122.0, ans=0.5 2023-06-22 07:12:13,474 INFO [train.py:996] (2/4) Epoch 7, batch 12250, loss[loss=0.195, simple_loss=0.2588, pruned_loss=0.06559, over 21158.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3146, pruned_loss=0.08061, over 4273846.34 frames. ], batch size: 143, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:12:18,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1171302.0, ans=0.2 2023-06-22 07:12:21,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1171302.0, ans=0.125 2023-06-22 07:12:24,061 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.960e+02 3.345e+02 4.478e+02 6.132e+02 1.246e+03, threshold=8.957e+02, percent-clipped=10.0 2023-06-22 07:12:50,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1171362.0, ans=0.0 2023-06-22 07:13:05,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1171422.0, ans=0.1 2023-06-22 07:13:29,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. 
limit=15.0 2023-06-22 07:13:52,881 INFO [train.py:996] (2/4) Epoch 7, batch 12300, loss[loss=0.2837, simple_loss=0.3728, pruned_loss=0.09733, over 21636.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3062, pruned_loss=0.07569, over 4267090.69 frames. ], batch size: 389, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:14:26,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1171662.0, ans=0.0 2023-06-22 07:14:39,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1171722.0, ans=0.0 2023-06-22 07:14:50,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1171782.0, ans=0.125 2023-06-22 07:14:52,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.32 vs. limit=22.5 2023-06-22 07:14:59,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1171782.0, ans=0.125 2023-06-22 07:15:26,372 INFO [train.py:996] (2/4) Epoch 7, batch 12350, loss[loss=0.249, simple_loss=0.3194, pruned_loss=0.08931, over 21404.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3107, pruned_loss=0.07709, over 4273980.95 frames. ], batch size: 211, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:15:37,085 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.789e+02 2.594e+02 3.277e+02 4.549e+02 8.356e+02, threshold=6.553e+02, percent-clipped=0.0 2023-06-22 07:15:48,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1171962.0, ans=0.125 2023-06-22 07:16:03,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1171962.0, ans=0.2 2023-06-22 07:16:04,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1171962.0, ans=0.0 2023-06-22 07:17:05,104 INFO [train.py:996] (2/4) Epoch 7, batch 12400, loss[loss=0.278, simple_loss=0.3368, pruned_loss=0.1096, over 21388.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3124, pruned_loss=0.08023, over 4271091.86 frames. ], batch size: 176, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:17:31,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1172262.0, ans=0.1 2023-06-22 07:18:27,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-06-22 07:18:44,307 INFO [train.py:996] (2/4) Epoch 7, batch 12450, loss[loss=0.1988, simple_loss=0.2776, pruned_loss=0.06005, over 17064.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3162, pruned_loss=0.08378, over 4270551.61 frames. ], batch size: 60, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:19:01,078 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.212e+02 3.229e+02 3.781e+02 4.439e+02 8.175e+02, threshold=7.562e+02, percent-clipped=5.0 2023-06-22 07:19:14,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.13 vs. 
limit=12.0 2023-06-22 07:19:27,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1172622.0, ans=0.035 2023-06-22 07:19:59,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1172682.0, ans=0.125 2023-06-22 07:20:20,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.66 vs. limit=15.0 2023-06-22 07:20:32,512 INFO [train.py:996] (2/4) Epoch 7, batch 12500, loss[loss=0.2784, simple_loss=0.3713, pruned_loss=0.09277, over 21776.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3284, pruned_loss=0.08789, over 4277529.71 frames. ], batch size: 282, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:21:34,360 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:21:34,974 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.91 vs. limit=15.0 2023-06-22 07:21:52,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1173042.0, ans=0.07 2023-06-22 07:22:14,662 INFO [train.py:996] (2/4) Epoch 7, batch 12550, loss[loss=0.2646, simple_loss=0.3374, pruned_loss=0.09594, over 21843.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3312, pruned_loss=0.08914, over 4276253.24 frames. ], batch size: 118, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:22:32,591 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.611e+02 3.175e+02 3.622e+02 4.685e+02 7.876e+02, threshold=7.244e+02, percent-clipped=1.0 2023-06-22 07:23:06,178 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.11 vs. limit=15.0 2023-06-22 07:23:51,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1173342.0, ans=0.125 2023-06-22 07:24:00,200 INFO [train.py:996] (2/4) Epoch 7, batch 12600, loss[loss=0.181, simple_loss=0.2478, pruned_loss=0.05708, over 21860.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3298, pruned_loss=0.08669, over 4279862.24 frames. ], batch size: 98, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:24:06,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1173402.0, ans=0.125 2023-06-22 07:24:24,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1173462.0, ans=0.0 2023-06-22 07:24:30,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1173462.0, ans=0.0 2023-06-22 07:24:33,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1173462.0, ans=0.2 2023-06-22 07:25:38,764 INFO [train.py:996] (2/4) Epoch 7, batch 12650, loss[loss=0.2182, simple_loss=0.2863, pruned_loss=0.07511, over 21774.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3223, pruned_loss=0.0833, over 4282036.38 frames. 
], batch size: 247, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:25:43,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1173702.0, ans=0.0 2023-06-22 07:25:51,197 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.376e+02 3.154e+02 3.639e+02 4.446e+02 1.064e+03, threshold=7.278e+02, percent-clipped=5.0 2023-06-22 07:25:51,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1173702.0, ans=0.1 2023-06-22 07:25:59,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.56 vs. limit=15.0 2023-06-22 07:26:21,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1173822.0, ans=0.125 2023-06-22 07:27:01,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-22 07:27:02,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1173942.0, ans=0.125 2023-06-22 07:27:19,804 INFO [train.py:996] (2/4) Epoch 7, batch 12700, loss[loss=0.2379, simple_loss=0.3145, pruned_loss=0.0807, over 21002.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3214, pruned_loss=0.08568, over 4284378.64 frames. ], batch size: 608, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:28:01,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1174122.0, ans=0.125 2023-06-22 07:28:10,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1174122.0, ans=0.125 2023-06-22 07:28:22,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1174182.0, ans=0.0 2023-06-22 07:28:32,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1174182.0, ans=0.0 2023-06-22 07:29:00,187 INFO [train.py:996] (2/4) Epoch 7, batch 12750, loss[loss=0.2651, simple_loss=0.3331, pruned_loss=0.09858, over 21843.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3231, pruned_loss=0.08645, over 4284016.91 frames. ], batch size: 118, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:29:17,965 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.566e+02 3.268e+02 3.639e+02 4.556e+02 7.416e+02, threshold=7.278e+02, percent-clipped=1.0 2023-06-22 07:29:26,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1174362.0, ans=0.1 2023-06-22 07:29:50,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1174422.0, ans=0.0 2023-06-22 07:30:17,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1174482.0, ans=0.0 2023-06-22 07:30:27,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-22 07:30:30,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.96 vs. 
limit=15.0 2023-06-22 07:30:40,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-22 07:30:44,405 INFO [train.py:996] (2/4) Epoch 7, batch 12800, loss[loss=0.2255, simple_loss=0.306, pruned_loss=0.07247, over 20092.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3226, pruned_loss=0.08703, over 4285312.52 frames. ], batch size: 704, lr: 4.34e-03, grad_scale: 32.0 2023-06-22 07:30:59,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1174662.0, ans=0.125 2023-06-22 07:31:26,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.45 vs. limit=10.0 2023-06-22 07:31:56,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1174782.0, ans=0.1 2023-06-22 07:32:25,154 INFO [train.py:996] (2/4) Epoch 7, batch 12850, loss[loss=0.2579, simple_loss=0.3322, pruned_loss=0.09182, over 21349.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3248, pruned_loss=0.08793, over 4285309.12 frames. ], batch size: 548, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:32:39,817 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.419e+02 3.112e+02 3.554e+02 4.407e+02 7.373e+02, threshold=7.108e+02, percent-clipped=1.0 2023-06-22 07:32:53,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1174962.0, ans=0.125 2023-06-22 07:33:28,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1175082.0, ans=0.125 2023-06-22 07:33:47,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1175082.0, ans=0.0 2023-06-22 07:33:58,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1175142.0, ans=0.125 2023-06-22 07:34:06,255 INFO [train.py:996] (2/4) Epoch 7, batch 12900, loss[loss=0.2506, simple_loss=0.3422, pruned_loss=0.07953, over 21644.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3205, pruned_loss=0.08334, over 4279577.90 frames. ], batch size: 414, lr: 4.34e-03, grad_scale: 16.0 2023-06-22 07:34:22,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1175202.0, ans=0.125 2023-06-22 07:35:15,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-22 07:35:44,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1175442.0, ans=0.125 2023-06-22 07:35:53,253 INFO [train.py:996] (2/4) Epoch 7, batch 12950, loss[loss=0.1958, simple_loss=0.2724, pruned_loss=0.05954, over 21385.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3206, pruned_loss=0.08252, over 4277834.19 frames. 
], batch size: 194, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:35:56,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1175502.0, ans=0.125 2023-06-22 07:35:58,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1175502.0, ans=0.125 2023-06-22 07:36:12,622 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 2.893e+02 3.599e+02 4.715e+02 8.391e+02, threshold=7.198e+02, percent-clipped=5.0 2023-06-22 07:36:27,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1175562.0, ans=0.05 2023-06-22 07:37:00,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1175682.0, ans=0.125 2023-06-22 07:37:33,412 INFO [train.py:996] (2/4) Epoch 7, batch 13000, loss[loss=0.1567, simple_loss=0.2173, pruned_loss=0.04803, over 21814.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3198, pruned_loss=0.08286, over 4280690.53 frames. ], batch size: 98, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:37:33,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1175802.0, ans=0.1 2023-06-22 07:37:35,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1175802.0, ans=0.125 2023-06-22 07:37:57,472 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=12.0 2023-06-22 07:38:17,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1175922.0, ans=0.125 2023-06-22 07:39:07,118 INFO [train.py:996] (2/4) Epoch 7, batch 13050, loss[loss=0.2271, simple_loss=0.2958, pruned_loss=0.07923, over 21850.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3168, pruned_loss=0.08133, over 4283624.44 frames. ], batch size: 298, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:39:30,662 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.865e+02 2.860e+02 3.531e+02 4.680e+02 1.133e+03, threshold=7.061e+02, percent-clipped=2.0 2023-06-22 07:39:32,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1176162.0, ans=0.0 2023-06-22 07:39:50,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1176162.0, ans=0.0 2023-06-22 07:39:56,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1176222.0, ans=0.125 2023-06-22 07:40:21,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.71 vs. limit=22.5 2023-06-22 07:40:30,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1176342.0, ans=0.0 2023-06-22 07:40:56,523 INFO [train.py:996] (2/4) Epoch 7, batch 13100, loss[loss=0.3246, simple_loss=0.4522, pruned_loss=0.0985, over 19755.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3193, pruned_loss=0.08185, over 4281106.09 frames. 
], batch size: 702, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:41:21,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1176462.0, ans=0.125 2023-06-22 07:41:37,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1176522.0, ans=0.0 2023-06-22 07:42:35,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1176642.0, ans=0.2 2023-06-22 07:42:37,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=12.0 2023-06-22 07:42:42,751 INFO [train.py:996] (2/4) Epoch 7, batch 13150, loss[loss=0.2197, simple_loss=0.279, pruned_loss=0.08014, over 21290.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3224, pruned_loss=0.0852, over 4278826.40 frames. ], batch size: 159, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:42:44,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1176702.0, ans=0.2 2023-06-22 07:42:50,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-22 07:43:01,879 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.295e+02 3.619e+02 4.512e+02 5.792e+02 9.632e+02, threshold=9.025e+02, percent-clipped=11.0 2023-06-22 07:43:09,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1176762.0, ans=0.025 2023-06-22 07:43:19,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=12.0 2023-06-22 07:43:19,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1176822.0, ans=22.5 2023-06-22 07:43:20,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1176822.0, ans=0.04949747468305833 2023-06-22 07:43:20,534 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:43:22,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1176822.0, ans=0.125 2023-06-22 07:44:01,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1176942.0, ans=0.0 2023-06-22 07:44:24,033 INFO [train.py:996] (2/4) Epoch 7, batch 13200, loss[loss=0.2543, simple_loss=0.3233, pruned_loss=0.09264, over 21618.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3226, pruned_loss=0.08609, over 4276792.68 frames. ], batch size: 230, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:44:45,659 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.737e-03 2023-06-22 07:44:56,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=1177062.0, ans=0.05 2023-06-22 07:44:59,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.98 vs. 
limit=22.5 2023-06-22 07:46:01,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1177242.0, ans=0.125 2023-06-22 07:46:03,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.92 vs. limit=10.0 2023-06-22 07:46:09,404 INFO [train.py:996] (2/4) Epoch 7, batch 13250, loss[loss=0.2175, simple_loss=0.3016, pruned_loss=0.06666, over 21638.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3215, pruned_loss=0.08752, over 4273103.59 frames. ], batch size: 230, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:46:09,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1177302.0, ans=0.125 2023-06-22 07:46:13,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-22 07:46:24,179 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.563e+02 3.281e+02 4.048e+02 5.234e+02 8.486e+02, threshold=8.096e+02, percent-clipped=0.0 2023-06-22 07:47:28,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1177542.0, ans=0.04949747468305833 2023-06-22 07:47:50,901 INFO [train.py:996] (2/4) Epoch 7, batch 13300, loss[loss=0.2566, simple_loss=0.3343, pruned_loss=0.08946, over 21776.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3239, pruned_loss=0.0859, over 4272681.32 frames. ], batch size: 124, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:47:56,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1177602.0, ans=0.0 2023-06-22 07:48:00,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1177602.0, ans=0.125 2023-06-22 07:48:07,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1177662.0, ans=0.125 2023-06-22 07:48:15,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1177662.0, ans=0.125 2023-06-22 07:48:21,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1177662.0, ans=0.1 2023-06-22 07:49:09,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1177782.0, ans=0.125 2023-06-22 07:49:28,816 INFO [train.py:996] (2/4) Epoch 7, batch 13350, loss[loss=0.2618, simple_loss=0.344, pruned_loss=0.08981, over 21799.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3276, pruned_loss=0.08846, over 4277036.37 frames. ], batch size: 282, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:49:43,390 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.412e+02 3.139e+02 3.531e+02 4.158e+02 7.079e+02, threshold=7.062e+02, percent-clipped=0.0 2023-06-22 07:50:24,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-22 07:50:53,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. 
limit=15.0 2023-06-22 07:50:53,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-22 07:51:08,298 INFO [train.py:996] (2/4) Epoch 7, batch 13400, loss[loss=0.2415, simple_loss=0.301, pruned_loss=0.09098, over 21417.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3293, pruned_loss=0.09027, over 4280733.32 frames. ], batch size: 211, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:51:11,012 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=22.5 2023-06-22 07:51:45,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1178262.0, ans=0.125 2023-06-22 07:52:33,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1178442.0, ans=0.125 2023-06-22 07:52:36,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1178442.0, ans=0.1 2023-06-22 07:52:48,419 INFO [train.py:996] (2/4) Epoch 7, batch 13450, loss[loss=0.2235, simple_loss=0.3009, pruned_loss=0.07306, over 20693.00 frames. ], tot_loss[loss=0.2568, simple_loss=0.3292, pruned_loss=0.09225, over 4274460.26 frames. ], batch size: 607, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:52:50,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1178502.0, ans=0.125 2023-06-22 07:53:12,721 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.492e+02 3.365e+02 3.946e+02 4.575e+02 8.284e+02, threshold=7.892e+02, percent-clipped=1.0 2023-06-22 07:53:30,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1178562.0, ans=0.125 2023-06-22 07:54:25,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1178742.0, ans=0.0 2023-06-22 07:54:28,479 INFO [train.py:996] (2/4) Epoch 7, batch 13500, loss[loss=0.2116, simple_loss=0.2989, pruned_loss=0.06215, over 21221.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3206, pruned_loss=0.08956, over 4273017.72 frames. ], batch size: 548, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:54:48,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1178802.0, ans=0.0 2023-06-22 07:55:04,252 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:55:40,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1178982.0, ans=0.0 2023-06-22 07:56:09,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1179042.0, ans=0.2 2023-06-22 07:56:15,636 INFO [train.py:996] (2/4) Epoch 7, batch 13550, loss[loss=0.2804, simple_loss=0.3589, pruned_loss=0.101, over 21268.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3255, pruned_loss=0.08888, over 4274525.14 frames. 
], batch size: 176, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 07:56:31,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1179102.0, ans=0.125 2023-06-22 07:56:36,012 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.476e+02 3.442e+02 4.149e+02 5.236e+02 8.278e+02, threshold=8.298e+02, percent-clipped=4.0 2023-06-22 07:56:38,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1179162.0, ans=0.125 2023-06-22 07:56:45,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1179162.0, ans=0.5 2023-06-22 07:57:42,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1179342.0, ans=0.125 2023-06-22 07:57:54,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-22 07:57:54,921 INFO [train.py:996] (2/4) Epoch 7, batch 13600, loss[loss=0.2318, simple_loss=0.3087, pruned_loss=0.07743, over 21581.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3262, pruned_loss=0.08889, over 4274330.36 frames. ], batch size: 548, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:58:07,873 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 07:58:18,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1179462.0, ans=0.0 2023-06-22 07:58:51,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1179582.0, ans=0.125 2023-06-22 07:59:34,112 INFO [train.py:996] (2/4) Epoch 7, batch 13650, loss[loss=0.2117, simple_loss=0.2754, pruned_loss=0.07399, over 21876.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3223, pruned_loss=0.08593, over 4266838.63 frames. ], batch size: 107, lr: 4.33e-03, grad_scale: 32.0 2023-06-22 07:59:50,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1179702.0, ans=0.04949747468305833 2023-06-22 07:59:54,406 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.889e+02 2.926e+02 3.440e+02 4.459e+02 9.365e+02, threshold=6.879e+02, percent-clipped=1.0 2023-06-22 08:00:14,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1179822.0, ans=0.1 2023-06-22 08:00:16,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-22 08:00:35,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1179882.0, ans=0.125 2023-06-22 08:00:37,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.15 vs. limit=22.5 2023-06-22 08:00:51,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1179942.0, ans=0.125 2023-06-22 08:01:13,427 INFO [train.py:996] (2/4) Epoch 7, batch 13700, loss[loss=0.2313, simple_loss=0.2994, pruned_loss=0.08162, over 21631.00 frames. 
], tot_loss[loss=0.2428, simple_loss=0.3152, pruned_loss=0.0852, over 4258329.45 frames. ], batch size: 263, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:01:51,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.21 vs. limit=15.0 2023-06-22 08:02:23,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1180182.0, ans=0.2 2023-06-22 08:02:59,600 INFO [train.py:996] (2/4) Epoch 7, batch 13750, loss[loss=0.1964, simple_loss=0.2442, pruned_loss=0.0743, over 21790.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3114, pruned_loss=0.084, over 4264356.33 frames. ], batch size: 102, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:03:06,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1180302.0, ans=0.0 2023-06-22 08:03:20,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1180362.0, ans=0.0 2023-06-22 08:03:23,019 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.446e+02 3.304e+02 4.106e+02 4.985e+02 1.123e+03, threshold=8.212e+02, percent-clipped=9.0 2023-06-22 08:03:28,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1180362.0, ans=0.0 2023-06-22 08:04:08,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1180482.0, ans=0.1 2023-06-22 08:04:32,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1180542.0, ans=0.2 2023-06-22 08:04:48,519 INFO [train.py:996] (2/4) Epoch 7, batch 13800, loss[loss=0.2294, simple_loss=0.3062, pruned_loss=0.07628, over 21134.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3203, pruned_loss=0.0833, over 4259708.66 frames. ], batch size: 607, lr: 4.33e-03, grad_scale: 16.0 2023-06-22 08:05:25,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1180722.0, ans=0.125 2023-06-22 08:05:52,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1180782.0, ans=0.1 2023-06-22 08:06:02,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1180782.0, ans=0.2 2023-06-22 08:06:29,555 INFO [train.py:996] (2/4) Epoch 7, batch 13850, loss[loss=0.3176, simple_loss=0.3874, pruned_loss=0.1239, over 21722.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3246, pruned_loss=0.08403, over 4262201.95 frames. 
], batch size: 441, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:06:51,702 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.107e+02 3.586e+02 4.613e+02 6.020e+02 1.189e+03, threshold=9.227e+02, percent-clipped=5.0 2023-06-22 08:06:53,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1180962.0, ans=0.125 2023-06-22 08:06:57,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1180962.0, ans=0.125 2023-06-22 08:07:16,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1181022.0, ans=0.5 2023-06-22 08:07:38,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1181082.0, ans=0.0 2023-06-22 08:07:55,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1181142.0, ans=0.0 2023-06-22 08:07:58,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1181142.0, ans=0.1 2023-06-22 08:08:08,723 INFO [train.py:996] (2/4) Epoch 7, batch 13900, loss[loss=0.2281, simple_loss=0.2973, pruned_loss=0.07941, over 21806.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3283, pruned_loss=0.08726, over 4261567.47 frames. ], batch size: 298, lr: 4.32e-03, grad_scale: 8.0 2023-06-22 08:08:46,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.64 vs. limit=22.5 2023-06-22 08:09:09,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=22.5 2023-06-22 08:09:49,796 INFO [train.py:996] (2/4) Epoch 7, batch 13950, loss[loss=0.2207, simple_loss=0.2961, pruned_loss=0.07269, over 21688.00 frames. ], tot_loss[loss=0.2539, simple_loss=0.3295, pruned_loss=0.08918, over 4272860.59 frames. ], batch size: 263, lr: 4.32e-03, grad_scale: 8.0 2023-06-22 08:10:18,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.662e+02 3.399e+02 3.923e+02 4.848e+02 6.986e+02, threshold=7.845e+02, percent-clipped=0.0 2023-06-22 08:10:36,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1181622.0, ans=0.2 2023-06-22 08:10:39,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.59 vs. limit=15.0 2023-06-22 08:10:44,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1181622.0, ans=0.07 2023-06-22 08:10:48,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1181622.0, ans=0.2 2023-06-22 08:10:55,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1181682.0, ans=0.0 2023-06-22 08:11:16,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1181742.0, ans=0.0 2023-06-22 08:11:28,463 INFO [train.py:996] (2/4) Epoch 7, batch 14000, loss[loss=0.2183, simple_loss=0.3185, pruned_loss=0.05903, over 21597.00 frames. 
], tot_loss[loss=0.2501, simple_loss=0.3261, pruned_loss=0.08705, over 4258083.08 frames. ], batch size: 230, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:11:44,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1181802.0, ans=0.2 2023-06-22 08:11:57,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1181862.0, ans=0.1 2023-06-22 08:12:23,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=15.0 2023-06-22 08:13:10,908 INFO [train.py:996] (2/4) Epoch 7, batch 14050, loss[loss=0.2159, simple_loss=0.2712, pruned_loss=0.08031, over 21480.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3204, pruned_loss=0.0834, over 4264296.39 frames. ], batch size: 195, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:13:34,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.158e+02 3.004e+02 3.495e+02 4.384e+02 1.047e+03, threshold=6.990e+02, percent-clipped=3.0 2023-06-22 08:13:53,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-22 08:14:20,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.23 vs. limit=10.0 2023-06-22 08:14:39,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1182342.0, ans=0.0 2023-06-22 08:14:49,760 INFO [train.py:996] (2/4) Epoch 7, batch 14100, loss[loss=0.287, simple_loss=0.3397, pruned_loss=0.1172, over 21466.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.314, pruned_loss=0.08333, over 4264708.39 frames. ], batch size: 194, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:14:58,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=22.5 2023-06-22 08:15:26,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1182462.0, ans=0.0 2023-06-22 08:15:33,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.83 vs. limit=12.0 2023-06-22 08:15:44,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1182522.0, ans=0.1 2023-06-22 08:16:21,794 INFO [train.py:996] (2/4) Epoch 7, batch 14150, loss[loss=0.239, simple_loss=0.3234, pruned_loss=0.07735, over 21771.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3166, pruned_loss=0.08445, over 4270829.23 frames. 
], batch size: 118, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:16:26,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1182702.0, ans=0.5 2023-06-22 08:16:35,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1182762.0, ans=0.125 2023-06-22 08:16:44,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 2.878e+02 3.254e+02 3.924e+02 9.436e+02, threshold=6.508e+02, percent-clipped=4.0 2023-06-22 08:16:49,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-22 08:17:37,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1182942.0, ans=0.1 2023-06-22 08:17:39,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1182942.0, ans=0.1 2023-06-22 08:17:57,549 INFO [train.py:996] (2/4) Epoch 7, batch 14200, loss[loss=0.2308, simple_loss=0.2951, pruned_loss=0.08326, over 21654.00 frames. ], tot_loss[loss=0.241, simple_loss=0.316, pruned_loss=0.08304, over 4259606.47 frames. ], batch size: 332, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:18:27,049 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.21 vs. limit=12.0 2023-06-22 08:18:58,784 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-22 08:19:01,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1183182.0, ans=0.0 2023-06-22 08:19:07,872 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:19:32,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1183242.0, ans=0.125 2023-06-22 08:19:36,400 INFO [train.py:996] (2/4) Epoch 7, batch 14250, loss[loss=0.2251, simple_loss=0.3027, pruned_loss=0.07377, over 21505.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3113, pruned_loss=0.08357, over 4255904.65 frames. ], batch size: 509, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:19:51,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2023-06-22 08:19:55,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.296e+02 2.870e+02 3.314e+02 3.996e+02 6.865e+02, threshold=6.627e+02, percent-clipped=2.0 2023-06-22 08:21:12,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1183542.0, ans=0.125 2023-06-22 08:21:16,215 INFO [train.py:996] (2/4) Epoch 7, batch 14300, loss[loss=0.2058, simple_loss=0.2903, pruned_loss=0.06067, over 21165.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3117, pruned_loss=0.08276, over 4246130.76 frames. 
], batch size: 548, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:21:27,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1183602.0, ans=0.0 2023-06-22 08:21:43,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1183662.0, ans=0.125 2023-06-22 08:22:56,608 INFO [train.py:996] (2/4) Epoch 7, batch 14350, loss[loss=0.2483, simple_loss=0.3274, pruned_loss=0.08457, over 21769.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3138, pruned_loss=0.08192, over 4229348.37 frames. ], batch size: 441, lr: 4.32e-03, grad_scale: 16.0 2023-06-22 08:22:56,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1183902.0, ans=0.125 2023-06-22 08:23:00,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1183902.0, ans=0.125 2023-06-22 08:23:15,122 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 3.408e+02 4.555e+02 6.047e+02 1.523e+03, threshold=9.110e+02, percent-clipped=21.0 2023-06-22 08:23:24,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1183962.0, ans=0.0 2023-06-22 08:24:00,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1184082.0, ans=0.125 2023-06-22 08:24:34,923 INFO [train.py:996] (2/4) Epoch 7, batch 14400, loss[loss=0.262, simple_loss=0.3143, pruned_loss=0.1048, over 21998.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3134, pruned_loss=0.08343, over 4243753.23 frames. ], batch size: 103, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:24:38,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1184202.0, ans=0.0 2023-06-22 08:25:04,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1184262.0, ans=0.2 2023-06-22 08:26:11,481 INFO [train.py:996] (2/4) Epoch 7, batch 14450, loss[loss=0.216, simple_loss=0.2802, pruned_loss=0.07588, over 21829.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3094, pruned_loss=0.08417, over 4256995.10 frames. ], batch size: 283, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:26:21,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1184502.0, ans=0.2 2023-06-22 08:26:30,343 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.298e+02 2.987e+02 3.327e+02 4.057e+02 7.605e+02, threshold=6.653e+02, percent-clipped=0.0 2023-06-22 08:27:52,115 INFO [train.py:996] (2/4) Epoch 7, batch 14500, loss[loss=0.2418, simple_loss=0.3209, pruned_loss=0.08131, over 21553.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3071, pruned_loss=0.0839, over 4266237.51 frames. 
], batch size: 441, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:28:01,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1184802.0, ans=0.1 2023-06-22 08:28:01,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1184802.0, ans=0.125 2023-06-22 08:29:28,322 INFO [train.py:996] (2/4) Epoch 7, batch 14550, loss[loss=0.2792, simple_loss=0.3464, pruned_loss=0.106, over 21388.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3131, pruned_loss=0.08548, over 4269669.88 frames. ], batch size: 176, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:29:36,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1185102.0, ans=0.125 2023-06-22 08:29:57,825 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.305e+02 3.217e+02 4.103e+02 5.336e+02 9.308e+02, threshold=8.206e+02, percent-clipped=6.0 2023-06-22 08:30:14,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1185222.0, ans=0.0 2023-06-22 08:30:22,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1185222.0, ans=0.05 2023-06-22 08:30:26,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.38 vs. limit=10.0 2023-06-22 08:30:34,098 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.71 vs. limit=6.0 2023-06-22 08:30:36,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1185282.0, ans=0.0 2023-06-22 08:31:09,729 INFO [train.py:996] (2/4) Epoch 7, batch 14600, loss[loss=0.2338, simple_loss=0.3299, pruned_loss=0.06889, over 19706.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3215, pruned_loss=0.09048, over 4271888.68 frames. ], batch size: 702, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:31:58,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1185522.0, ans=0.0 2023-06-22 08:32:07,780 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=19.87 vs. limit=15.0 2023-06-22 08:32:23,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1185582.0, ans=0.125 2023-06-22 08:32:31,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-22 08:32:48,042 INFO [train.py:996] (2/4) Epoch 7, batch 14650, loss[loss=0.3144, simple_loss=0.4, pruned_loss=0.1144, over 21209.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3235, pruned_loss=0.08948, over 4262825.43 frames. 
], batch size: 548, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:33:22,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.190e+02 2.921e+02 3.378e+02 4.532e+02 7.463e+02, threshold=6.756e+02, percent-clipped=1.0 2023-06-22 08:33:32,837 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:33:37,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1185822.0, ans=0.1 2023-06-22 08:34:20,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1185942.0, ans=0.025 2023-06-22 08:34:28,131 INFO [train.py:996] (2/4) Epoch 7, batch 14700, loss[loss=0.2314, simple_loss=0.2908, pruned_loss=0.086, over 16287.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3164, pruned_loss=0.08325, over 4252059.79 frames. ], batch size: 61, lr: 4.32e-03, grad_scale: 32.0 2023-06-22 08:34:30,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1186002.0, ans=0.125 2023-06-22 08:34:45,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1186002.0, ans=0.1 2023-06-22 08:34:46,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1186002.0, ans=0.125 2023-06-22 08:35:12,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1186062.0, ans=0.125 2023-06-22 08:35:44,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1186182.0, ans=0.0 2023-06-22 08:36:19,354 INFO [train.py:996] (2/4) Epoch 7, batch 14750, loss[loss=0.229, simple_loss=0.3196, pruned_loss=0.06925, over 20974.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.322, pruned_loss=0.0854, over 4253552.05 frames. ], batch size: 608, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:36:36,577 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-22 08:36:45,469 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 3.126e+02 3.786e+02 4.508e+02 7.747e+02, threshold=7.572e+02, percent-clipped=1.0 2023-06-22 08:37:05,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1186422.0, ans=0.0 2023-06-22 08:37:16,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1186482.0, ans=0.125 2023-06-22 08:38:03,832 INFO [train.py:996] (2/4) Epoch 7, batch 14800, loss[loss=0.2168, simple_loss=0.2825, pruned_loss=0.07556, over 21831.00 frames. ], tot_loss[loss=0.2559, simple_loss=0.3327, pruned_loss=0.08955, over 4259312.74 frames. ], batch size: 118, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:38:11,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.81 vs. 
limit=6.0 2023-06-22 08:38:50,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1186722.0, ans=0.1 2023-06-22 08:39:02,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1186782.0, ans=0.2 2023-06-22 08:39:25,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1186842.0, ans=0.0 2023-06-22 08:39:45,579 INFO [train.py:996] (2/4) Epoch 7, batch 14850, loss[loss=0.2135, simple_loss=0.2989, pruned_loss=0.0641, over 20072.00 frames. ], tot_loss[loss=0.2525, simple_loss=0.3263, pruned_loss=0.08928, over 4259092.29 frames. ], batch size: 704, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:39:54,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1186902.0, ans=0.1 2023-06-22 08:40:04,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1186902.0, ans=0.1 2023-06-22 08:40:06,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.91 vs. limit=22.5 2023-06-22 08:40:12,136 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.490e+02 3.436e+02 3.807e+02 4.957e+02 1.167e+03, threshold=7.615e+02, percent-clipped=4.0 2023-06-22 08:40:19,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1186962.0, ans=0.125 2023-06-22 08:40:24,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1187022.0, ans=0.125 2023-06-22 08:41:04,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1187082.0, ans=0.2 2023-06-22 08:41:32,004 INFO [train.py:996] (2/4) Epoch 7, batch 14900, loss[loss=0.2808, simple_loss=0.3425, pruned_loss=0.1096, over 21567.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3296, pruned_loss=0.09192, over 4262697.00 frames. ], batch size: 230, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 08:42:10,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.90 vs. limit=15.0 2023-06-22 08:42:14,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1187322.0, ans=0.125 2023-06-22 08:42:31,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-22 08:42:39,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1187382.0, ans=0.0 2023-06-22 08:42:45,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1187382.0, ans=0.125 2023-06-22 08:42:45,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.40 vs. limit=15.0 2023-06-22 08:43:05,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.93 vs. 
limit=12.0 2023-06-22 08:43:12,800 INFO [train.py:996] (2/4) Epoch 7, batch 14950, loss[loss=0.2747, simple_loss=0.3498, pruned_loss=0.09977, over 21419.00 frames. ], tot_loss[loss=0.2553, simple_loss=0.3292, pruned_loss=0.09073, over 4260321.66 frames. ], batch size: 131, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:43:13,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1187502.0, ans=0.0 2023-06-22 08:43:39,822 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.264e+02 3.667e+02 4.078e+02 7.613e+02, threshold=7.333e+02, percent-clipped=0.0 2023-06-22 08:44:52,918 INFO [train.py:996] (2/4) Epoch 7, batch 15000, loss[loss=0.2275, simple_loss=0.3042, pruned_loss=0.07545, over 19988.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3313, pruned_loss=0.09187, over 4259719.56 frames. ], batch size: 702, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:44:52,919 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 08:45:09,861 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2588, simple_loss=0.3554, pruned_loss=0.08105, over 1796401.00 frames. 2023-06-22 08:45:09,862 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 08:45:47,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1187862.0, ans=0.125 2023-06-22 08:45:47,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1187862.0, ans=0.125 2023-06-22 08:45:49,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1187862.0, ans=0.0 2023-06-22 08:45:49,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1187862.0, ans=0.2 2023-06-22 08:46:02,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1187922.0, ans=0.0 2023-06-22 08:46:10,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-06-22 08:46:20,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1187982.0, ans=0.2 2023-06-22 08:46:23,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1187982.0, ans=0.125 2023-06-22 08:46:56,310 INFO [train.py:996] (2/4) Epoch 7, batch 15050, loss[loss=0.3108, simple_loss=0.4014, pruned_loss=0.1101, over 21578.00 frames. ], tot_loss[loss=0.2603, simple_loss=0.3333, pruned_loss=0.09368, over 4256706.59 frames. 
], batch size: 471, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:46:58,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1188102.0, ans=0.0 2023-06-22 08:47:27,948 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 3.352e+02 4.069e+02 4.839e+02 9.529e+02, threshold=8.138e+02, percent-clipped=2.0 2023-06-22 08:47:45,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1188222.0, ans=0.125 2023-06-22 08:47:57,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1188282.0, ans=0.125 2023-06-22 08:48:16,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1188342.0, ans=0.125 2023-06-22 08:48:27,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1188342.0, ans=0.125 2023-06-22 08:48:39,529 INFO [train.py:996] (2/4) Epoch 7, batch 15100, loss[loss=0.3233, simple_loss=0.3867, pruned_loss=0.13, over 21438.00 frames. ], tot_loss[loss=0.2611, simple_loss=0.3352, pruned_loss=0.09349, over 4262888.46 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:48:54,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-22 08:49:02,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1188462.0, ans=0.0 2023-06-22 08:49:35,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1188522.0, ans=0.04949747468305833 2023-06-22 08:50:00,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1188642.0, ans=0.125 2023-06-22 08:50:10,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-22 08:50:18,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.32 vs. limit=22.5 2023-06-22 08:50:19,327 INFO [train.py:996] (2/4) Epoch 7, batch 15150, loss[loss=0.2267, simple_loss=0.2876, pruned_loss=0.08295, over 21427.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3301, pruned_loss=0.09322, over 4271246.19 frames. ], batch size: 389, lr: 4.31e-03, grad_scale: 8.0 2023-06-22 08:50:38,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1188702.0, ans=0.0 2023-06-22 08:50:41,748 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 08:50:49,047 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.627e+02 3.254e+02 3.801e+02 4.686e+02 8.027e+02, threshold=7.602e+02, percent-clipped=0.0 2023-06-22 08:52:04,633 INFO [train.py:996] (2/4) Epoch 7, batch 15200, loss[loss=0.232, simple_loss=0.3168, pruned_loss=0.07356, over 21853.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.322, pruned_loss=0.08949, over 4266716.59 frames. 
], batch size: 372, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:52:05,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1189002.0, ans=0.125 2023-06-22 08:52:11,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.76 vs. limit=15.0 2023-06-22 08:52:45,535 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.27 vs. limit=22.5 2023-06-22 08:53:06,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0 2023-06-22 08:53:43,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1189302.0, ans=0.1 2023-06-22 08:53:44,163 INFO [train.py:996] (2/4) Epoch 7, batch 15250, loss[loss=0.2641, simple_loss=0.3095, pruned_loss=0.1094, over 21317.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3157, pruned_loss=0.08784, over 4262467.12 frames. ], batch size: 473, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:54:13,563 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.144e+02 3.041e+02 3.715e+02 4.659e+02 9.808e+02, threshold=7.430e+02, percent-clipped=2.0 2023-06-22 08:54:13,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1189362.0, ans=0.02 2023-06-22 08:54:26,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1189422.0, ans=0.0 2023-06-22 08:54:33,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1189422.0, ans=0.125 2023-06-22 08:55:09,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1189542.0, ans=10.0 2023-06-22 08:55:09,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1189542.0, ans=0.04949747468305833 2023-06-22 08:55:22,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1189542.0, ans=0.125 2023-06-22 08:55:25,347 INFO [train.py:996] (2/4) Epoch 7, batch 15300, loss[loss=0.3046, simple_loss=0.3651, pruned_loss=0.1221, over 21442.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3189, pruned_loss=0.09059, over 4255899.28 frames. ], batch size: 471, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:55:51,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1189662.0, ans=0.125 2023-06-22 08:55:53,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1189662.0, ans=0.125 2023-06-22 08:56:04,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1189722.0, ans=0.125 2023-06-22 08:57:04,784 INFO [train.py:996] (2/4) Epoch 7, batch 15350, loss[loss=0.2687, simple_loss=0.3447, pruned_loss=0.09636, over 21488.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.324, pruned_loss=0.09289, over 4262623.35 frames. 
], batch size: 211, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:57:33,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.624e+02 3.368e+02 3.940e+02 5.271e+02 1.051e+03, threshold=7.879e+02, percent-clipped=5.0 2023-06-22 08:58:02,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.51 vs. limit=12.0 2023-06-22 08:58:03,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1190082.0, ans=0.0 2023-06-22 08:58:43,311 INFO [train.py:996] (2/4) Epoch 7, batch 15400, loss[loss=0.2397, simple_loss=0.3184, pruned_loss=0.08047, over 21843.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3245, pruned_loss=0.09148, over 4275339.62 frames. ], batch size: 124, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 08:58:57,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1190262.0, ans=0.125 2023-06-22 08:59:11,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1190262.0, ans=0.0 2023-06-22 08:59:16,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1190262.0, ans=0.0 2023-06-22 08:59:20,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.74 vs. limit=15.0 2023-06-22 08:59:24,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1190322.0, ans=0.125 2023-06-22 08:59:29,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1190322.0, ans=0.125 2023-06-22 09:00:16,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1190442.0, ans=0.2 2023-06-22 09:00:22,616 INFO [train.py:996] (2/4) Epoch 7, batch 15450, loss[loss=0.2559, simple_loss=0.34, pruned_loss=0.08588, over 21758.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3225, pruned_loss=0.09058, over 4274532.94 frames. ], batch size: 414, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:00:30,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1190502.0, ans=0.5 2023-06-22 09:00:46,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1190562.0, ans=0.125 2023-06-22 09:00:51,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.325e+02 2.924e+02 3.383e+02 4.121e+02 7.553e+02, threshold=6.767e+02, percent-clipped=0.0 2023-06-22 09:01:11,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1190622.0, ans=0.0 2023-06-22 09:01:49,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1190742.0, ans=0.125 2023-06-22 09:01:57,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1190742.0, ans=0.125 2023-06-22 09:02:02,959 INFO [train.py:996] (2/4) Epoch 7, batch 15500, loss[loss=0.2857, simple_loss=0.3556, pruned_loss=0.1079, over 21588.00 frames. 
], tot_loss[loss=0.2519, simple_loss=0.3248, pruned_loss=0.08948, over 4252614.67 frames. ], batch size: 263, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:02:23,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.03 vs. limit=6.0 2023-06-22 09:02:37,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1190862.0, ans=0.125 2023-06-22 09:02:51,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.33 vs. limit=15.0 2023-06-22 09:02:57,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1190922.0, ans=0.1 2023-06-22 09:03:37,238 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:03:48,150 INFO [train.py:996] (2/4) Epoch 7, batch 15550, loss[loss=0.2093, simple_loss=0.2901, pruned_loss=0.06428, over 21712.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3234, pruned_loss=0.08633, over 4253662.77 frames. ], batch size: 247, lr: 4.31e-03, grad_scale: 16.0 2023-06-22 09:03:53,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1191102.0, ans=0.0 2023-06-22 09:04:06,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1191162.0, ans=0.1 2023-06-22 09:04:11,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1191162.0, ans=0.125 2023-06-22 09:04:12,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.221e+02 3.104e+02 3.542e+02 4.427e+02 7.965e+02, threshold=7.084e+02, percent-clipped=2.0 2023-06-22 09:05:21,968 INFO [train.py:996] (2/4) Epoch 7, batch 15600, loss[loss=0.2401, simple_loss=0.29, pruned_loss=0.09514, over 21410.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3197, pruned_loss=0.08492, over 4243742.29 frames. ], batch size: 212, lr: 4.31e-03, grad_scale: 32.0 2023-06-22 09:06:01,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1191462.0, ans=0.0 2023-06-22 09:06:44,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.08 vs. limit=12.0 2023-06-22 09:07:01,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.62 vs. limit=8.0 2023-06-22 09:07:08,479 INFO [train.py:996] (2/4) Epoch 7, batch 15650, loss[loss=0.2096, simple_loss=0.296, pruned_loss=0.06155, over 21590.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3185, pruned_loss=0.08458, over 4234841.72 frames. 
], batch size: 263, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:07:38,620 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.422e+02 3.201e+02 3.774e+02 4.746e+02 8.455e+02, threshold=7.547e+02, percent-clipped=5.0 2023-06-22 09:07:43,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1191822.0, ans=0.125 2023-06-22 09:08:09,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1191882.0, ans=0.125 2023-06-22 09:08:24,418 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:08:47,620 INFO [train.py:996] (2/4) Epoch 7, batch 15700, loss[loss=0.2241, simple_loss=0.3121, pruned_loss=0.06802, over 21860.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3143, pruned_loss=0.08322, over 4236604.92 frames. ], batch size: 372, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:09:13,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.25 vs. limit=15.0 2023-06-22 09:09:44,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1192182.0, ans=0.125 2023-06-22 09:09:46,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1192182.0, ans=0.125 2023-06-22 09:10:27,332 INFO [train.py:996] (2/4) Epoch 7, batch 15750, loss[loss=0.2112, simple_loss=0.2875, pruned_loss=0.06743, over 21691.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3088, pruned_loss=0.08263, over 4245171.25 frames. ], batch size: 282, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:10:38,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1192302.0, ans=0.1 2023-06-22 09:10:56,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.176e+02 3.735e+02 4.754e+02 7.774e+02, threshold=7.471e+02, percent-clipped=1.0 2023-06-22 09:11:12,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1192422.0, ans=0.125 2023-06-22 09:12:06,988 INFO [train.py:996] (2/4) Epoch 7, batch 15800, loss[loss=0.2301, simple_loss=0.2967, pruned_loss=0.08178, over 21745.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3039, pruned_loss=0.08253, over 4256002.55 frames. ], batch size: 351, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:12:16,602 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:12:40,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1192662.0, ans=0.125 2023-06-22 09:13:45,628 INFO [train.py:996] (2/4) Epoch 7, batch 15850, loss[loss=0.2475, simple_loss=0.3141, pruned_loss=0.09044, over 21902.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3045, pruned_loss=0.08473, over 4260758.16 frames. ], batch size: 317, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:13:48,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.57 vs. 
limit=15.0 2023-06-22 09:13:53,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1192902.0, ans=0.125 2023-06-22 09:14:00,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1192962.0, ans=0.0 2023-06-22 09:14:00,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1192962.0, ans=0.2 2023-06-22 09:14:03,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1192962.0, ans=15.0 2023-06-22 09:14:15,709 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.357e+02 3.067e+02 3.802e+02 4.626e+02 8.154e+02, threshold=7.604e+02, percent-clipped=3.0 2023-06-22 09:14:24,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.38 vs. limit=15.0 2023-06-22 09:14:39,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1193022.0, ans=0.07 2023-06-22 09:14:45,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1193022.0, ans=0.1 2023-06-22 09:15:26,013 INFO [train.py:996] (2/4) Epoch 7, batch 15900, loss[loss=0.2363, simple_loss=0.3013, pruned_loss=0.08569, over 21861.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3016, pruned_loss=0.08462, over 4259033.30 frames. ], batch size: 107, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:16:15,270 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-22 09:16:38,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1193382.0, ans=0.125 2023-06-22 09:16:41,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1193382.0, ans=0.0 2023-06-22 09:17:05,280 INFO [train.py:996] (2/4) Epoch 7, batch 15950, loss[loss=0.2627, simple_loss=0.3422, pruned_loss=0.09163, over 21600.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3036, pruned_loss=0.08357, over 4246985.82 frames. ], batch size: 414, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:17:28,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1193562.0, ans=0.0 2023-06-22 09:17:31,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.361e+02 3.037e+02 3.517e+02 4.251e+02 9.007e+02, threshold=7.034e+02, percent-clipped=1.0 2023-06-22 09:17:44,801 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:17:51,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.78 vs. limit=15.0 2023-06-22 09:18:46,869 INFO [train.py:996] (2/4) Epoch 7, batch 16000, loss[loss=0.2369, simple_loss=0.328, pruned_loss=0.07289, over 21759.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.304, pruned_loss=0.08077, over 4246833.45 frames. 
], batch size: 298, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:18:51,341 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-22 09:19:41,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1193922.0, ans=0.125 2023-06-22 09:20:06,795 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-22 09:20:16,522 INFO [train.py:996] (2/4) Epoch 7, batch 16050, loss[loss=0.2374, simple_loss=0.3275, pruned_loss=0.07365, over 21699.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3064, pruned_loss=0.07835, over 4254526.75 frames. ], batch size: 247, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:20:34,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1194102.0, ans=0.125 2023-06-22 09:20:47,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.409e+02 3.171e+02 3.896e+02 5.247e+02 9.817e+02, threshold=7.791e+02, percent-clipped=4.0 2023-06-22 09:21:07,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=15.0 2023-06-22 09:21:07,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1194222.0, ans=0.0 2023-06-22 09:21:55,838 INFO [train.py:996] (2/4) Epoch 7, batch 16100, loss[loss=0.2286, simple_loss=0.3025, pruned_loss=0.0773, over 21861.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3135, pruned_loss=0.08019, over 4258906.51 frames. ], batch size: 298, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:22:50,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1194522.0, ans=0.125 2023-06-22 09:23:01,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1194582.0, ans=0.0 2023-06-22 09:23:11,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1194582.0, ans=0.125 2023-06-22 09:23:11,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1194582.0, ans=0.125 2023-06-22 09:23:35,217 INFO [train.py:996] (2/4) Epoch 7, batch 16150, loss[loss=0.277, simple_loss=0.338, pruned_loss=0.108, over 21786.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.315, pruned_loss=0.08294, over 4270187.12 frames. ], batch size: 112, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:23:54,216 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.55 vs. 
limit=22.5 2023-06-22 09:24:08,131 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.102e+02 3.921e+02 4.852e+02 9.563e+02, threshold=7.842e+02, percent-clipped=2.0 2023-06-22 09:24:18,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1194822.0, ans=0.1 2023-06-22 09:24:38,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1194882.0, ans=0.1 2023-06-22 09:24:52,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1194882.0, ans=0.125 2023-06-22 09:24:57,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1194942.0, ans=0.0 2023-06-22 09:25:18,416 INFO [train.py:996] (2/4) Epoch 7, batch 16200, loss[loss=0.2439, simple_loss=0.3117, pruned_loss=0.08808, over 21677.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3185, pruned_loss=0.0842, over 4278896.18 frames. ], batch size: 263, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:25:45,151 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.34 vs. limit=15.0 2023-06-22 09:25:47,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1195062.0, ans=0.125 2023-06-22 09:26:59,831 INFO [train.py:996] (2/4) Epoch 7, batch 16250, loss[loss=0.2086, simple_loss=0.2779, pruned_loss=0.06963, over 21737.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3176, pruned_loss=0.08448, over 4284341.81 frames. ], batch size: 282, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:27:00,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1195302.0, ans=0.1 2023-06-22 09:27:10,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1195302.0, ans=0.1 2023-06-22 09:27:13,179 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:27:31,790 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.539e+02 3.018e+02 3.500e+02 4.433e+02 8.732e+02, threshold=7.000e+02, percent-clipped=2.0 2023-06-22 09:28:40,757 INFO [train.py:996] (2/4) Epoch 7, batch 16300, loss[loss=0.1998, simple_loss=0.2884, pruned_loss=0.05564, over 21767.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3112, pruned_loss=0.08083, over 4276473.02 frames. 
], batch size: 282, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:28:41,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1195602.0, ans=0.1 2023-06-22 09:28:45,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1195602.0, ans=0.1 2023-06-22 09:29:36,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1195722.0, ans=0.0 2023-06-22 09:29:48,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1195782.0, ans=0.125 2023-06-22 09:29:51,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1195782.0, ans=0.125 2023-06-22 09:29:51,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1195782.0, ans=0.2 2023-06-22 09:30:09,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1195842.0, ans=0.0 2023-06-22 09:30:24,313 INFO [train.py:996] (2/4) Epoch 7, batch 16350, loss[loss=0.2565, simple_loss=0.327, pruned_loss=0.09297, over 21361.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3127, pruned_loss=0.08212, over 4273921.62 frames. ], batch size: 176, lr: 4.30e-03, grad_scale: 16.0 2023-06-22 09:30:50,233 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=12.0 2023-06-22 09:31:06,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.307e+02 4.109e+02 5.521e+02 1.139e+03, threshold=8.218e+02, percent-clipped=11.0 2023-06-22 09:31:13,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.31 vs. limit=22.5 2023-06-22 09:32:07,008 INFO [train.py:996] (2/4) Epoch 7, batch 16400, loss[loss=0.2337, simple_loss=0.3262, pruned_loss=0.07058, over 20759.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.315, pruned_loss=0.0833, over 4281986.15 frames. ], batch size: 607, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:33:24,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1196382.0, ans=0.1 2023-06-22 09:33:38,980 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-22 09:33:47,413 INFO [train.py:996] (2/4) Epoch 7, batch 16450, loss[loss=0.2424, simple_loss=0.3133, pruned_loss=0.08571, over 21866.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3163, pruned_loss=0.08498, over 4287534.72 frames. 
], batch size: 351, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:33:54,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1196502.0, ans=0.125 2023-06-22 09:34:07,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1196562.0, ans=0.2 2023-06-22 09:34:28,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1196562.0, ans=0.125 2023-06-22 09:34:29,352 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.529e+02 3.059e+02 3.522e+02 4.400e+02 7.364e+02, threshold=7.044e+02, percent-clipped=0.0 2023-06-22 09:34:44,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1196622.0, ans=0.1 2023-06-22 09:34:47,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1196622.0, ans=0.0 2023-06-22 09:35:28,322 INFO [train.py:996] (2/4) Epoch 7, batch 16500, loss[loss=0.293, simple_loss=0.3653, pruned_loss=0.1104, over 21515.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.317, pruned_loss=0.08514, over 4273907.11 frames. ], batch size: 508, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:36:16,872 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.53 vs. limit=15.0 2023-06-22 09:36:25,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1196922.0, ans=0.0 2023-06-22 09:36:25,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1196922.0, ans=0.125 2023-06-22 09:36:35,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1196922.0, ans=0.125 2023-06-22 09:36:49,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1196982.0, ans=0.125 2023-06-22 09:36:51,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.77 vs. limit=6.0 2023-06-22 09:37:16,066 INFO [train.py:996] (2/4) Epoch 7, batch 16550, loss[loss=0.2345, simple_loss=0.3105, pruned_loss=0.07923, over 21460.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3134, pruned_loss=0.08218, over 4273288.45 frames. 
], batch size: 194, lr: 4.30e-03, grad_scale: 32.0 2023-06-22 09:37:24,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1197102.0, ans=0.1 2023-06-22 09:37:53,812 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.487e+02 3.830e+02 4.900e+02 6.619e+02 1.240e+03, threshold=9.800e+02, percent-clipped=18.0 2023-06-22 09:38:08,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1197222.0, ans=0.125 2023-06-22 09:38:10,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1197222.0, ans=0.025 2023-06-22 09:38:17,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1197222.0, ans=0.125 2023-06-22 09:38:21,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1197282.0, ans=0.125 2023-06-22 09:38:45,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-22 09:39:08,760 INFO [train.py:996] (2/4) Epoch 7, batch 16600, loss[loss=0.2554, simple_loss=0.343, pruned_loss=0.08388, over 21310.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3246, pruned_loss=0.08607, over 4272157.22 frames. ], batch size: 176, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:39:13,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1197402.0, ans=0.125 2023-06-22 09:40:04,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-22 09:40:11,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1197582.0, ans=0.125 2023-06-22 09:40:50,870 INFO [train.py:996] (2/4) Epoch 7, batch 16650, loss[loss=0.347, simple_loss=0.4028, pruned_loss=0.1456, over 21449.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3351, pruned_loss=0.0894, over 4276018.07 frames. ], batch size: 471, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:41:06,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1197702.0, ans=0.125 2023-06-22 09:41:26,156 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.702e+02 3.520e+02 3.910e+02 4.811e+02 1.011e+03, threshold=7.820e+02, percent-clipped=1.0 2023-06-22 09:42:15,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1197882.0, ans=0.1 2023-06-22 09:42:39,963 INFO [train.py:996] (2/4) Epoch 7, batch 16700, loss[loss=0.2113, simple_loss=0.2763, pruned_loss=0.0732, over 21489.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3358, pruned_loss=0.09031, over 4273004.78 frames. 
], batch size: 211, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:42:50,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1198002.0, ans=0.0 2023-06-22 09:43:20,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1198122.0, ans=0.125 2023-06-22 09:44:24,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1198242.0, ans=0.125 2023-06-22 09:44:27,968 INFO [train.py:996] (2/4) Epoch 7, batch 16750, loss[loss=0.236, simple_loss=0.2997, pruned_loss=0.08612, over 19908.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3372, pruned_loss=0.09225, over 4269311.20 frames. ], batch size: 702, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:45:09,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.580e+02 3.471e+02 3.936e+02 4.958e+02 1.171e+03, threshold=7.873e+02, percent-clipped=3.0 2023-06-22 09:45:39,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1198482.0, ans=0.125 2023-06-22 09:45:41,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1198482.0, ans=0.125 2023-06-22 09:46:11,332 INFO [train.py:996] (2/4) Epoch 7, batch 16800, loss[loss=0.2783, simple_loss=0.3461, pruned_loss=0.1052, over 21854.00 frames. ], tot_loss[loss=0.2613, simple_loss=0.3393, pruned_loss=0.0917, over 4272431.57 frames. ], batch size: 371, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:46:29,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=22.5 2023-06-22 09:46:37,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1198662.0, ans=0.1 2023-06-22 09:47:23,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=12.0 2023-06-22 09:47:40,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1198842.0, ans=0.125 2023-06-22 09:47:51,157 INFO [train.py:996] (2/4) Epoch 7, batch 16850, loss[loss=0.2608, simple_loss=0.3342, pruned_loss=0.09367, over 21846.00 frames. ], tot_loss[loss=0.2588, simple_loss=0.335, pruned_loss=0.09126, over 4283260.39 frames. ], batch size: 124, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:48:16,927 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:48:29,635 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 3.467e+02 4.300e+02 5.663e+02 1.182e+03, threshold=8.599e+02, percent-clipped=7.0 2023-06-22 09:48:47,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1199022.0, ans=0.2 2023-06-22 09:49:26,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=22.5 2023-06-22 09:49:30,243 INFO [train.py:996] (2/4) Epoch 7, batch 16900, loss[loss=0.195, simple_loss=0.2663, pruned_loss=0.06185, over 21501.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3288, pruned_loss=0.08976, over 4288128.79 frames. 
], batch size: 212, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 09:49:45,706 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:50:37,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1199382.0, ans=0.0 2023-06-22 09:50:52,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1199442.0, ans=0.125 2023-06-22 09:50:55,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1199442.0, ans=0.0 2023-06-22 09:51:05,718 INFO [train.py:996] (2/4) Epoch 7, batch 16950, loss[loss=0.2783, simple_loss=0.3236, pruned_loss=0.1165, over 21777.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3218, pruned_loss=0.08816, over 4282146.59 frames. ], batch size: 508, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:51:22,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-22 09:51:24,429 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-22 09:51:30,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1199562.0, ans=0.05 2023-06-22 09:51:45,903 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 2.915e+02 3.202e+02 3.763e+02 5.382e+02, threshold=6.404e+02, percent-clipped=0.0 2023-06-22 09:52:21,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1199682.0, ans=0.125 2023-06-22 09:52:50,005 INFO [train.py:996] (2/4) Epoch 7, batch 17000, loss[loss=0.239, simple_loss=0.3107, pruned_loss=0.08368, over 21885.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3188, pruned_loss=0.08866, over 4292188.11 frames. ], batch size: 332, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:52:58,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1199802.0, ans=0.125 2023-06-22 09:53:00,773 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-22 09:53:28,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1199922.0, ans=0.2 2023-06-22 09:53:46,302 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 09:53:52,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1199982.0, ans=0.0 2023-06-22 09:54:36,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-22 09:54:38,313 INFO [train.py:996] (2/4) Epoch 7, batch 17050, loss[loss=0.2645, simple_loss=0.3368, pruned_loss=0.09612, over 21206.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3254, pruned_loss=0.09112, over 4289096.60 frames. 
], batch size: 143, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:55:08,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.478e+02 3.382e+02 4.158e+02 4.859e+02 8.252e+02, threshold=8.317e+02, percent-clipped=8.0 2023-06-22 09:55:17,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1200222.0, ans=0.2 2023-06-22 09:56:17,461 INFO [train.py:996] (2/4) Epoch 7, batch 17100, loss[loss=0.2514, simple_loss=0.3112, pruned_loss=0.09583, over 21856.00 frames. ], tot_loss[loss=0.2542, simple_loss=0.3255, pruned_loss=0.09144, over 4289026.25 frames. ], batch size: 298, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:57:04,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.77 vs. limit=5.0 2023-06-22 09:57:08,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1200582.0, ans=0.0 2023-06-22 09:57:33,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1200642.0, ans=0.125 2023-06-22 09:57:46,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=22.5 2023-06-22 09:57:50,015 INFO [train.py:996] (2/4) Epoch 7, batch 17150, loss[loss=0.193, simple_loss=0.2773, pruned_loss=0.05437, over 21807.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3209, pruned_loss=0.09062, over 4292981.48 frames. ], batch size: 351, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 09:57:57,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1200702.0, ans=0.125 2023-06-22 09:58:30,912 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 3.032e+02 3.543e+02 4.123e+02 6.537e+02, threshold=7.086e+02, percent-clipped=0.0 2023-06-22 09:58:52,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1200882.0, ans=0.125 2023-06-22 09:59:11,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1200882.0, ans=0.0 2023-06-22 09:59:11,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1200882.0, ans=0.2 2023-06-22 09:59:32,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1200942.0, ans=0.125 2023-06-22 09:59:32,692 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-22 09:59:37,091 INFO [train.py:996] (2/4) Epoch 7, batch 17200, loss[loss=0.3036, simple_loss=0.362, pruned_loss=0.1226, over 21415.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.321, pruned_loss=0.09031, over 4293851.69 frames. ], batch size: 471, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:00:14,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1201122.0, ans=0.035 2023-06-22 10:00:29,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.96 vs. 
limit=15.0 2023-06-22 10:00:46,946 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:01:20,171 INFO [train.py:996] (2/4) Epoch 7, batch 17250, loss[loss=0.311, simple_loss=0.3738, pruned_loss=0.1241, over 21335.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3242, pruned_loss=0.09229, over 4294319.60 frames. ], batch size: 549, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:01:38,963 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:02:01,137 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.703e+02 3.318e+02 3.860e+02 4.888e+02 8.680e+02, threshold=7.720e+02, percent-clipped=6.0 2023-06-22 10:02:26,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-22 10:02:34,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1201482.0, ans=0.2 2023-06-22 10:03:04,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1201542.0, ans=0.125 2023-06-22 10:03:07,125 INFO [train.py:996] (2/4) Epoch 7, batch 17300, loss[loss=0.3116, simple_loss=0.377, pruned_loss=0.1231, over 21438.00 frames. ], tot_loss[loss=0.2624, simple_loss=0.333, pruned_loss=0.0959, over 4288563.62 frames. ], batch size: 131, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:03:07,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1201602.0, ans=0.0 2023-06-22 10:03:13,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1201602.0, ans=0.125 2023-06-22 10:03:42,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.18 vs. limit=15.0 2023-06-22 10:04:32,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1201842.0, ans=0.0 2023-06-22 10:04:41,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1201842.0, ans=0.1 2023-06-22 10:04:50,680 INFO [train.py:996] (2/4) Epoch 7, batch 17350, loss[loss=0.2413, simple_loss=0.338, pruned_loss=0.07225, over 19835.00 frames. ], tot_loss[loss=0.2608, simple_loss=0.3326, pruned_loss=0.09455, over 4281850.26 frames. ], batch size: 702, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:05:36,384 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.363e+02 3.779e+02 4.471e+02 7.201e+02, threshold=7.558e+02, percent-clipped=0.0 2023-06-22 10:05:51,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1202022.0, ans=0.125 2023-06-22 10:05:56,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1202082.0, ans=0.1 2023-06-22 10:06:37,391 INFO [train.py:996] (2/4) Epoch 7, batch 17400, loss[loss=0.2291, simple_loss=0.3097, pruned_loss=0.07425, over 21825.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3281, pruned_loss=0.09066, over 4278547.19 frames. 
], batch size: 316, lr: 4.29e-03, grad_scale: 32.0 2023-06-22 10:07:24,205 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.75 vs. limit=10.0 2023-06-22 10:07:32,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1202322.0, ans=0.015 2023-06-22 10:08:13,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1202442.0, ans=0.0 2023-06-22 10:08:24,849 INFO [train.py:996] (2/4) Epoch 7, batch 17450, loss[loss=0.2115, simple_loss=0.3136, pruned_loss=0.05472, over 21583.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3216, pruned_loss=0.08743, over 4271788.06 frames. ], batch size: 389, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 10:08:37,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1202502.0, ans=0.2 2023-06-22 10:09:02,630 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.989e+02 3.174e+02 3.775e+02 5.488e+02 9.226e+02, threshold=7.551e+02, percent-clipped=5.0 2023-06-22 10:09:11,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1202622.0, ans=0.125 2023-06-22 10:09:30,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-22 10:10:06,309 INFO [train.py:996] (2/4) Epoch 7, batch 17500, loss[loss=0.2718, simple_loss=0.3276, pruned_loss=0.1081, over 21319.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3176, pruned_loss=0.08522, over 4273461.51 frames. ], batch size: 159, lr: 4.29e-03, grad_scale: 16.0 2023-06-22 10:10:14,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1202802.0, ans=0.125 2023-06-22 10:10:22,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1202862.0, ans=0.2 2023-06-22 10:10:35,341 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:10:39,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1202922.0, ans=0.0 2023-06-22 10:10:40,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1202922.0, ans=0.1 2023-06-22 10:10:40,894 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. 
limit=15.0 2023-06-22 10:10:41,764 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:10:41,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1202922.0, ans=0.125 2023-06-22 10:10:41,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1202922.0, ans=0.04949747468305833 2023-06-22 10:10:54,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1202922.0, ans=0.125 2023-06-22 10:11:41,903 INFO [train.py:996] (2/4) Epoch 7, batch 17550, loss[loss=0.2463, simple_loss=0.3269, pruned_loss=0.08282, over 21282.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3184, pruned_loss=0.08436, over 4268466.57 frames. ], batch size: 143, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:11:59,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-22 10:12:02,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-22 10:12:14,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.287e+02 2.829e+02 3.350e+02 3.891e+02 7.522e+02, threshold=6.700e+02, percent-clipped=0.0 2023-06-22 10:12:21,940 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.36 vs. limit=22.5 2023-06-22 10:12:44,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1203282.0, ans=0.1 2023-06-22 10:12:52,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1203282.0, ans=10.0 2023-06-22 10:13:22,585 INFO [train.py:996] (2/4) Epoch 7, batch 17600, loss[loss=0.2655, simple_loss=0.3417, pruned_loss=0.0947, over 21187.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3205, pruned_loss=0.08461, over 4259930.23 frames. ], batch size: 143, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:13:27,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1203402.0, ans=0.125 2023-06-22 10:14:44,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1203642.0, ans=0.1 2023-06-22 10:15:03,732 INFO [train.py:996] (2/4) Epoch 7, batch 17650, loss[loss=0.2909, simple_loss=0.3534, pruned_loss=0.1142, over 21533.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.319, pruned_loss=0.08487, over 4250862.70 frames. ], batch size: 509, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:15:28,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1203762.0, ans=0.1 2023-06-22 10:15:36,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.660e+02 3.234e+02 3.859e+02 4.407e+02 8.519e+02, threshold=7.719e+02, percent-clipped=7.0 2023-06-22 10:16:46,346 INFO [train.py:996] (2/4) Epoch 7, batch 17700, loss[loss=0.2745, simple_loss=0.3579, pruned_loss=0.0956, over 21574.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3112, pruned_loss=0.08071, over 4247166.29 frames. 
], batch size: 414, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:17:17,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=10.0 2023-06-22 10:17:28,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1204122.0, ans=10.0 2023-06-22 10:17:31,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.73 vs. limit=15.0 2023-06-22 10:17:41,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1204122.0, ans=0.0 2023-06-22 10:17:49,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1204182.0, ans=0.125 2023-06-22 10:18:20,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1204242.0, ans=0.125 2023-06-22 10:18:20,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1204242.0, ans=0.1 2023-06-22 10:18:24,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1204242.0, ans=0.125 2023-06-22 10:18:29,535 INFO [train.py:996] (2/4) Epoch 7, batch 17750, loss[loss=0.2219, simple_loss=0.3033, pruned_loss=0.07026, over 19983.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3198, pruned_loss=0.08461, over 4254773.50 frames. ], batch size: 703, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:18:32,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-22 10:18:40,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.63 vs. limit=12.0 2023-06-22 10:18:52,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=12.0 2023-06-22 10:19:13,746 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.471e+02 3.318e+02 4.087e+02 5.384e+02 1.002e+03, threshold=8.174e+02, percent-clipped=10.0 2023-06-22 10:19:34,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.88 vs. limit=22.5 2023-06-22 10:19:36,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1204482.0, ans=0.0 2023-06-22 10:19:40,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1204482.0, ans=0.1 2023-06-22 10:19:59,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1204542.0, ans=0.125 2023-06-22 10:20:11,951 INFO [train.py:996] (2/4) Epoch 7, batch 17800, loss[loss=0.2084, simple_loss=0.2793, pruned_loss=0.0688, over 21278.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3207, pruned_loss=0.08461, over 4262985.88 frames. 
], batch size: 159, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:20:49,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1204662.0, ans=0.0 2023-06-22 10:20:59,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1204722.0, ans=0.125 2023-06-22 10:21:30,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1204782.0, ans=0.07 2023-06-22 10:21:49,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1204842.0, ans=0.1 2023-06-22 10:21:55,060 INFO [train.py:996] (2/4) Epoch 7, batch 17850, loss[loss=0.2529, simple_loss=0.3684, pruned_loss=0.06872, over 20704.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3215, pruned_loss=0.08523, over 4265402.08 frames. ], batch size: 607, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:22:18,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1204962.0, ans=0.2 2023-06-22 10:22:39,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1204962.0, ans=0.125 2023-06-22 10:22:45,756 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.428e+02 3.209e+02 3.990e+02 4.443e+02 8.332e+02, threshold=7.980e+02, percent-clipped=3.0 2023-06-22 10:23:17,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1205142.0, ans=0.0 2023-06-22 10:23:38,646 INFO [train.py:996] (2/4) Epoch 7, batch 17900, loss[loss=0.2305, simple_loss=0.3214, pruned_loss=0.06984, over 21435.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3257, pruned_loss=0.08583, over 4261455.29 frames. ], batch size: 194, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:24:24,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1205322.0, ans=0.125 2023-06-22 10:24:27,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.99 vs. limit=22.5 2023-06-22 10:25:08,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1205442.0, ans=0.125 2023-06-22 10:25:24,778 INFO [train.py:996] (2/4) Epoch 7, batch 17950, loss[loss=0.2093, simple_loss=0.3031, pruned_loss=0.05773, over 21760.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3235, pruned_loss=0.08233, over 4257436.88 frames. 
], batch size: 332, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:25:54,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1205562.0, ans=0.0 2023-06-22 10:25:54,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1205562.0, ans=0.125 2023-06-22 10:26:05,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1205622.0, ans=0.09899494936611666 2023-06-22 10:26:08,219 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.466e+02 3.180e+02 3.649e+02 4.821e+02 7.234e+02, threshold=7.298e+02, percent-clipped=0.0 2023-06-22 10:26:08,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1205622.0, ans=0.2 2023-06-22 10:26:08,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1205622.0, ans=0.1 2023-06-22 10:26:12,612 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.02 vs. limit=15.0 2023-06-22 10:27:10,975 INFO [train.py:996] (2/4) Epoch 7, batch 18000, loss[loss=0.1953, simple_loss=0.2549, pruned_loss=0.06781, over 21120.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3179, pruned_loss=0.08159, over 4258052.12 frames. ], batch size: 548, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:27:10,976 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 10:27:29,140 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.5.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.5577, 1.7420, 3.6661, 2.3562], device='cuda:2') 2023-06-22 10:27:30,139 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.265, simple_loss=0.3646, pruned_loss=0.08269, over 1796401.00 frames. 2023-06-22 10:27:30,140 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 10:27:55,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.57 vs. limit=15.0 2023-06-22 10:27:57,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1205862.0, ans=0.125 2023-06-22 10:28:18,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1205922.0, ans=0.125 2023-06-22 10:28:40,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1205982.0, ans=0.125 2023-06-22 10:28:59,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.57 vs. limit=6.0 2023-06-22 10:29:12,729 INFO [train.py:996] (2/4) Epoch 7, batch 18050, loss[loss=0.2426, simple_loss=0.3028, pruned_loss=0.09124, over 21292.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3137, pruned_loss=0.08135, over 4265126.16 frames. 
], batch size: 471, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:29:52,825 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.553e+02 3.561e+02 4.207e+02 5.144e+02 1.104e+03, threshold=8.414e+02, percent-clipped=10.0 2023-06-22 10:30:50,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1206342.0, ans=0.09899494936611666 2023-06-22 10:30:55,000 INFO [train.py:996] (2/4) Epoch 7, batch 18100, loss[loss=0.2298, simple_loss=0.3202, pruned_loss=0.06971, over 21251.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3188, pruned_loss=0.08417, over 4261145.32 frames. ], batch size: 143, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:30:58,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1206402.0, ans=0.125 2023-06-22 10:31:33,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1206522.0, ans=0.1 2023-06-22 10:32:21,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1206642.0, ans=0.125 2023-06-22 10:32:29,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1206642.0, ans=0.0 2023-06-22 10:32:35,096 INFO [train.py:996] (2/4) Epoch 7, batch 18150, loss[loss=0.2219, simple_loss=0.2874, pruned_loss=0.07819, over 21458.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3197, pruned_loss=0.0837, over 4254357.57 frames. ], batch size: 212, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:33:15,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.590e+02 3.134e+02 3.517e+02 4.943e+02 8.965e+02, threshold=7.034e+02, percent-clipped=1.0 2023-06-22 10:33:37,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-22 10:34:13,169 INFO [train.py:996] (2/4) Epoch 7, batch 18200, loss[loss=0.2134, simple_loss=0.2831, pruned_loss=0.0719, over 21467.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3143, pruned_loss=0.08343, over 4245199.73 frames. ], batch size: 211, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:34:26,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1207002.0, ans=0.125 2023-06-22 10:35:04,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1207182.0, ans=0.125 2023-06-22 10:35:08,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-22 10:35:42,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1207242.0, ans=0.2 2023-06-22 10:35:44,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1207242.0, ans=0.125 2023-06-22 10:35:50,297 INFO [train.py:996] (2/4) Epoch 7, batch 18250, loss[loss=0.2723, simple_loss=0.3548, pruned_loss=0.0949, over 19901.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3073, pruned_loss=0.08128, over 4246515.82 frames. 
], batch size: 702, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:36:21,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_na.min_abs, batch_count=1207422.0, ans=0.02 2023-06-22 10:36:25,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.022e+02 3.178e+02 4.108e+02 6.214e+02 1.567e+03, threshold=8.215e+02, percent-clipped=16.0 2023-06-22 10:37:02,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1207542.0, ans=0.2 2023-06-22 10:37:21,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.35 vs. limit=22.5 2023-06-22 10:37:29,321 INFO [train.py:996] (2/4) Epoch 7, batch 18300, loss[loss=0.2505, simple_loss=0.3608, pruned_loss=0.07011, over 21673.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.306, pruned_loss=0.08083, over 4250823.98 frames. ], batch size: 389, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:39:08,693 INFO [train.py:996] (2/4) Epoch 7, batch 18350, loss[loss=0.1838, simple_loss=0.2586, pruned_loss=0.05447, over 17129.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3126, pruned_loss=0.08104, over 4254004.76 frames. ], batch size: 65, lr: 4.28e-03, grad_scale: 16.0 2023-06-22 10:39:21,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1207902.0, ans=0.125 2023-06-22 10:39:26,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1207962.0, ans=0.0 2023-06-22 10:39:43,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.350e+02 3.179e+02 3.735e+02 4.992e+02 1.231e+03, threshold=7.469e+02, percent-clipped=7.0 2023-06-22 10:39:48,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.78 vs. limit=12.0 2023-06-22 10:40:06,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1208082.0, ans=0.2 2023-06-22 10:40:49,851 INFO [train.py:996] (2/4) Epoch 7, batch 18400, loss[loss=0.198, simple_loss=0.2851, pruned_loss=0.05541, over 21751.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3076, pruned_loss=0.07965, over 4253138.69 frames. ], batch size: 333, lr: 4.28e-03, grad_scale: 32.0 2023-06-22 10:40:55,257 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:41:01,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1208202.0, ans=0.125 2023-06-22 10:41:24,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1208322.0, ans=0.04949747468305833 2023-06-22 10:41:26,270 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. 
limit=15.0 2023-06-22 10:41:27,310 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:41:52,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1208382.0, ans=0.1 2023-06-22 10:42:07,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1208442.0, ans=0.125 2023-06-22 10:42:29,236 INFO [train.py:996] (2/4) Epoch 7, batch 18450, loss[loss=0.1892, simple_loss=0.2808, pruned_loss=0.04883, over 21717.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3037, pruned_loss=0.07559, over 4249238.74 frames. ], batch size: 298, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:42:55,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1208562.0, ans=0.0 2023-06-22 10:43:04,258 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.358e+02 3.170e+02 3.772e+02 5.072e+02 1.044e+03, threshold=7.545e+02, percent-clipped=1.0 2023-06-22 10:43:24,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1208682.0, ans=0.07 2023-06-22 10:43:50,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1208742.0, ans=0.125 2023-06-22 10:44:06,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5 2023-06-22 10:44:08,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1208802.0, ans=0.2 2023-06-22 10:44:09,103 INFO [train.py:996] (2/4) Epoch 7, batch 18500, loss[loss=0.2373, simple_loss=0.2894, pruned_loss=0.09259, over 21904.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2997, pruned_loss=0.0749, over 4251675.95 frames. ], batch size: 98, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:44:24,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1208862.0, ans=0.0 2023-06-22 10:45:27,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.21 vs. limit=10.0 2023-06-22 10:45:50,051 INFO [train.py:996] (2/4) Epoch 7, batch 18550, loss[loss=0.2009, simple_loss=0.2838, pruned_loss=0.05896, over 21679.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2993, pruned_loss=0.07457, over 4250994.00 frames. 
], batch size: 298, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:46:19,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1209162.0, ans=0.1 2023-06-22 10:46:20,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1209222.0, ans=0.2 2023-06-22 10:46:32,518 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.076e+02 3.124e+02 3.693e+02 4.756e+02 1.140e+03, threshold=7.385e+02, percent-clipped=12.0 2023-06-22 10:46:34,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1209222.0, ans=0.0 2023-06-22 10:46:42,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1209282.0, ans=0.0 2023-06-22 10:46:52,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-22 10:47:10,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209342.0, ans=0.1 2023-06-22 10:47:27,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1209342.0, ans=0.0 2023-06-22 10:47:30,137 INFO [train.py:996] (2/4) Epoch 7, batch 18600, loss[loss=0.2076, simple_loss=0.2688, pruned_loss=0.07322, over 21202.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2967, pruned_loss=0.07511, over 4243045.47 frames. ], batch size: 176, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:47:43,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1209402.0, ans=0.125 2023-06-22 10:47:54,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1209462.0, ans=0.0 2023-06-22 10:48:00,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1209522.0, ans=0.1 2023-06-22 10:48:02,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1209522.0, ans=10.0 2023-06-22 10:48:03,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1209522.0, ans=0.0 2023-06-22 10:49:09,430 INFO [train.py:996] (2/4) Epoch 7, batch 18650, loss[loss=0.242, simple_loss=0.3008, pruned_loss=0.09165, over 21879.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2963, pruned_loss=0.07503, over 4248539.73 frames. 
], batch size: 107, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:49:42,746 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:49:45,394 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.122e+02 3.160e+02 3.578e+02 4.366e+02 8.700e+02, threshold=7.156e+02, percent-clipped=2.0 2023-06-22 10:50:07,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1209882.0, ans=0.1 2023-06-22 10:50:17,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1209882.0, ans=0.125 2023-06-22 10:50:43,591 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.34 vs. limit=15.0 2023-06-22 10:50:47,140 INFO [train.py:996] (2/4) Epoch 7, batch 18700, loss[loss=0.2175, simple_loss=0.2796, pruned_loss=0.07765, over 21603.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2942, pruned_loss=0.07635, over 4260905.72 frames. ], batch size: 263, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:50:55,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1210002.0, ans=0.125 2023-06-22 10:51:08,704 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:51:08,779 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 10:51:28,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1210122.0, ans=0.125 2023-06-22 10:51:50,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1210182.0, ans=0.125 2023-06-22 10:52:26,870 INFO [train.py:996] (2/4) Epoch 7, batch 18750, loss[loss=0.2257, simple_loss=0.2857, pruned_loss=0.08284, over 21595.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2962, pruned_loss=0.07904, over 4272420.02 frames. ], batch size: 548, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 10:52:27,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1210302.0, ans=0.1 2023-06-22 10:53:03,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.372e+02 3.195e+02 3.885e+02 4.969e+02 1.061e+03, threshold=7.770e+02, percent-clipped=4.0 2023-06-22 10:53:12,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1210422.0, ans=0.125 2023-06-22 10:53:22,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1210482.0, ans=0.0 2023-06-22 10:54:01,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1210542.0, ans=0.2 2023-06-22 10:54:05,684 INFO [train.py:996] (2/4) Epoch 7, batch 18800, loss[loss=0.2603, simple_loss=0.3616, pruned_loss=0.07948, over 20821.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.302, pruned_loss=0.08039, over 4273664.33 frames. ], batch size: 608, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:55:44,296 INFO [train.py:996] (2/4) Epoch 7, batch 18850, loss[loss=0.2064, simple_loss=0.273, pruned_loss=0.06985, over 21524.00 frames. 
], tot_loss[loss=0.2249, simple_loss=0.2981, pruned_loss=0.07589, over 4263908.41 frames. ], batch size: 195, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:55:50,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1210902.0, ans=0.1 2023-06-22 10:56:00,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1210962.0, ans=0.1 2023-06-22 10:56:16,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1211022.0, ans=0.125 2023-06-22 10:56:20,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1211022.0, ans=0.0 2023-06-22 10:56:21,054 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.825e+02 3.160e+02 3.995e+02 5.299e+02 8.301e+02, threshold=7.991e+02, percent-clipped=3.0 2023-06-22 10:56:29,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1211022.0, ans=0.1 2023-06-22 10:56:59,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1211082.0, ans=0.1 2023-06-22 10:57:24,809 INFO [train.py:996] (2/4) Epoch 7, batch 18900, loss[loss=0.1988, simple_loss=0.2597, pruned_loss=0.06891, over 21542.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2961, pruned_loss=0.0763, over 4265045.54 frames. ], batch size: 231, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:57:37,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1211202.0, ans=0.125 2023-06-22 10:57:44,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1211262.0, ans=0.125 2023-06-22 10:59:00,567 INFO [train.py:996] (2/4) Epoch 7, batch 18950, loss[loss=0.256, simple_loss=0.321, pruned_loss=0.09555, over 21821.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.2987, pruned_loss=0.07961, over 4277906.78 frames. ], batch size: 124, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 10:59:38,374 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.670e+02 3.300e+02 3.868e+02 4.844e+02 6.994e+02, threshold=7.736e+02, percent-clipped=0.0 2023-06-22 10:59:56,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1211622.0, ans=0.0 2023-06-22 11:00:16,043 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.39 vs. limit=15.0 2023-06-22 11:00:28,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1211742.0, ans=0.125 2023-06-22 11:00:38,045 INFO [train.py:996] (2/4) Epoch 7, batch 19000, loss[loss=0.2668, simple_loss=0.3486, pruned_loss=0.09246, over 21880.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3091, pruned_loss=0.08168, over 4281067.46 frames. ], batch size: 372, lr: 4.27e-03, grad_scale: 32.0 2023-06-22 11:02:07,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1212042.0, ans=0.125 2023-06-22 11:02:18,608 INFO [train.py:996] (2/4) Epoch 7, batch 19050, loss[loss=0.241, simple_loss=0.3039, pruned_loss=0.0891, over 21660.00 frames. 
], tot_loss[loss=0.2445, simple_loss=0.3162, pruned_loss=0.08643, over 4288838.56 frames. ], batch size: 263, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:02:19,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1212102.0, ans=0.025 2023-06-22 11:03:06,385 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.394e+02 3.274e+02 3.680e+02 4.051e+02 6.947e+02, threshold=7.360e+02, percent-clipped=0.0 2023-06-22 11:03:57,454 INFO [train.py:996] (2/4) Epoch 7, batch 19100, loss[loss=0.2734, simple_loss=0.3266, pruned_loss=0.1101, over 21486.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3139, pruned_loss=0.08748, over 4291612.38 frames. ], batch size: 548, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:04:08,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.27 vs. limit=12.0 2023-06-22 11:04:19,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1212462.0, ans=0.1 2023-06-22 11:05:15,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1212582.0, ans=0.125 2023-06-22 11:05:40,899 INFO [train.py:996] (2/4) Epoch 7, batch 19150, loss[loss=0.2183, simple_loss=0.3078, pruned_loss=0.06436, over 21412.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3148, pruned_loss=0.08752, over 4282521.02 frames. ], batch size: 194, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:06:15,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=22.5 2023-06-22 11:06:30,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.93 vs. limit=15.0 2023-06-22 11:06:42,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.634e+02 3.652e+02 4.521e+02 6.039e+02 1.131e+03, threshold=9.042e+02, percent-clipped=10.0 2023-06-22 11:07:05,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-22 11:07:23,658 INFO [train.py:996] (2/4) Epoch 7, batch 19200, loss[loss=0.2862, simple_loss=0.4077, pruned_loss=0.08234, over 20728.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3261, pruned_loss=0.08888, over 4281935.43 frames. 
], batch size: 607, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:07:31,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1213002.0, ans=0.125 2023-06-22 11:08:30,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1213182.0, ans=0.0 2023-06-22 11:08:45,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1213242.0, ans=0.125 2023-06-22 11:08:47,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1213242.0, ans=0.125 2023-06-22 11:08:47,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1213242.0, ans=0.0 2023-06-22 11:08:57,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1213242.0, ans=0.0 2023-06-22 11:09:03,366 INFO [train.py:996] (2/4) Epoch 7, batch 19250, loss[loss=0.1757, simple_loss=0.2765, pruned_loss=0.03746, over 21726.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3252, pruned_loss=0.08308, over 4279583.16 frames. ], batch size: 298, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:10:04,466 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.971e+02 3.337e+02 4.181e+02 5.636e+02 1.044e+03, threshold=8.362e+02, percent-clipped=4.0 2023-06-22 11:10:29,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1213542.0, ans=0.125 2023-06-22 11:10:43,609 INFO [train.py:996] (2/4) Epoch 7, batch 19300, loss[loss=0.2334, simple_loss=0.2975, pruned_loss=0.08469, over 21471.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3209, pruned_loss=0.08086, over 4282680.99 frames. ], batch size: 194, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:11:21,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1213662.0, ans=0.0 2023-06-22 11:11:46,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1213782.0, ans=0.0 2023-06-22 11:11:55,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1213782.0, ans=0.125 2023-06-22 11:12:05,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1213842.0, ans=0.125 2023-06-22 11:12:07,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-22 11:12:31,343 INFO [train.py:996] (2/4) Epoch 7, batch 19350, loss[loss=0.1966, simple_loss=0.272, pruned_loss=0.06057, over 21434.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3136, pruned_loss=0.07671, over 4277388.12 frames. ], batch size: 195, lr: 4.27e-03, grad_scale: 16.0 2023-06-22 11:12:35,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.24 vs. 
limit=15.0 2023-06-22 11:13:20,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.163e+02 3.132e+02 3.696e+02 4.468e+02 9.223e+02, threshold=7.391e+02, percent-clipped=2.0 2023-06-22 11:13:29,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1214082.0, ans=0.125 2023-06-22 11:13:29,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1214082.0, ans=0.125 2023-06-22 11:13:36,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1214082.0, ans=0.0 2023-06-22 11:13:41,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.35 vs. limit=12.0 2023-06-22 11:13:59,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=22.5 2023-06-22 11:14:04,202 INFO [train.py:996] (2/4) Epoch 7, batch 19400, loss[loss=0.2008, simple_loss=0.2678, pruned_loss=0.06686, over 21207.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3105, pruned_loss=0.07599, over 4275445.49 frames. ], batch size: 143, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:14:06,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1214202.0, ans=0.125 2023-06-22 11:14:48,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1214262.0, ans=0.2 2023-06-22 11:14:56,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1214322.0, ans=10.0 2023-06-22 11:15:16,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1214382.0, ans=0.125 2023-06-22 11:15:43,331 INFO [train.py:996] (2/4) Epoch 7, batch 19450, loss[loss=0.2664, simple_loss=0.3194, pruned_loss=0.1067, over 21760.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3088, pruned_loss=0.07905, over 4279786.70 frames. ], batch size: 102, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:15:58,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1214502.0, ans=0.0 2023-06-22 11:16:04,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1214562.0, ans=0.1 2023-06-22 11:16:12,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. 
limit=15.0 2023-06-22 11:16:27,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1214562.0, ans=0.2 2023-06-22 11:16:37,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1214622.0, ans=0.125 2023-06-22 11:16:37,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1214622.0, ans=0.125 2023-06-22 11:16:38,382 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.537e+02 3.055e+02 3.774e+02 4.517e+02 1.086e+03, threshold=7.548e+02, percent-clipped=5.0 2023-06-22 11:16:44,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1214682.0, ans=0.0 2023-06-22 11:16:55,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.55 vs. limit=6.0 2023-06-22 11:17:23,329 INFO [train.py:996] (2/4) Epoch 7, batch 19500, loss[loss=0.2323, simple_loss=0.2761, pruned_loss=0.09424, over 20832.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3056, pruned_loss=0.08097, over 4274631.23 frames. ], batch size: 608, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:18:36,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.11 vs. limit=6.0 2023-06-22 11:18:53,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.76 vs. limit=12.0 2023-06-22 11:19:05,686 INFO [train.py:996] (2/4) Epoch 7, batch 19550, loss[loss=0.1832, simple_loss=0.2445, pruned_loss=0.06098, over 21232.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3032, pruned_loss=0.08012, over 4277684.61 frames. ], batch size: 159, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:19:10,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1215102.0, ans=0.1 2023-06-22 11:19:15,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1215102.0, ans=0.125 2023-06-22 11:19:17,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1215102.0, ans=0.0 2023-06-22 11:19:55,192 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.262e+02 3.073e+02 3.530e+02 4.388e+02 8.690e+02, threshold=7.059e+02, percent-clipped=2.0 2023-06-22 11:20:39,256 INFO [train.py:996] (2/4) Epoch 7, batch 19600, loss[loss=0.2802, simple_loss=0.341, pruned_loss=0.1097, over 21741.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3055, pruned_loss=0.08096, over 4279795.41 frames. 
], batch size: 389, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:21:17,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1215462.0, ans=0.125 2023-06-22 11:21:18,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1215462.0, ans=0.1 2023-06-22 11:21:56,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1215582.0, ans=0.0 2023-06-22 11:22:21,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1215702.0, ans=0.125 2023-06-22 11:22:22,651 INFO [train.py:996] (2/4) Epoch 7, batch 19650, loss[loss=0.2395, simple_loss=0.3125, pruned_loss=0.08324, over 21341.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3101, pruned_loss=0.08402, over 4279128.97 frames. ], batch size: 548, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:22:33,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1215702.0, ans=0.0 2023-06-22 11:22:46,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1215702.0, ans=0.1 2023-06-22 11:23:15,645 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.627e+02 4.077e+02 5.113e+02 8.180e+02, threshold=8.154e+02, percent-clipped=7.0 2023-06-22 11:23:41,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1215882.0, ans=0.125 2023-06-22 11:24:16,440 INFO [train.py:996] (2/4) Epoch 7, batch 19700, loss[loss=0.2506, simple_loss=0.3388, pruned_loss=0.08125, over 21655.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3146, pruned_loss=0.08546, over 4281659.33 frames. ], batch size: 414, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:25:49,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1216242.0, ans=0.0 2023-06-22 11:25:58,821 INFO [train.py:996] (2/4) Epoch 7, batch 19750, loss[loss=0.2636, simple_loss=0.3533, pruned_loss=0.08694, over 21775.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3239, pruned_loss=0.08592, over 4285249.23 frames. ], batch size: 332, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:26:08,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1216302.0, ans=0.5 2023-06-22 11:26:44,444 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.745e+02 3.729e+02 4.611e+02 5.991e+02 1.312e+03, threshold=9.223e+02, percent-clipped=7.0 2023-06-22 11:27:33,100 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2023-06-22 11:27:38,675 INFO [train.py:996] (2/4) Epoch 7, batch 19800, loss[loss=0.2365, simple_loss=0.3216, pruned_loss=0.07575, over 21538.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3234, pruned_loss=0.08625, over 4288035.66 frames. 
], batch size: 471, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:28:01,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1216662.0, ans=0.125 2023-06-22 11:29:21,342 INFO [train.py:996] (2/4) Epoch 7, batch 19850, loss[loss=0.1584, simple_loss=0.2316, pruned_loss=0.04259, over 21788.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.314, pruned_loss=0.08075, over 4291350.07 frames. ], batch size: 124, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:30:03,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1217022.0, ans=0.125 2023-06-22 11:30:12,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.250e+02 2.965e+02 3.588e+02 4.617e+02 1.028e+03, threshold=7.176e+02, percent-clipped=3.0 2023-06-22 11:31:00,257 INFO [train.py:996] (2/4) Epoch 7, batch 19900, loss[loss=0.2112, simple_loss=0.2761, pruned_loss=0.07312, over 21840.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3141, pruned_loss=0.0788, over 4283495.68 frames. ], batch size: 118, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:31:14,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-22 11:31:20,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1217262.0, ans=0.125 2023-06-22 11:32:07,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1217322.0, ans=0.125 2023-06-22 11:32:37,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-22 11:32:42,596 INFO [train.py:996] (2/4) Epoch 7, batch 19950, loss[loss=0.2205, simple_loss=0.2828, pruned_loss=0.0791, over 21618.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3082, pruned_loss=0.07906, over 4284310.89 frames. ], batch size: 298, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:32:47,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.62 vs. limit=15.0 2023-06-22 11:32:51,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1217502.0, ans=0.125 2023-06-22 11:33:09,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1217562.0, ans=0.1 2023-06-22 11:33:39,615 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.408e+02 3.305e+02 4.065e+02 5.440e+02 9.798e+02, threshold=8.130e+02, percent-clipped=10.0 2023-06-22 11:34:15,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1217742.0, ans=0.0 2023-06-22 11:34:22,891 INFO [train.py:996] (2/4) Epoch 7, batch 20000, loss[loss=0.2616, simple_loss=0.3383, pruned_loss=0.09243, over 21739.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3096, pruned_loss=0.07926, over 4285681.67 frames. 
], batch size: 282, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:35:49,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1218042.0, ans=0.125 2023-06-22 11:36:02,151 INFO [train.py:996] (2/4) Epoch 7, batch 20050, loss[loss=0.2188, simple_loss=0.2971, pruned_loss=0.07029, over 21930.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3103, pruned_loss=0.08083, over 4283127.31 frames. ], batch size: 316, lr: 4.26e-03, grad_scale: 32.0 2023-06-22 11:37:00,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.588e+02 3.182e+02 3.871e+02 4.475e+02 7.153e+02, threshold=7.741e+02, percent-clipped=0.0 2023-06-22 11:37:49,399 INFO [train.py:996] (2/4) Epoch 7, batch 20100, loss[loss=0.2292, simple_loss=0.3001, pruned_loss=0.07919, over 21431.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3133, pruned_loss=0.08309, over 4290072.15 frames. ], batch size: 211, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:37:51,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1218402.0, ans=0.5 2023-06-22 11:37:53,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1218402.0, ans=0.125 2023-06-22 11:38:33,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1218522.0, ans=0.1 2023-06-22 11:39:20,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1218642.0, ans=0.2 2023-06-22 11:39:26,422 INFO [train.py:996] (2/4) Epoch 7, batch 20150, loss[loss=0.2389, simple_loss=0.3098, pruned_loss=0.08399, over 21681.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.3242, pruned_loss=0.08657, over 4290861.90 frames. ], batch size: 263, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:40:28,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.601e+02 4.003e+02 4.787e+02 6.267e+02 1.040e+03, threshold=9.575e+02, percent-clipped=17.0 2023-06-22 11:41:15,542 INFO [train.py:996] (2/4) Epoch 7, batch 20200, loss[loss=0.3134, simple_loss=0.4085, pruned_loss=0.1091, over 21531.00 frames. ], tot_loss[loss=0.2577, simple_loss=0.333, pruned_loss=0.09121, over 4291922.43 frames. ], batch size: 471, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:42:17,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1219182.0, ans=0.2 2023-06-22 11:42:25,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1219182.0, ans=0.0 2023-06-22 11:42:49,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1219242.0, ans=0.0 2023-06-22 11:43:00,754 INFO [train.py:996] (2/4) Epoch 7, batch 20250, loss[loss=0.2887, simple_loss=0.3627, pruned_loss=0.1073, over 21547.00 frames. ], tot_loss[loss=0.2575, simple_loss=0.3344, pruned_loss=0.09024, over 4290291.66 frames. 
], batch size: 471, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:43:33,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1219362.0, ans=0.0 2023-06-22 11:43:54,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.400e+02 3.109e+02 3.852e+02 4.558e+02 1.289e+03, threshold=7.704e+02, percent-clipped=1.0 2023-06-22 11:44:40,568 INFO [train.py:996] (2/4) Epoch 7, batch 20300, loss[loss=0.2544, simple_loss=0.3495, pruned_loss=0.07968, over 21258.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3323, pruned_loss=0.08743, over 4286974.81 frames. ], batch size: 548, lr: 4.26e-03, grad_scale: 16.0 2023-06-22 11:45:00,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-22 11:45:23,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.56 vs. limit=15.0 2023-06-22 11:45:23,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1219722.0, ans=0.1 2023-06-22 11:45:56,765 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 11:46:14,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1219842.0, ans=0.0 2023-06-22 11:46:15,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1219842.0, ans=0.05 2023-06-22 11:46:18,487 INFO [train.py:996] (2/4) Epoch 7, batch 20350, loss[loss=0.2529, simple_loss=0.3247, pruned_loss=0.09056, over 20925.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.331, pruned_loss=0.0874, over 4276969.17 frames. ], batch size: 607, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:46:38,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1219962.0, ans=0.125 2023-06-22 11:46:39,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.01 vs. limit=15.0 2023-06-22 11:47:11,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.263e+02 3.199e+02 3.639e+02 4.659e+02 8.452e+02, threshold=7.278e+02, percent-clipped=1.0 2023-06-22 11:47:18,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1220082.0, ans=0.025 2023-06-22 11:47:21,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1220082.0, ans=0.04949747468305833 2023-06-22 11:47:58,706 INFO [train.py:996] (2/4) Epoch 7, batch 20400, loss[loss=0.3106, simple_loss=0.3801, pruned_loss=0.1205, over 21647.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3314, pruned_loss=0.08898, over 4263255.11 frames. ], batch size: 414, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 11:48:13,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1220202.0, ans=0.125 2023-06-22 11:48:25,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.84 vs. 
limit=15.0 2023-06-22 11:48:36,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1220322.0, ans=0.2 2023-06-22 11:48:38,787 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.90 vs. limit=15.0 2023-06-22 11:48:52,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1220322.0, ans=0.95 2023-06-22 11:49:43,959 INFO [train.py:996] (2/4) Epoch 7, batch 20450, loss[loss=0.2241, simple_loss=0.2914, pruned_loss=0.07835, over 21930.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3334, pruned_loss=0.09267, over 4261891.73 frames. ], batch size: 316, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 11:50:17,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1220622.0, ans=0.2 2023-06-22 11:50:30,935 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.680e+02 3.380e+02 3.850e+02 4.870e+02 7.513e+02, threshold=7.700e+02, percent-clipped=1.0 2023-06-22 11:51:16,543 INFO [train.py:996] (2/4) Epoch 7, batch 20500, loss[loss=0.2667, simple_loss=0.3121, pruned_loss=0.1106, over 21564.00 frames. ], tot_loss[loss=0.2565, simple_loss=0.328, pruned_loss=0.09249, over 4265404.33 frames. ], batch size: 508, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:51:28,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1220802.0, ans=0.0 2023-06-22 11:52:22,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=12.0 2023-06-22 11:52:34,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1221042.0, ans=0.0 2023-06-22 11:53:01,598 INFO [train.py:996] (2/4) Epoch 7, batch 20550, loss[loss=0.2561, simple_loss=0.346, pruned_loss=0.08304, over 21615.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3212, pruned_loss=0.09074, over 4257650.01 frames. ], batch size: 389, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:53:10,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1221102.0, ans=0.125 2023-06-22 11:53:51,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.346e+02 3.265e+02 4.144e+02 5.422e+02 9.318e+02, threshold=8.288e+02, percent-clipped=6.0 2023-06-22 11:54:08,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1221282.0, ans=0.0 2023-06-22 11:54:11,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1221282.0, ans=0.125 2023-06-22 11:54:38,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1221342.0, ans=10.0 2023-06-22 11:54:40,923 INFO [train.py:996] (2/4) Epoch 7, batch 20600, loss[loss=0.2392, simple_loss=0.3191, pruned_loss=0.07969, over 21229.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3202, pruned_loss=0.08782, over 4249532.85 frames. 
], batch size: 176, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:55:00,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1221462.0, ans=0.125 2023-06-22 11:55:06,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1221462.0, ans=0.125 2023-06-22 11:55:08,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.77 vs. limit=6.0 2023-06-22 11:55:46,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1221582.0, ans=0.0 2023-06-22 11:55:46,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1221582.0, ans=0.125 2023-06-22 11:56:19,425 INFO [train.py:996] (2/4) Epoch 7, batch 20650, loss[loss=0.23, simple_loss=0.2987, pruned_loss=0.08065, over 21241.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3173, pruned_loss=0.08795, over 4249456.34 frames. ], batch size: 143, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:56:21,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1221702.0, ans=0.025 2023-06-22 11:56:22,092 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-22 11:56:43,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-22 11:57:08,831 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.629e+02 3.307e+02 4.027e+02 4.834e+02 1.059e+03, threshold=8.054e+02, percent-clipped=3.0 2023-06-22 11:57:59,186 INFO [train.py:996] (2/4) Epoch 7, batch 20700, loss[loss=0.1658, simple_loss=0.2392, pruned_loss=0.04619, over 21326.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3096, pruned_loss=0.0841, over 4250099.01 frames. ], batch size: 176, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 11:58:03,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1222002.0, ans=0.0 2023-06-22 11:58:08,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1222002.0, ans=0.0 2023-06-22 11:58:35,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1222122.0, ans=0.125 2023-06-22 11:58:47,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1222122.0, ans=0.125 2023-06-22 11:59:09,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1222182.0, ans=0.1 2023-06-22 11:59:41,093 INFO [train.py:996] (2/4) Epoch 7, batch 20750, loss[loss=0.2526, simple_loss=0.3275, pruned_loss=0.08883, over 21212.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3093, pruned_loss=0.08288, over 4250745.21 frames. 
], batch size: 159, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:00:36,498 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.565e+02 3.483e+02 4.528e+02 6.877e+02 1.317e+03, threshold=9.056e+02, percent-clipped=16.0 2023-06-22 12:00:43,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=2.98 vs. limit=12.0 2023-06-22 12:00:44,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1222482.0, ans=0.0 2023-06-22 12:01:14,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1222542.0, ans=0.125 2023-06-22 12:01:26,496 INFO [train.py:996] (2/4) Epoch 7, batch 20800, loss[loss=0.2896, simple_loss=0.3706, pruned_loss=0.1043, over 21403.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3144, pruned_loss=0.08428, over 4259059.23 frames. ], batch size: 471, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:01:53,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1222662.0, ans=0.2 2023-06-22 12:02:34,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-22 12:02:35,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-22 12:03:02,323 INFO [train.py:996] (2/4) Epoch 7, batch 20850, loss[loss=0.2489, simple_loss=0.3092, pruned_loss=0.09426, over 21136.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3087, pruned_loss=0.08246, over 4254608.44 frames. ], batch size: 608, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:03:17,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5 2023-06-22 12:03:30,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1222962.0, ans=0.125 2023-06-22 12:03:34,278 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:03:45,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1223022.0, ans=0.1 2023-06-22 12:03:49,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.19 vs. limit=15.0 2023-06-22 12:04:01,379 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.483e+02 3.647e+02 5.072e+02 6.568e+02 1.337e+03, threshold=1.014e+03, percent-clipped=9.0 2023-06-22 12:04:02,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=15.0 2023-06-22 12:04:23,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1223082.0, ans=0.125 2023-06-22 12:04:46,061 INFO [train.py:996] (2/4) Epoch 7, batch 20900, loss[loss=0.2626, simple_loss=0.3242, pruned_loss=0.1005, over 21747.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.311, pruned_loss=0.08385, over 4256772.43 frames. 
], batch size: 441, lr: 4.25e-03, grad_scale: 32.0 2023-06-22 12:05:53,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1223382.0, ans=0.0 2023-06-22 12:06:19,590 INFO [train.py:996] (2/4) Epoch 7, batch 20950, loss[loss=0.1979, simple_loss=0.2663, pruned_loss=0.06473, over 21942.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3063, pruned_loss=0.08066, over 4255486.49 frames. ], batch size: 98, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:06:32,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1223502.0, ans=0.125 2023-06-22 12:06:43,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1223562.0, ans=0.09899494936611666 2023-06-22 12:07:06,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1223622.0, ans=0.125 2023-06-22 12:07:15,568 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.342e+02 3.036e+02 3.516e+02 4.387e+02 8.628e+02, threshold=7.032e+02, percent-clipped=0.0 2023-06-22 12:07:41,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1223742.0, ans=0.0 2023-06-22 12:07:49,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1223742.0, ans=0.2 2023-06-22 12:07:57,788 INFO [train.py:996] (2/4) Epoch 7, batch 21000, loss[loss=0.2422, simple_loss=0.3179, pruned_loss=0.08326, over 21839.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3045, pruned_loss=0.08066, over 4267959.48 frames. ], batch size: 333, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:07:57,789 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 12:08:10,148 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.8170, 3.2144, 1.6750, 1.7351], device='cuda:2') 2023-06-22 12:08:15,954 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2689, simple_loss=0.3672, pruned_loss=0.08525, over 1796401.00 frames. 2023-06-22 12:08:15,954 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 12:08:52,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=22.5 2023-06-22 12:08:52,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1223922.0, ans=0.1 2023-06-22 12:08:54,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1223922.0, ans=0.0 2023-06-22 12:09:54,857 INFO [train.py:996] (2/4) Epoch 7, batch 21050, loss[loss=0.2169, simple_loss=0.2793, pruned_loss=0.07725, over 21498.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3016, pruned_loss=0.07999, over 4267092.35 frames. ], batch size: 230, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:10:08,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.74 vs. 
limit=6.0 2023-06-22 12:10:12,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1224162.0, ans=0.125 2023-06-22 12:10:14,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-22 12:10:49,867 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.015e+02 3.354e+02 4.094e+02 5.427e+02, threshold=6.709e+02, percent-clipped=0.0 2023-06-22 12:11:33,479 INFO [train.py:996] (2/4) Epoch 7, batch 21100, loss[loss=0.2054, simple_loss=0.2757, pruned_loss=0.06754, over 21726.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.297, pruned_loss=0.07921, over 4255822.39 frames. ], batch size: 371, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:11:51,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1224462.0, ans=0.0 2023-06-22 12:11:56,364 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:13:07,717 INFO [train.py:996] (2/4) Epoch 7, batch 21150, loss[loss=0.2173, simple_loss=0.2778, pruned_loss=0.07839, over 21777.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2954, pruned_loss=0.08023, over 4249434.51 frames. ], batch size: 124, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:14:08,451 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.503e+02 3.198e+02 3.741e+02 4.699e+02 9.376e+02, threshold=7.483e+02, percent-clipped=8.0 2023-06-22 12:14:46,333 INFO [train.py:996] (2/4) Epoch 7, batch 21200, loss[loss=0.2269, simple_loss=0.2949, pruned_loss=0.07947, over 21637.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2903, pruned_loss=0.07874, over 4254500.22 frames. ], batch size: 332, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:15:42,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.47 vs. limit=15.0 2023-06-22 12:15:50,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-22 12:15:55,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1225182.0, ans=0.1 2023-06-22 12:16:11,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1225242.0, ans=0.125 2023-06-22 12:16:30,835 INFO [train.py:996] (2/4) Epoch 7, batch 21250, loss[loss=0.3066, simple_loss=0.3799, pruned_loss=0.1167, over 21701.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2889, pruned_loss=0.07849, over 4264747.05 frames. 
], batch size: 391, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:16:47,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1225362.0, ans=0.0 2023-06-22 12:17:09,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1225422.0, ans=0.125 2023-06-22 12:17:27,709 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.499e+02 3.297e+02 3.945e+02 5.021e+02 1.062e+03, threshold=7.890e+02, percent-clipped=7.0 2023-06-22 12:17:54,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1225542.0, ans=0.05 2023-06-22 12:18:03,957 INFO [train.py:996] (2/4) Epoch 7, batch 21300, loss[loss=0.2673, simple_loss=0.3277, pruned_loss=0.1034, over 21926.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2947, pruned_loss=0.08047, over 4258114.56 frames. ], batch size: 333, lr: 4.25e-03, grad_scale: 16.0 2023-06-22 12:19:20,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1225782.0, ans=0.125 2023-06-22 12:19:23,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1225782.0, ans=0.0 2023-06-22 12:19:38,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1225842.0, ans=0.2 2023-06-22 12:19:47,786 INFO [train.py:996] (2/4) Epoch 7, batch 21350, loss[loss=0.2439, simple_loss=0.3332, pruned_loss=0.07728, over 21633.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2983, pruned_loss=0.08108, over 4253797.64 frames. ], batch size: 389, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:19:52,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-22 12:19:53,534 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.75 vs. limit=15.0 2023-06-22 12:20:02,802 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.44 vs. limit=15.0 2023-06-22 12:20:45,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.091e+02 3.191e+02 3.567e+02 4.757e+02 8.464e+02, threshold=7.133e+02, percent-clipped=1.0 2023-06-22 12:21:26,926 INFO [train.py:996] (2/4) Epoch 7, batch 21400, loss[loss=0.2228, simple_loss=0.3046, pruned_loss=0.07055, over 21935.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3022, pruned_loss=0.08078, over 4263474.18 frames. 
], batch size: 316, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:21:36,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1226202.0, ans=0.125 2023-06-22 12:22:12,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1226322.0, ans=0.125 2023-06-22 12:22:29,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1226382.0, ans=0.1 2023-06-22 12:22:42,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1226382.0, ans=0.125 2023-06-22 12:23:06,007 INFO [train.py:996] (2/4) Epoch 7, batch 21450, loss[loss=0.2915, simple_loss=0.3619, pruned_loss=0.1105, over 21331.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3071, pruned_loss=0.08311, over 4271766.99 frames. ], batch size: 548, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:23:16,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1226502.0, ans=0.5 2023-06-22 12:23:22,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1226562.0, ans=0.0 2023-06-22 12:24:03,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1226682.0, ans=0.0 2023-06-22 12:24:04,445 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.467e+02 3.251e+02 3.638e+02 4.479e+02 7.872e+02, threshold=7.276e+02, percent-clipped=2.0 2023-06-22 12:24:15,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1226682.0, ans=0.2 2023-06-22 12:24:35,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1226742.0, ans=0.2 2023-06-22 12:24:45,008 INFO [train.py:996] (2/4) Epoch 7, batch 21500, loss[loss=0.2027, simple_loss=0.2637, pruned_loss=0.07086, over 21401.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3075, pruned_loss=0.08372, over 4259662.21 frames. ], batch size: 194, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:25:24,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1226922.0, ans=0.0 2023-06-22 12:26:22,141 INFO [train.py:996] (2/4) Epoch 7, batch 21550, loss[loss=0.1986, simple_loss=0.2644, pruned_loss=0.0664, over 21697.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3025, pruned_loss=0.08139, over 4262883.67 frames. 
], batch size: 112, lr: 4.24e-03, grad_scale: 8.0 2023-06-22 12:26:22,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1227102.0, ans=0.125 2023-06-22 12:27:22,171 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.230e+02 3.394e+02 4.273e+02 5.102e+02 8.166e+02, threshold=8.546e+02, percent-clipped=3.0 2023-06-22 12:27:24,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1227282.0, ans=0.125 2023-06-22 12:27:35,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1227282.0, ans=0.0 2023-06-22 12:27:35,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1227282.0, ans=0.125 2023-06-22 12:27:52,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1227342.0, ans=0.0 2023-06-22 12:28:03,003 INFO [train.py:996] (2/4) Epoch 7, batch 21600, loss[loss=0.1745, simple_loss=0.2556, pruned_loss=0.04672, over 21198.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2977, pruned_loss=0.08032, over 4261481.52 frames. ], batch size: 548, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:28:46,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1227522.0, ans=0.2 2023-06-22 12:29:01,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1227582.0, ans=0.2 2023-06-22 12:29:24,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1227582.0, ans=0.1 2023-06-22 12:29:27,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1227642.0, ans=0.0 2023-06-22 12:29:44,721 INFO [train.py:996] (2/4) Epoch 7, batch 21650, loss[loss=0.2483, simple_loss=0.3191, pruned_loss=0.08875, over 21182.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3008, pruned_loss=0.07802, over 4270470.70 frames. ], batch size: 143, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:29:45,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1227702.0, ans=0.125 2023-06-22 12:30:12,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1227762.0, ans=0.125 2023-06-22 12:30:12,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1227762.0, ans=0.125 2023-06-22 12:30:22,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.69 vs. limit=15.0 2023-06-22 12:30:31,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1227822.0, ans=0.0 2023-06-22 12:30:37,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.31 vs. 
limit=6.0 2023-06-22 12:30:47,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1227882.0, ans=0.125 2023-06-22 12:30:48,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.276e+02 3.272e+02 4.072e+02 5.244e+02 1.561e+03, threshold=8.145e+02, percent-clipped=7.0 2023-06-22 12:31:02,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1227882.0, ans=0.0 2023-06-22 12:31:21,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1228002.0, ans=10.0 2023-06-22 12:31:22,586 INFO [train.py:996] (2/4) Epoch 7, batch 21700, loss[loss=0.2039, simple_loss=0.2644, pruned_loss=0.07167, over 21481.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3003, pruned_loss=0.07657, over 4271189.24 frames. ], batch size: 195, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:31:43,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-22 12:32:05,189 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:32:12,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1228122.0, ans=0.125 2023-06-22 12:32:12,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1228122.0, ans=0.0 2023-06-22 12:32:58,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1228242.0, ans=0.2 2023-06-22 12:32:58,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1228242.0, ans=0.0 2023-06-22 12:33:00,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1228302.0, ans=0.95 2023-06-22 12:33:01,426 INFO [train.py:996] (2/4) Epoch 7, batch 21750, loss[loss=0.2221, simple_loss=0.2806, pruned_loss=0.08178, over 21693.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2962, pruned_loss=0.07661, over 4273924.87 frames. ], batch size: 299, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:33:35,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1228362.0, ans=0.0 2023-06-22 12:33:58,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.37 vs. limit=15.0 2023-06-22 12:34:00,302 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.383e+02 3.122e+02 3.637e+02 4.917e+02 1.048e+03, threshold=7.274e+02, percent-clipped=3.0 2023-06-22 12:34:40,468 INFO [train.py:996] (2/4) Epoch 7, batch 21800, loss[loss=0.2156, simple_loss=0.292, pruned_loss=0.06963, over 21535.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2953, pruned_loss=0.07813, over 4250576.18 frames. 
], batch size: 230, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:35:02,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1228662.0, ans=0.035 2023-06-22 12:35:10,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1228662.0, ans=0.0 2023-06-22 12:35:18,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1228662.0, ans=0.0 2023-06-22 12:35:32,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1228722.0, ans=0.125 2023-06-22 12:36:13,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-22 12:36:20,662 INFO [train.py:996] (2/4) Epoch 7, batch 21850, loss[loss=0.2249, simple_loss=0.2931, pruned_loss=0.07833, over 21830.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3025, pruned_loss=0.07859, over 4255907.58 frames. ], batch size: 282, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:36:24,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1228902.0, ans=0.5 2023-06-22 12:36:46,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1228962.0, ans=0.125 2023-06-22 12:36:48,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1228962.0, ans=0.125 2023-06-22 12:37:24,819 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.520e+02 3.283e+02 3.846e+02 4.671e+02 1.030e+03, threshold=7.692e+02, percent-clipped=3.0 2023-06-22 12:37:34,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1229082.0, ans=0.125 2023-06-22 12:37:38,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.98 vs. limit=8.0 2023-06-22 12:37:42,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1229082.0, ans=0.125 2023-06-22 12:38:00,766 INFO [train.py:996] (2/4) Epoch 7, batch 21900, loss[loss=0.2068, simple_loss=0.2738, pruned_loss=0.06992, over 21706.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3023, pruned_loss=0.07998, over 4260858.34 frames. ], batch size: 264, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:38:37,890 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:38:41,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1229322.0, ans=0.125 2023-06-22 12:39:44,760 INFO [train.py:996] (2/4) Epoch 7, batch 21950, loss[loss=0.2242, simple_loss=0.2821, pruned_loss=0.08311, over 21512.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2972, pruned_loss=0.07908, over 4271349.69 frames. 
], batch size: 195, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:40:10,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1229562.0, ans=0.125 2023-06-22 12:40:12,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1229562.0, ans=0.125 2023-06-22 12:40:27,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1229622.0, ans=0.125 2023-06-22 12:40:48,667 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.892e+02 2.971e+02 3.645e+02 4.413e+02 9.727e+02, threshold=7.291e+02, percent-clipped=1.0 2023-06-22 12:40:52,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.55 vs. limit=15.0 2023-06-22 12:40:54,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1229682.0, ans=0.125 2023-06-22 12:41:24,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-22 12:41:24,792 INFO [train.py:996] (2/4) Epoch 7, batch 22000, loss[loss=0.2105, simple_loss=0.2802, pruned_loss=0.07043, over 21609.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2913, pruned_loss=0.07676, over 4264654.20 frames. ], batch size: 247, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:41:49,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1229862.0, ans=0.125 2023-06-22 12:41:54,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1229862.0, ans=0.125 2023-06-22 12:42:16,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1229922.0, ans=0.125 2023-06-22 12:42:37,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1229982.0, ans=0.125 2023-06-22 12:42:42,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1229982.0, ans=0.1 2023-06-22 12:43:11,540 INFO [train.py:996] (2/4) Epoch 7, batch 22050, loss[loss=0.2515, simple_loss=0.3319, pruned_loss=0.08553, over 21559.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2961, pruned_loss=0.07736, over 4257499.31 frames. ], batch size: 230, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:43:12,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1230102.0, ans=0.0 2023-06-22 12:43:12,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1230102.0, ans=0.0 2023-06-22 12:43:19,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1230102.0, ans=0.1 2023-06-22 12:44:14,083 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.284e+02 3.765e+02 5.011e+02 6.386e+02 1.691e+03, threshold=1.002e+03, percent-clipped=17.0 2023-06-22 12:44:52,496 INFO [train.py:996] (2/4) Epoch 7, batch 22100, loss[loss=0.2289, simple_loss=0.2959, pruned_loss=0.08096, over 21801.00 frames. 
], tot_loss[loss=0.2365, simple_loss=0.3072, pruned_loss=0.08293, over 4256141.88 frames. ], batch size: 247, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:44:55,947 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:45:20,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1230462.0, ans=0.1 2023-06-22 12:46:17,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1230642.0, ans=0.1 2023-06-22 12:46:30,169 INFO [train.py:996] (2/4) Epoch 7, batch 22150, loss[loss=0.3144, simple_loss=0.3499, pruned_loss=0.1394, over 21756.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.311, pruned_loss=0.08516, over 4260180.08 frames. ], batch size: 508, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:47:12,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1230822.0, ans=0.1 2023-06-22 12:47:22,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.82 vs. limit=22.5 2023-06-22 12:47:29,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.621e+02 3.676e+02 4.235e+02 5.035e+02 1.205e+03, threshold=8.469e+02, percent-clipped=1.0 2023-06-22 12:47:42,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1230942.0, ans=0.1 2023-06-22 12:48:02,746 INFO [train.py:996] (2/4) Epoch 7, batch 22200, loss[loss=0.3018, simple_loss=0.3807, pruned_loss=0.1115, over 21869.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3132, pruned_loss=0.08575, over 4270292.14 frames. ], batch size: 371, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:48:24,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1231062.0, ans=0.1 2023-06-22 12:48:25,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1231062.0, ans=0.0 2023-06-22 12:48:35,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1231062.0, ans=0.0 2023-06-22 12:48:49,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1231122.0, ans=0.2 2023-06-22 12:49:28,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1231242.0, ans=0.125 2023-06-22 12:49:32,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1231242.0, ans=0.1 2023-06-22 12:49:48,292 INFO [train.py:996] (2/4) Epoch 7, batch 22250, loss[loss=0.2455, simple_loss=0.3096, pruned_loss=0.09075, over 21278.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3201, pruned_loss=0.08815, over 4275410.78 frames. 
], batch size: 176, lr: 4.24e-03, grad_scale: 16.0 2023-06-22 12:50:25,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1231362.0, ans=0.1 2023-06-22 12:50:43,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1231422.0, ans=0.125 2023-06-22 12:50:45,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.32 vs. limit=6.0 2023-06-22 12:50:50,639 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.541e+02 3.545e+02 4.219e+02 5.859e+02 1.258e+03, threshold=8.437e+02, percent-clipped=3.0 2023-06-22 12:50:58,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1231482.0, ans=0.125 2023-06-22 12:51:29,293 INFO [train.py:996] (2/4) Epoch 7, batch 22300, loss[loss=0.2456, simple_loss=0.3096, pruned_loss=0.09079, over 21928.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3205, pruned_loss=0.08943, over 4278431.11 frames. ], batch size: 333, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:51:34,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1231602.0, ans=0.1 2023-06-22 12:51:37,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1231602.0, ans=0.04949747468305833 2023-06-22 12:51:37,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1231602.0, ans=0.04949747468305833 2023-06-22 12:51:45,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1231662.0, ans=0.0 2023-06-22 12:52:27,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1231782.0, ans=0.0 2023-06-22 12:52:50,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-22 12:53:10,174 INFO [train.py:996] (2/4) Epoch 7, batch 22350, loss[loss=0.2515, simple_loss=0.3318, pruned_loss=0.08563, over 20111.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3194, pruned_loss=0.09061, over 4287317.31 frames. ], batch size: 703, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:53:49,947 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 12:54:04,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1232022.0, ans=0.125 2023-06-22 12:54:14,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1232082.0, ans=0.2 2023-06-22 12:54:17,416 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.270e+02 3.296e+02 3.742e+02 4.441e+02 8.144e+02, threshold=7.483e+02, percent-clipped=0.0 2023-06-22 12:54:22,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1232082.0, ans=0.125 2023-06-22 12:54:50,709 INFO [train.py:996] (2/4) Epoch 7, batch 22400, loss[loss=0.2374, simple_loss=0.2942, pruned_loss=0.09028, over 21310.00 frames. 
], tot_loss[loss=0.2432, simple_loss=0.3141, pruned_loss=0.0861, over 4288910.76 frames. ], batch size: 177, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 12:54:53,341 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.36 vs. limit=15.0 2023-06-22 12:55:57,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1232382.0, ans=0.125 2023-06-22 12:56:08,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1232442.0, ans=0.0 2023-06-22 12:56:30,505 INFO [train.py:996] (2/4) Epoch 7, batch 22450, loss[loss=0.2251, simple_loss=0.2863, pruned_loss=0.08201, over 21808.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.308, pruned_loss=0.08525, over 4282077.26 frames. ], batch size: 118, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 12:56:43,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1232502.0, ans=0.125 2023-06-22 12:57:02,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=12.0 2023-06-22 12:57:16,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1232622.0, ans=0.0 2023-06-22 12:57:34,108 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.003e+02 3.355e+02 3.883e+02 5.692e+02, threshold=6.709e+02, percent-clipped=0.0 2023-06-22 12:57:39,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1232682.0, ans=0.125 2023-06-22 12:57:48,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1232742.0, ans=0.0 2023-06-22 12:58:16,666 INFO [train.py:996] (2/4) Epoch 7, batch 22500, loss[loss=0.2356, simple_loss=0.3404, pruned_loss=0.06545, over 20837.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3041, pruned_loss=0.08425, over 4272203.48 frames. ], batch size: 607, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 12:59:15,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1232982.0, ans=0.125 2023-06-22 12:59:33,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1233042.0, ans=0.1 2023-06-22 12:59:59,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1233102.0, ans=0.1 2023-06-22 13:00:01,039 INFO [train.py:996] (2/4) Epoch 7, batch 22550, loss[loss=0.2323, simple_loss=0.3057, pruned_loss=0.07938, over 21301.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3091, pruned_loss=0.08461, over 4281256.29 frames. ], batch size: 176, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:00:03,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=1233102.0, ans=8.0 2023-06-22 13:00:31,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.47 vs. 
limit=15.0 2023-06-22 13:00:44,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1233222.0, ans=0.1 2023-06-22 13:01:06,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.865e+02 3.475e+02 4.180e+02 5.606e+02 1.235e+03, threshold=8.360e+02, percent-clipped=11.0 2023-06-22 13:01:41,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1233342.0, ans=0.0 2023-06-22 13:01:45,197 INFO [train.py:996] (2/4) Epoch 7, batch 22600, loss[loss=0.2639, simple_loss=0.3373, pruned_loss=0.09529, over 21744.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3102, pruned_loss=0.0845, over 4280386.04 frames. ], batch size: 298, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:02:13,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1233462.0, ans=0.125 2023-06-22 13:02:31,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1233522.0, ans=0.125 2023-06-22 13:02:42,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1233582.0, ans=0.125 2023-06-22 13:02:44,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1233582.0, ans=0.05 2023-06-22 13:02:46,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0 2023-06-22 13:03:18,571 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-22 13:03:25,490 INFO [train.py:996] (2/4) Epoch 7, batch 22650, loss[loss=0.231, simple_loss=0.3, pruned_loss=0.08099, over 21641.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3097, pruned_loss=0.0856, over 4277131.59 frames. ], batch size: 263, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:03:30,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1233702.0, ans=0.035 2023-06-22 13:04:08,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=15.0 2023-06-22 13:04:19,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1233822.0, ans=0.1 2023-06-22 13:04:20,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.07 vs. limit=15.0 2023-06-22 13:04:32,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.918e+02 3.837e+02 4.775e+02 6.238e+02 8.753e+02, threshold=9.549e+02, percent-clipped=4.0 2023-06-22 13:04:43,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1233882.0, ans=0.2 2023-06-22 13:05:04,924 INFO [train.py:996] (2/4) Epoch 7, batch 22700, loss[loss=0.243, simple_loss=0.3041, pruned_loss=0.091, over 21810.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.303, pruned_loss=0.08416, over 4269966.18 frames. 
], batch size: 317, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:06:42,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1234242.0, ans=0.1 2023-06-22 13:06:45,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1234302.0, ans=0.025 2023-06-22 13:06:46,562 INFO [train.py:996] (2/4) Epoch 7, batch 22750, loss[loss=0.3059, simple_loss=0.3653, pruned_loss=0.1233, over 21824.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3074, pruned_loss=0.08677, over 4270505.25 frames. ], batch size: 124, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:07:09,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1234362.0, ans=0.125 2023-06-22 13:07:40,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-22 13:07:53,253 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.685e+02 3.471e+02 4.141e+02 5.452e+02 1.173e+03, threshold=8.282e+02, percent-clipped=2.0 2023-06-22 13:08:25,181 INFO [train.py:996] (2/4) Epoch 7, batch 22800, loss[loss=0.2076, simple_loss=0.2917, pruned_loss=0.06172, over 21840.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3109, pruned_loss=0.08848, over 4269337.69 frames. ], batch size: 124, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 13:08:26,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-22 13:08:45,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1234662.0, ans=0.0 2023-06-22 13:10:04,263 INFO [train.py:996] (2/4) Epoch 7, batch 22850, loss[loss=0.1752, simple_loss=0.2595, pruned_loss=0.04546, over 19862.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3068, pruned_loss=0.08758, over 4270239.64 frames. ], batch size: 703, lr: 4.23e-03, grad_scale: 32.0 2023-06-22 13:10:11,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1234902.0, ans=0.125 2023-06-22 13:10:11,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1234902.0, ans=0.0 2023-06-22 13:10:19,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.26 vs. limit=6.0 2023-06-22 13:10:34,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1234962.0, ans=0.0 2023-06-22 13:10:39,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1234962.0, ans=0.5 2023-06-22 13:10:54,497 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.85 vs. 
limit=15.0 2023-06-22 13:11:01,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1235082.0, ans=0.2 2023-06-22 13:11:09,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.451e+02 3.461e+02 4.069e+02 5.005e+02 9.619e+02, threshold=8.139e+02, percent-clipped=3.0 2023-06-22 13:11:10,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1235082.0, ans=0.2 2023-06-22 13:11:45,236 INFO [train.py:996] (2/4) Epoch 7, batch 22900, loss[loss=0.2228, simple_loss=0.3198, pruned_loss=0.06296, over 21736.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3074, pruned_loss=0.08745, over 4263885.73 frames. ], batch size: 247, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:11:47,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1235202.0, ans=0.125 2023-06-22 13:13:06,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.65 vs. limit=15.0 2023-06-22 13:13:16,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.61 vs. limit=15.0 2023-06-22 13:13:32,254 INFO [train.py:996] (2/4) Epoch 7, batch 22950, loss[loss=0.2597, simple_loss=0.3816, pruned_loss=0.0689, over 21660.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3189, pruned_loss=0.08574, over 4264591.44 frames. ], batch size: 414, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:13:34,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1235502.0, ans=0.125 2023-06-22 13:13:36,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.05 vs. limit=12.0 2023-06-22 13:13:43,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1235502.0, ans=0.0 2023-06-22 13:13:47,380 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.35 vs. limit=15.0 2023-06-22 13:14:19,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1235622.0, ans=0.2 2023-06-22 13:14:42,481 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 3.271e+02 4.396e+02 6.484e+02 1.017e+03, threshold=8.792e+02, percent-clipped=10.0 2023-06-22 13:14:42,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1235682.0, ans=0.125 2023-06-22 13:14:57,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-22 13:15:11,710 INFO [train.py:996] (2/4) Epoch 7, batch 23000, loss[loss=0.2408, simple_loss=0.3177, pruned_loss=0.08201, over 21672.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.318, pruned_loss=0.08381, over 4266377.48 frames. 
], batch size: 389, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:15:20,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1235802.0, ans=0.0 2023-06-22 13:15:44,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-22 13:16:08,390 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-22 13:16:25,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1235982.0, ans=0.125 2023-06-22 13:16:33,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1235982.0, ans=0.0 2023-06-22 13:16:39,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1236042.0, ans=0.2 2023-06-22 13:16:52,135 INFO [train.py:996] (2/4) Epoch 7, batch 23050, loss[loss=0.2727, simple_loss=0.341, pruned_loss=0.1021, over 21829.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3191, pruned_loss=0.08525, over 4266549.69 frames. ], batch size: 247, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:16:57,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1236102.0, ans=0.125 2023-06-22 13:17:22,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-22 13:18:03,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.533e+02 3.591e+02 4.450e+02 5.560e+02 1.062e+03, threshold=8.900e+02, percent-clipped=1.0 2023-06-22 13:18:25,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1236342.0, ans=0.1 2023-06-22 13:18:33,440 INFO [train.py:996] (2/4) Epoch 7, batch 23100, loss[loss=0.32, simple_loss=0.373, pruned_loss=0.1335, over 21401.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.316, pruned_loss=0.08574, over 4265939.22 frames. ], batch size: 471, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:19:05,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236462.0, ans=0.1 2023-06-22 13:19:28,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.18 vs. 
limit=10.0 2023-06-22 13:19:29,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1236522.0, ans=0.07 2023-06-22 13:19:31,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1236522.0, ans=0.0 2023-06-22 13:19:32,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1236522.0, ans=0.125 2023-06-22 13:19:39,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1236582.0, ans=0.125 2023-06-22 13:19:40,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1236582.0, ans=0.1 2023-06-22 13:19:43,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1236582.0, ans=0.125 2023-06-22 13:19:59,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1236642.0, ans=0.125 2023-06-22 13:20:09,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1236642.0, ans=0.125 2023-06-22 13:20:11,889 INFO [train.py:996] (2/4) Epoch 7, batch 23150, loss[loss=0.2041, simple_loss=0.2712, pruned_loss=0.06855, over 21842.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3122, pruned_loss=0.08584, over 4275488.45 frames. ], batch size: 298, lr: 4.23e-03, grad_scale: 8.0 2023-06-22 13:20:35,259 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:21:18,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1236882.0, ans=0.0 2023-06-22 13:21:21,226 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.574e+02 4.225e+02 5.615e+02 9.377e+02, threshold=8.449e+02, percent-clipped=1.0 2023-06-22 13:21:50,741 INFO [train.py:996] (2/4) Epoch 7, batch 23200, loss[loss=0.2719, simple_loss=0.3203, pruned_loss=0.1117, over 21614.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3118, pruned_loss=0.08687, over 4282969.80 frames. ], batch size: 471, lr: 4.23e-03, grad_scale: 16.0 2023-06-22 13:22:54,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1237182.0, ans=0.04949747468305833 2023-06-22 13:22:57,497 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0 2023-06-22 13:23:07,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1237242.0, ans=0.1 2023-06-22 13:23:11,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1237242.0, ans=0.0 2023-06-22 13:23:30,175 INFO [train.py:996] (2/4) Epoch 7, batch 23250, loss[loss=0.2262, simple_loss=0.2905, pruned_loss=0.08089, over 21809.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3099, pruned_loss=0.08682, over 4286980.63 frames. 
], batch size: 247, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:23:53,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1237362.0, ans=0.125 2023-06-22 13:23:56,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1237362.0, ans=0.2 2023-06-22 13:24:30,743 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:24:46,816 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 3.587e+02 4.602e+02 6.286e+02 1.178e+03, threshold=9.205e+02, percent-clipped=7.0 2023-06-22 13:24:51,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.15 vs. limit=22.5 2023-06-22 13:25:16,765 INFO [train.py:996] (2/4) Epoch 7, batch 23300, loss[loss=0.2196, simple_loss=0.3057, pruned_loss=0.06677, over 21854.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3186, pruned_loss=0.08948, over 4291643.36 frames. ], batch size: 124, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:25:49,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1237662.0, ans=0.0 2023-06-22 13:26:11,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1237722.0, ans=0.125 2023-06-22 13:26:45,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1237842.0, ans=0.125 2023-06-22 13:26:57,948 INFO [train.py:996] (2/4) Epoch 7, batch 23350, loss[loss=0.1851, simple_loss=0.2587, pruned_loss=0.05572, over 21241.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3222, pruned_loss=0.08851, over 4292510.51 frames. ], batch size: 159, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:27:35,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1237962.0, ans=0.0 2023-06-22 13:27:40,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-22 13:27:43,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1238022.0, ans=0.0 2023-06-22 13:28:00,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1238082.0, ans=0.0 2023-06-22 13:28:06,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1238082.0, ans=0.2 2023-06-22 13:28:07,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.01 vs. 
limit=15.0 2023-06-22 13:28:09,256 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.152e+02 3.411e+02 4.317e+02 5.452e+02 1.291e+03, threshold=8.634e+02, percent-clipped=4.0 2023-06-22 13:28:09,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1238082.0, ans=0.1 2023-06-22 13:28:28,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1238142.0, ans=0.2 2023-06-22 13:28:38,120 INFO [train.py:996] (2/4) Epoch 7, batch 23400, loss[loss=0.2588, simple_loss=0.3225, pruned_loss=0.09759, over 21933.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3157, pruned_loss=0.08512, over 4287178.77 frames. ], batch size: 107, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:29:14,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1238262.0, ans=0.125 2023-06-22 13:29:21,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1238322.0, ans=0.0 2023-06-22 13:29:21,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1238322.0, ans=0.0 2023-06-22 13:29:44,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1238382.0, ans=0.125 2023-06-22 13:30:12,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1238442.0, ans=0.05 2023-06-22 13:30:24,427 INFO [train.py:996] (2/4) Epoch 7, batch 23450, loss[loss=0.3042, simple_loss=0.3751, pruned_loss=0.1166, over 21860.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3179, pruned_loss=0.08628, over 4282248.51 frames. ], batch size: 124, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:30:32,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1238502.0, ans=0.05 2023-06-22 13:30:53,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1238562.0, ans=0.0 2023-06-22 13:30:59,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.11 vs. limit=12.0 2023-06-22 13:31:18,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.33 vs. limit=10.0 2023-06-22 13:31:31,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1238682.0, ans=0.125 2023-06-22 13:31:34,390 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.532e+02 3.929e+02 4.982e+02 6.736e+02 9.588e+02, threshold=9.965e+02, percent-clipped=2.0 2023-06-22 13:31:42,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1238742.0, ans=0.125 2023-06-22 13:32:01,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1238802.0, ans=0.125 2023-06-22 13:32:02,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.76 vs. 
limit=15.0 2023-06-22 13:32:07,434 INFO [train.py:996] (2/4) Epoch 7, batch 23500, loss[loss=0.2218, simple_loss=0.2849, pruned_loss=0.07934, over 21549.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3189, pruned_loss=0.08768, over 4284013.23 frames. ], batch size: 195, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:32:09,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1238802.0, ans=0.2 2023-06-22 13:32:15,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1238802.0, ans=0.1 2023-06-22 13:32:21,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=22.5 2023-06-22 13:32:22,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=12.0 2023-06-22 13:32:35,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1238862.0, ans=0.1 2023-06-22 13:33:08,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1238982.0, ans=0.0 2023-06-22 13:33:48,029 INFO [train.py:996] (2/4) Epoch 7, batch 23550, loss[loss=0.2498, simple_loss=0.3033, pruned_loss=0.09812, over 21478.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3135, pruned_loss=0.08816, over 4274017.42 frames. ], batch size: 548, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:34:11,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1239162.0, ans=0.5 2023-06-22 13:34:52,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1239282.0, ans=0.0 2023-06-22 13:34:55,612 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.824e+02 3.361e+02 3.874e+02 4.874e+02 9.234e+02, threshold=7.748e+02, percent-clipped=0.0 2023-06-22 13:35:04,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1239342.0, ans=0.125 2023-06-22 13:35:23,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1239342.0, ans=0.125 2023-06-22 13:35:29,849 INFO [train.py:996] (2/4) Epoch 7, batch 23600, loss[loss=0.1844, simple_loss=0.2404, pruned_loss=0.06422, over 20881.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3132, pruned_loss=0.08834, over 4272853.35 frames. ], batch size: 608, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:35:57,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.42 vs. limit=15.0 2023-06-22 13:36:54,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1239642.0, ans=0.0 2023-06-22 13:37:09,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1239642.0, ans=0.0 2023-06-22 13:37:12,150 INFO [train.py:996] (2/4) Epoch 7, batch 23650, loss[loss=0.2336, simple_loss=0.3182, pruned_loss=0.07449, over 21914.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3139, pruned_loss=0.08672, over 4267483.21 frames. 
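The train.py entries above pair a per-batch loss[...] with a running tot_loss[...] reported "over N frames", and the frame total grows by roughly each batch's frame count. That is consistent with a frame-weighted running average of the loss components. A minimal sketch under that assumption (the class and method names are illustrative, not taken from the project):

    # Frame-weighted running averages, as the "tot_loss[... over N frames]"
    # entries suggest. Illustrative only; not the project's actual code.
    from collections import defaultdict

    class MetricsTracker:
        def __init__(self):
            self.sums = defaultdict(float)   # sum of (loss value * frames) per component
            self.frames = 0.0                # total frames accumulated so far

        def update(self, batch_frames: float, **losses: float) -> None:
            self.frames += batch_frames
            for name, value in losses.items():
                self.sums[name] += value * batch_frames

        def averages(self) -> dict:
            return {k: v / max(self.frames, 1.0) for k, v in self.sums.items()}

    tot = MetricsTracker()
    tot.update(21854.0, loss=0.2196, simple_loss=0.3057, pruned_loss=0.06677)
    print(tot.averages(), tot.frames)

Each batch would call update() with its own frame count, which is why later tot_loss entries average over several million frames.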
], batch size: 371, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:37:16,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1239702.0, ans=10.0 2023-06-22 13:37:34,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1239762.0, ans=0.1 2023-06-22 13:38:24,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-22 13:38:29,765 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.732e+02 3.692e+02 5.064e+02 6.588e+02 1.428e+03, threshold=1.013e+03, percent-clipped=16.0 2023-06-22 13:38:53,661 INFO [train.py:996] (2/4) Epoch 7, batch 23700, loss[loss=0.2295, simple_loss=0.3162, pruned_loss=0.07146, over 21578.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3158, pruned_loss=0.08603, over 4267467.56 frames. ], batch size: 441, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:39:55,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1240122.0, ans=0.1 2023-06-22 13:40:00,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1240122.0, ans=0.125 2023-06-22 13:40:15,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1240182.0, ans=0.125 2023-06-22 13:40:21,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1240242.0, ans=0.125 2023-06-22 13:40:40,720 INFO [train.py:996] (2/4) Epoch 7, batch 23750, loss[loss=0.2426, simple_loss=0.3108, pruned_loss=0.08719, over 21743.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3181, pruned_loss=0.08705, over 4269126.85 frames. ], batch size: 247, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:41:38,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1240482.0, ans=0.2 2023-06-22 13:41:47,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.371e+02 3.292e+02 4.228e+02 5.463e+02 1.067e+03, threshold=8.456e+02, percent-clipped=1.0 2023-06-22 13:42:00,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1240542.0, ans=0.95 2023-06-22 13:42:23,013 INFO [train.py:996] (2/4) Epoch 7, batch 23800, loss[loss=0.3288, simple_loss=0.415, pruned_loss=0.1213, over 21589.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3173, pruned_loss=0.08458, over 4269789.75 frames. ], batch size: 414, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:43:25,850 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:43:30,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1240782.0, ans=0.125 2023-06-22 13:43:37,742 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.90 vs. 
limit=12.0 2023-06-22 13:43:54,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1240842.0, ans=22.5 2023-06-22 13:44:00,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1240842.0, ans=0.125 2023-06-22 13:44:04,743 INFO [train.py:996] (2/4) Epoch 7, batch 23850, loss[loss=0.3403, simple_loss=0.4042, pruned_loss=0.1383, over 21390.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3252, pruned_loss=0.08651, over 4271461.71 frames. ], batch size: 507, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:44:05,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1240902.0, ans=0.2 2023-06-22 13:44:23,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1240902.0, ans=0.1 2023-06-22 13:44:35,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1240962.0, ans=0.1 2023-06-22 13:44:36,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-22 13:45:00,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1241022.0, ans=0.125 2023-06-22 13:45:02,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1241022.0, ans=0.2 2023-06-22 13:45:23,638 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.822e+02 3.704e+02 4.399e+02 5.519e+02 1.068e+03, threshold=8.797e+02, percent-clipped=5.0 2023-06-22 13:45:39,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1241142.0, ans=0.1 2023-06-22 13:45:44,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1241142.0, ans=0.5 2023-06-22 13:45:47,903 INFO [train.py:996] (2/4) Epoch 7, batch 23900, loss[loss=0.2388, simple_loss=0.3334, pruned_loss=0.07215, over 21717.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3323, pruned_loss=0.08882, over 4270107.22 frames. ], batch size: 332, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:47:13,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.76 vs. limit=10.0 2023-06-22 13:47:30,440 INFO [train.py:996] (2/4) Epoch 7, batch 23950, loss[loss=0.2702, simple_loss=0.346, pruned_loss=0.09721, over 21451.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.326, pruned_loss=0.08792, over 4268146.51 frames. 
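In the optim.py entries, the reported threshold equals Clipping_scale times the median of the grad-norm quartiles (for example 2.0 x 4.399e+02 is about 8.797e+02 above), and percent-clipped reports how often recent batches exceeded that threshold. A sketch of that bookkeeping, assuming a sliding window of recent gradient norms (the class name and window size are assumptions, not the project's optimizer code):

    # Track recent grad norms, derive a threshold of clipping_scale * median,
    # clip when exceeded, and report quartiles / percent-clipped as in the log.
    from collections import deque
    import torch

    class GradNormClipper:
        def __init__(self, clipping_scale: float = 2.0, window: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = deque(maxlen=window)
            self.clipped = deque(maxlen=window)

        def __call__(self, parameters) -> None:
            params = [p for p in parameters if p.grad is not None]
            norm = torch.norm(torch.stack([p.grad.norm() for p in params])).item()
            self.norms.append(norm)
            s = sorted(self.norms)
            quartiles = [s[int(q * (len(s) - 1))] for q in (0.0, 0.25, 0.5, 0.75, 1.0)]
            threshold = self.clipping_scale * quartiles[2]  # 2.0 * median, as in the log
            self.clipped.append(norm > threshold)
            if norm > threshold:
                for p in params:
                    p.grad.mul_(threshold / norm)
            print("grad-norm quartiles " + " ".join(f"{q:.3e}" for q in quartiles)
                  + f", threshold={threshold:.3e}"
                  + f", percent-clipped={100.0 * sum(self.clipped) / len(self.clipped):.1f}")

Such a clipper would be called between loss.backward() and optimizer.step().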
], batch size: 131, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:48:10,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1241562.0, ans=0.0 2023-06-22 13:48:33,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1241682.0, ans=0.125 2023-06-22 13:48:35,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1241682.0, ans=0.1 2023-06-22 13:48:46,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1241682.0, ans=0.125 2023-06-22 13:48:47,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.707e+02 3.506e+02 4.327e+02 5.535e+02 8.905e+02, threshold=8.653e+02, percent-clipped=1.0 2023-06-22 13:48:50,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1241682.0, ans=15.0 2023-06-22 13:48:59,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.91 vs. limit=22.5 2023-06-22 13:49:17,838 INFO [train.py:996] (2/4) Epoch 7, batch 24000, loss[loss=0.3551, simple_loss=0.3985, pruned_loss=0.1559, over 21467.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3271, pruned_loss=0.09121, over 4270021.90 frames. ], batch size: 510, lr: 4.22e-03, grad_scale: 32.0 2023-06-22 13:49:17,838 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 13:49:33,460 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2773, simple_loss=0.3696, pruned_loss=0.09254, over 1796401.00 frames. 2023-06-22 13:49:33,461 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 13:49:55,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1241862.0, ans=0.0 2023-06-22 13:50:49,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1242042.0, ans=0.125 2023-06-22 13:51:05,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1242042.0, ans=0.0 2023-06-22 13:51:16,755 INFO [train.py:996] (2/4) Epoch 7, batch 24050, loss[loss=0.1929, simple_loss=0.2844, pruned_loss=0.05074, over 21758.00 frames. ], tot_loss[loss=0.2557, simple_loss=0.3283, pruned_loss=0.09154, over 4273511.27 frames. 
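Batch 24000 above triggers a validation pass: training pauses, the dev-set loss is computed over roughly 1.8M frames, and the peak CUDA memory is reported. A rough sketch of that step (compute_loss and valid_loader are placeholders for whatever the recipe actually uses):

    import logging
    import torch

    def run_validation(model, valid_loader, device, compute_loss):
        logging.info("Computing validation loss")
        model.eval()
        tot_loss, tot_frames = 0.0, 0.0
        with torch.no_grad():
            for batch in valid_loader:
                loss, num_frames = compute_loss(model, batch, device)
                tot_loss += loss.item() * num_frames
                tot_frames += num_frames
        model.train()
        logging.info(f"validation: loss={tot_loss / tot_frames:.4f}, "
                     f"over {tot_frames:.2f} frames.")
        if torch.cuda.is_available():
            mem_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
            logging.info(f"Maximum memory allocated so far is {mem_mb}MB")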
], batch size: 247, lr: 4.22e-03, grad_scale: 16.0 2023-06-22 13:51:19,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1242102.0, ans=0.125 2023-06-22 13:51:33,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1242162.0, ans=0.1 2023-06-22 13:52:13,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1242222.0, ans=0.125 2023-06-22 13:52:35,700 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.538e+02 3.768e+02 5.092e+02 6.295e+02 1.003e+03, threshold=1.018e+03, percent-clipped=2.0 2023-06-22 13:52:44,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1242342.0, ans=0.125 2023-06-22 13:52:48,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1242342.0, ans=0.2 2023-06-22 13:52:53,723 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.20 vs. limit=15.0 2023-06-22 13:52:58,061 INFO [train.py:996] (2/4) Epoch 7, batch 24100, loss[loss=0.2879, simple_loss=0.3625, pruned_loss=0.1067, over 21693.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3278, pruned_loss=0.09007, over 4273403.33 frames. ], batch size: 351, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:53:45,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1242522.0, ans=0.125 2023-06-22 13:53:48,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1242522.0, ans=0.125 2023-06-22 13:54:38,959 INFO [train.py:996] (2/4) Epoch 7, batch 24150, loss[loss=0.2472, simple_loss=0.3151, pruned_loss=0.08965, over 21921.00 frames. ], tot_loss[loss=0.2546, simple_loss=0.3263, pruned_loss=0.09148, over 4280957.43 frames. ], batch size: 333, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:55:20,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1242822.0, ans=0.0 2023-06-22 13:55:33,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1242822.0, ans=0.1 2023-06-22 13:55:57,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.873e+02 3.729e+02 4.537e+02 5.592e+02 8.815e+02, threshold=9.074e+02, percent-clipped=0.0 2023-06-22 13:56:18,594 INFO [train.py:996] (2/4) Epoch 7, batch 24200, loss[loss=0.2113, simple_loss=0.2899, pruned_loss=0.06634, over 21420.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.328, pruned_loss=0.09209, over 4284181.43 frames. ], batch size: 131, lr: 4.22e-03, grad_scale: 8.0 2023-06-22 13:57:05,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1243122.0, ans=0.125 2023-06-22 13:57:45,862 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:58:03,625 INFO [train.py:996] (2/4) Epoch 7, batch 24250, loss[loss=0.1979, simple_loss=0.3024, pruned_loss=0.04674, over 21683.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3248, pruned_loss=0.0854, over 4286106.11 frames. 
], batch size: 414, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 13:58:06,086 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 13:58:52,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1243422.0, ans=0.0 2023-06-22 13:59:07,502 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.070e+02 3.091e+02 3.731e+02 4.711e+02 7.099e+02, threshold=7.462e+02, percent-clipped=0.0 2023-06-22 13:59:30,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1243542.0, ans=0.125 2023-06-22 13:59:33,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1243542.0, ans=0.0 2023-06-22 13:59:38,159 INFO [train.py:996] (2/4) Epoch 7, batch 24300, loss[loss=0.1607, simple_loss=0.23, pruned_loss=0.04571, over 21230.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3171, pruned_loss=0.07932, over 4282458.46 frames. ], batch size: 176, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:00:26,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1243722.0, ans=0.125 2023-06-22 14:00:33,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1243782.0, ans=0.125 2023-06-22 14:00:43,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1243782.0, ans=0.125 2023-06-22 14:01:14,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1243842.0, ans=0.1 2023-06-22 14:01:17,184 INFO [train.py:996] (2/4) Epoch 7, batch 24350, loss[loss=0.2555, simple_loss=0.3242, pruned_loss=0.09344, over 21897.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3129, pruned_loss=0.07894, over 4286022.95 frames. ], batch size: 316, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:01:23,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1243902.0, ans=0.125 2023-06-22 14:01:34,502 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:02:30,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.102e+02 3.282e+02 3.867e+02 5.054e+02 1.141e+03, threshold=7.734e+02, percent-clipped=5.0 2023-06-22 14:02:56,600 INFO [train.py:996] (2/4) Epoch 7, batch 24400, loss[loss=0.2167, simple_loss=0.295, pruned_loss=0.06925, over 20665.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3191, pruned_loss=0.08296, over 4284725.38 frames. ], batch size: 607, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:03:11,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1244202.0, ans=0.125 2023-06-22 14:03:11,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1244202.0, ans=0.0 2023-06-22 14:04:22,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1244442.0, ans=0.125 2023-06-22 14:04:42,174 INFO [train.py:996] (2/4) Epoch 7, batch 24450, loss[loss=0.2099, simple_loss=0.2884, pruned_loss=0.06577, over 21286.00 frames. 
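The grad_scale reported with each batch steps between 8.0, 16.0 and 32.0 by factors of two in this stretch, which is the usual behaviour of dynamic loss scaling for fp16 training: the scale is doubled after a run of finite-gradient steps and halved when an overflow is detected. A generic sketch using PyTorch's GradScaler (the model, optimizer and compute_loss objects are placeholders, and the scaler settings are not read from this log):

    import torch
    from torch.cuda.amp import GradScaler, autocast

    scaler = GradScaler(init_scale=16.0, growth_factor=2.0, backoff_factor=0.5)

    def training_step(model, optimizer, batch, compute_loss):
        optimizer.zero_grad()
        with autocast():                  # forward pass in mixed precision
            loss = compute_loss(model, batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)            # skipped internally if grads overflowed
        scaler.update()                   # grows or shrinks the scale
        return loss.detach(), scaler.get_scale()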
], tot_loss[loss=0.2451, simple_loss=0.3213, pruned_loss=0.08448, over 4282959.52 frames. ], batch size: 159, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:04:50,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1244502.0, ans=0.125 2023-06-22 14:05:57,128 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.669e+02 3.418e+02 4.354e+02 5.626e+02 1.189e+03, threshold=8.709e+02, percent-clipped=8.0 2023-06-22 14:06:22,864 INFO [train.py:996] (2/4) Epoch 7, batch 24500, loss[loss=0.2288, simple_loss=0.3076, pruned_loss=0.07496, over 21917.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3224, pruned_loss=0.08459, over 4283616.24 frames. ], batch size: 316, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:07:08,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1244922.0, ans=0.125 2023-06-22 14:07:22,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1244922.0, ans=0.125 2023-06-22 14:07:34,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1244982.0, ans=0.0 2023-06-22 14:07:57,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1245042.0, ans=0.0 2023-06-22 14:08:08,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1245042.0, ans=0.015 2023-06-22 14:08:08,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1245042.0, ans=0.125 2023-06-22 14:08:08,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1245042.0, ans=0.125 2023-06-22 14:08:11,131 INFO [train.py:996] (2/4) Epoch 7, batch 24550, loss[loss=0.2573, simple_loss=0.3312, pruned_loss=0.09171, over 21589.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3236, pruned_loss=0.08648, over 4287953.62 frames. ], batch size: 263, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:08:17,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1245102.0, ans=0.0 2023-06-22 14:08:26,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.89 vs. 
limit=6.0 2023-06-22 14:08:43,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1245162.0, ans=0.125 2023-06-22 14:08:43,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1245162.0, ans=0.1 2023-06-22 14:09:22,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1245282.0, ans=0.2 2023-06-22 14:09:25,435 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 3.383e+02 4.024e+02 4.711e+02 8.418e+02, threshold=8.048e+02, percent-clipped=0.0 2023-06-22 14:09:35,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1245342.0, ans=0.125 2023-06-22 14:09:37,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1245342.0, ans=0.025 2023-06-22 14:09:51,558 INFO [train.py:996] (2/4) Epoch 7, batch 24600, loss[loss=0.3123, simple_loss=0.3522, pruned_loss=0.1361, over 21359.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3227, pruned_loss=0.08865, over 4291817.39 frames. ], batch size: 507, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:10:31,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1245522.0, ans=15.0 2023-06-22 14:10:32,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1245522.0, ans=0.125 2023-06-22 14:10:55,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-22 14:11:04,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1245582.0, ans=0.025 2023-06-22 14:11:32,120 INFO [train.py:996] (2/4) Epoch 7, batch 24650, loss[loss=0.2422, simple_loss=0.3072, pruned_loss=0.08854, over 21245.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3173, pruned_loss=0.08808, over 4278784.25 frames. ], batch size: 159, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:11:35,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1245702.0, ans=0.125 2023-06-22 14:12:46,321 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.556e+02 4.332e+02 6.409e+02 1.192e+03, threshold=8.664e+02, percent-clipped=12.0 2023-06-22 14:12:53,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1245942.0, ans=0.2 2023-06-22 14:13:12,566 INFO [train.py:996] (2/4) Epoch 7, batch 24700, loss[loss=0.2113, simple_loss=0.2811, pruned_loss=0.07075, over 21737.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3144, pruned_loss=0.08591, over 4285233.29 frames. 
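The scaling.py:182 entries print ScheduledFloat values as a function of batch_count: skip rates that have decayed to 0.0 by this point in training, dropout probabilities sitting at 0.1, bypass scale minimums at 0.2, and so on. A plausible reading is a hyper-parameter interpolated along a schedule of (batch_count, value) breakpoints; the piecewise-linear form and the example breakpoints below are my assumptions, not something stated in the log:

    import bisect

    class ScheduledFloat:
        """A float whose value depends on the number of training batches seen."""
        def __init__(self, *points):
            # points: (batch_count, value) pairs in increasing batch_count order
            self.xs = [p[0] for p in points]
            self.ys = [p[1] for p in points]

        def value(self, batch_count: float) -> float:
            if batch_count <= self.xs[0]:
                return self.ys[0]
            if batch_count >= self.xs[-1]:
                return self.ys[-1]
            i = bisect.bisect_right(self.xs, batch_count)
            x0, x1 = self.xs[i - 1], self.xs[i]
            y0, y1 = self.ys[i - 1], self.ys[i]
            return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

    conv_skip_rate = ScheduledFloat((0.0, 0.2), (20000.0, 0.0))
    print(conv_skip_rate.value(1245102.0))   # -> 0.0, as in the conv_skip_rate entries above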
], batch size: 112, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:13:37,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1246062.0, ans=0.125 2023-06-22 14:13:37,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1246062.0, ans=0.125 2023-06-22 14:13:43,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1246062.0, ans=0.04949747468305833 2023-06-22 14:14:36,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1246242.0, ans=0.0 2023-06-22 14:14:54,467 INFO [train.py:996] (2/4) Epoch 7, batch 24750, loss[loss=0.2213, simple_loss=0.2798, pruned_loss=0.08138, over 21329.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3065, pruned_loss=0.08317, over 4262152.77 frames. ], batch size: 160, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:15:03,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.95 vs. limit=15.0 2023-06-22 14:15:57,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-22 14:16:06,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.17 vs. limit=22.5 2023-06-22 14:16:07,199 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.368e+02 3.441e+02 4.025e+02 5.655e+02 9.587e+02, threshold=8.050e+02, percent-clipped=3.0 2023-06-22 14:16:21,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.71 vs. limit=10.0 2023-06-22 14:16:33,207 INFO [train.py:996] (2/4) Epoch 7, batch 24800, loss[loss=0.2266, simple_loss=0.2942, pruned_loss=0.07951, over 21842.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3015, pruned_loss=0.08264, over 4269915.33 frames. ], batch size: 333, lr: 4.21e-03, grad_scale: 32.0 2023-06-22 14:16:52,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1246662.0, ans=0.0 2023-06-22 14:17:55,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1246842.0, ans=0.125 2023-06-22 14:17:58,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1246842.0, ans=0.2 2023-06-22 14:18:00,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1246842.0, ans=0.1 2023-06-22 14:18:14,875 INFO [train.py:996] (2/4) Epoch 7, batch 24850, loss[loss=0.1819, simple_loss=0.2476, pruned_loss=0.05815, over 21828.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.302, pruned_loss=0.08439, over 4274617.31 frames. 
], batch size: 118, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:19:31,871 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.875e+02 3.674e+02 4.443e+02 5.678e+02 1.334e+03, threshold=8.887e+02, percent-clipped=8.0 2023-06-22 14:19:33,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1247142.0, ans=0.2 2023-06-22 14:19:35,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1247142.0, ans=0.125 2023-06-22 14:19:56,844 INFO [train.py:996] (2/4) Epoch 7, batch 24900, loss[loss=0.2516, simple_loss=0.3185, pruned_loss=0.09229, over 21583.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3065, pruned_loss=0.08575, over 4277762.16 frames. ], batch size: 230, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:19:58,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1247202.0, ans=0.125 2023-06-22 14:20:25,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=10.05 vs. limit=15.0 2023-06-22 14:20:40,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-22 14:21:24,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1247442.0, ans=0.95 2023-06-22 14:21:43,152 INFO [train.py:996] (2/4) Epoch 7, batch 24950, loss[loss=0.2801, simple_loss=0.3408, pruned_loss=0.1097, over 21819.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3141, pruned_loss=0.08971, over 4277173.47 frames. ], batch size: 282, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:21:53,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1247502.0, ans=0.0 2023-06-22 14:22:31,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1247622.0, ans=0.125 2023-06-22 14:22:50,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1247682.0, ans=0.0 2023-06-22 14:23:02,241 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 3.624e+02 4.184e+02 5.539e+02 8.778e+02, threshold=8.368e+02, percent-clipped=0.0 2023-06-22 14:23:15,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1247742.0, ans=0.1 2023-06-22 14:23:17,371 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.74 vs. limit=22.5 2023-06-22 14:23:20,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1247742.0, ans=0.0 2023-06-22 14:23:23,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1247742.0, ans=0.125 2023-06-22 14:23:26,302 INFO [train.py:996] (2/4) Epoch 7, batch 25000, loss[loss=0.2042, simple_loss=0.2677, pruned_loss=0.07033, over 21381.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3184, pruned_loss=0.09073, over 4279975.55 frames. 
], batch size: 211, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:23:28,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1247802.0, ans=0.125 2023-06-22 14:23:55,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1247862.0, ans=0.125 2023-06-22 14:24:07,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=22.5 2023-06-22 14:24:21,067 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:25:10,069 INFO [train.py:996] (2/4) Epoch 7, batch 25050, loss[loss=0.2152, simple_loss=0.2744, pruned_loss=0.07796, over 21521.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3126, pruned_loss=0.0895, over 4270310.85 frames. ], batch size: 195, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:25:13,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1248102.0, ans=0.2 2023-06-22 14:26:28,133 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.523e+02 3.310e+02 3.900e+02 4.690e+02 8.099e+02, threshold=7.799e+02, percent-clipped=0.0 2023-06-22 14:26:35,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1248342.0, ans=0.0 2023-06-22 14:26:46,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1248342.0, ans=0.0 2023-06-22 14:26:49,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1248402.0, ans=0.0 2023-06-22 14:26:51,215 INFO [train.py:996] (2/4) Epoch 7, batch 25100, loss[loss=0.2178, simple_loss=0.2721, pruned_loss=0.08179, over 21257.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.307, pruned_loss=0.08765, over 4267404.95 frames. ], batch size: 176, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:28:18,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1248642.0, ans=0.2 2023-06-22 14:28:25,028 INFO [train.py:996] (2/4) Epoch 7, batch 25150, loss[loss=0.2133, simple_loss=0.3022, pruned_loss=0.06218, over 21439.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3102, pruned_loss=0.08586, over 4254205.32 frames. ], batch size: 131, lr: 4.21e-03, grad_scale: 8.0 2023-06-22 14:29:31,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-22 14:29:33,781 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:29:48,296 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.302e+02 3.375e+02 4.359e+02 5.711e+02 9.666e+02, threshold=8.717e+02, percent-clipped=5.0 2023-06-22 14:30:06,541 INFO [train.py:996] (2/4) Epoch 7, batch 25200, loss[loss=0.2116, simple_loss=0.2915, pruned_loss=0.06588, over 21698.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3088, pruned_loss=0.08267, over 4249874.94 frames. ], batch size: 298, lr: 4.21e-03, grad_scale: 16.0 2023-06-22 14:30:23,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.30 vs. 
limit=22.5 2023-06-22 14:31:41,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1249242.0, ans=0.2 2023-06-22 14:31:45,943 INFO [train.py:996] (2/4) Epoch 7, batch 25250, loss[loss=0.1878, simple_loss=0.2652, pruned_loss=0.05526, over 21142.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3058, pruned_loss=0.08082, over 4252066.11 frames. ], batch size: 548, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:32:39,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1249422.0, ans=0.125 2023-06-22 14:32:51,888 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.73 vs. limit=15.0 2023-06-22 14:33:04,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.324e+02 3.268e+02 4.285e+02 6.443e+02 1.274e+03, threshold=8.570e+02, percent-clipped=9.0 2023-06-22 14:33:16,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1249542.0, ans=0.0 2023-06-22 14:33:32,036 INFO [train.py:996] (2/4) Epoch 7, batch 25300, loss[loss=0.2051, simple_loss=0.2922, pruned_loss=0.05901, over 21766.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3044, pruned_loss=0.08093, over 4256055.51 frames. ], batch size: 351, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:33:36,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1249602.0, ans=0.0 2023-06-22 14:33:42,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.56 vs. limit=12.0 2023-06-22 14:33:50,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.68 vs. limit=12.0 2023-06-22 14:34:13,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-22 14:34:41,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1249782.0, ans=0.125 2023-06-22 14:35:02,465 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-22 14:35:12,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.36 vs. limit=22.5 2023-06-22 14:35:12,498 INFO [train.py:996] (2/4) Epoch 7, batch 25350, loss[loss=0.1992, simple_loss=0.2933, pruned_loss=0.05258, over 21748.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3083, pruned_loss=0.08054, over 4255333.01 frames. ], batch size: 332, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:35:34,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1249962.0, ans=0.1 2023-06-22 14:35:35,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.49 vs. 
limit=10.0 2023-06-22 14:35:38,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1249962.0, ans=0.125 2023-06-22 14:36:25,525 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 1.905e+02 3.273e+02 3.948e+02 5.188e+02 1.185e+03, threshold=7.895e+02, percent-clipped=3.0 2023-06-22 14:36:30,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1250142.0, ans=0.95 2023-06-22 14:36:47,857 INFO [train.py:996] (2/4) Epoch 7, batch 25400, loss[loss=0.1852, simple_loss=0.2687, pruned_loss=0.05086, over 21608.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3038, pruned_loss=0.07976, over 4263769.51 frames. ], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:37:00,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1250202.0, ans=0.5 2023-06-22 14:37:05,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1250202.0, ans=0.125 2023-06-22 14:37:20,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1250262.0, ans=0.0 2023-06-22 14:37:20,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1250262.0, ans=0.0 2023-06-22 14:38:04,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=22.5 2023-06-22 14:38:33,451 INFO [train.py:996] (2/4) Epoch 7, batch 25450, loss[loss=0.258, simple_loss=0.3453, pruned_loss=0.0853, over 21811.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3041, pruned_loss=0.08111, over 4276611.83 frames. ], batch size: 371, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:39:09,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1250562.0, ans=0.125 2023-06-22 14:39:47,980 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.452e+02 3.153e+02 3.590e+02 4.840e+02 1.017e+03, threshold=7.180e+02, percent-clipped=5.0 2023-06-22 14:39:50,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1250742.0, ans=0.04949747468305833 2023-06-22 14:40:16,008 INFO [train.py:996] (2/4) Epoch 7, batch 25500, loss[loss=0.2388, simple_loss=0.3277, pruned_loss=0.07497, over 21754.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3047, pruned_loss=0.07765, over 4265383.77 frames. 
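The Whitening entries compare a per-module statistic ("metric") against a limit, apparently measuring how far a module's output is from having a white (decorrelated, equal-variance) channel covariance. The exact statistic is not visible in the log; the sketch below uses one plausible proxy, the largest eigenvalue of the channel covariance divided by the mean eigenvalue:

    import torch

    def whitening_metric(x: torch.Tensor) -> float:
        # Largest eigenvalue of the channel covariance over the mean eigenvalue.
        # A perfectly "white" output gives a value near 1; strong correlations or
        # dominant channels push it up. Only one possible choice of metric.
        x = x.reshape(-1, x.shape[-1]).float()
        x = x - x.mean(dim=0, keepdim=True)
        cov = (x.t() @ x) / max(x.shape[0] - 1, 1)
        eigvals = torch.linalg.eigvalsh(cov)
        return float(eigvals.max() / eigvals.mean().clamp(min=1e-20))

    feats = torch.randn(1000, 192)            # stand-in for a module's output
    limit = 15.0                              # example limit value seen in the log
    print(f"Whitening: num_channels=192, metric={whitening_metric(feats):.2f} vs. limit={limit}")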
], batch size: 351, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:40:19,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1250802.0, ans=0.1 2023-06-22 14:40:45,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1250862.0, ans=0.125 2023-06-22 14:41:44,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1251042.0, ans=0.0 2023-06-22 14:41:44,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1251042.0, ans=0.125 2023-06-22 14:41:57,262 INFO [train.py:996] (2/4) Epoch 7, batch 25550, loss[loss=0.2231, simple_loss=0.3013, pruned_loss=0.0724, over 21202.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3124, pruned_loss=0.07849, over 4268291.57 frames. ], batch size: 159, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:43:12,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.52 vs. limit=12.0 2023-06-22 14:43:16,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.404e+02 3.498e+02 4.719e+02 6.187e+02 1.035e+03, threshold=9.438e+02, percent-clipped=13.0 2023-06-22 14:43:18,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1251342.0, ans=0.125 2023-06-22 14:43:20,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1251342.0, ans=0.0 2023-06-22 14:43:39,330 INFO [train.py:996] (2/4) Epoch 7, batch 25600, loss[loss=0.2569, simple_loss=0.3391, pruned_loss=0.08737, over 21340.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3166, pruned_loss=0.07992, over 4278225.91 frames. ], batch size: 131, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:44:20,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1251522.0, ans=0.0 2023-06-22 14:44:48,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.96 vs. limit=10.0 2023-06-22 14:45:20,806 INFO [train.py:996] (2/4) Epoch 7, batch 25650, loss[loss=0.2084, simple_loss=0.2759, pruned_loss=0.07047, over 21764.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3165, pruned_loss=0.08235, over 4271841.05 frames. ], batch size: 317, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:45:32,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.69 vs. limit=15.0 2023-06-22 14:46:07,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1251822.0, ans=0.0 2023-06-22 14:46:09,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1251822.0, ans=0.0 2023-06-22 14:46:12,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=9.35 vs. 
limit=15.0 2023-06-22 14:46:37,525 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.765e+02 3.700e+02 4.314e+02 5.225e+02 1.015e+03, threshold=8.627e+02, percent-clipped=1.0 2023-06-22 14:47:05,268 INFO [train.py:996] (2/4) Epoch 7, batch 25700, loss[loss=0.303, simple_loss=0.3599, pruned_loss=0.123, over 21579.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3138, pruned_loss=0.08369, over 4269591.06 frames. ], batch size: 471, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:47:10,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1252002.0, ans=0.0 2023-06-22 14:47:23,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1252062.0, ans=0.1 2023-06-22 14:48:47,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=22.5 2023-06-22 14:48:47,653 INFO [train.py:996] (2/4) Epoch 7, batch 25750, loss[loss=0.262, simple_loss=0.3397, pruned_loss=0.09212, over 21543.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3185, pruned_loss=0.08635, over 4268421.42 frames. ], batch size: 414, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:49:43,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1252422.0, ans=0.0 2023-06-22 14:49:52,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1252482.0, ans=0.125 2023-06-22 14:50:10,483 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:50:13,043 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 3.834e+02 4.976e+02 6.309e+02 1.046e+03, threshold=9.952e+02, percent-clipped=8.0 2023-06-22 14:50:28,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1252542.0, ans=0.1 2023-06-22 14:50:31,419 INFO [train.py:996] (2/4) Epoch 7, batch 25800, loss[loss=0.285, simple_loss=0.3653, pruned_loss=0.1023, over 20710.00 frames. ], tot_loss[loss=0.2576, simple_loss=0.3319, pruned_loss=0.09165, over 4263220.50 frames. ], batch size: 607, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:50:45,107 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:52:13,824 INFO [train.py:996] (2/4) Epoch 7, batch 25850, loss[loss=0.2291, simple_loss=0.2974, pruned_loss=0.08045, over 20146.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3318, pruned_loss=0.09038, over 4269413.11 frames. ], batch size: 702, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:52:54,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.87 vs. 
limit=6.0 2023-06-22 14:53:04,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1253022.0, ans=0.125 2023-06-22 14:53:06,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1253022.0, ans=0.2 2023-06-22 14:53:09,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1253022.0, ans=0.125 2023-06-22 14:53:39,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.599e+02 3.379e+02 4.198e+02 5.392e+02 1.106e+03, threshold=8.396e+02, percent-clipped=2.0 2023-06-22 14:54:05,887 INFO [train.py:996] (2/4) Epoch 7, batch 25900, loss[loss=0.2738, simple_loss=0.3868, pruned_loss=0.08043, over 20938.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3319, pruned_loss=0.09073, over 4276287.33 frames. ], batch size: 607, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:54:10,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1253202.0, ans=0.2 2023-06-22 14:55:52,695 INFO [train.py:996] (2/4) Epoch 7, batch 25950, loss[loss=0.2439, simple_loss=0.3262, pruned_loss=0.08078, over 21674.00 frames. ], tot_loss[loss=0.2619, simple_loss=0.338, pruned_loss=0.09292, over 4268774.36 frames. ], batch size: 298, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:56:19,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1253562.0, ans=0.125 2023-06-22 14:56:23,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.27 vs. limit=15.0 2023-06-22 14:56:24,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1253562.0, ans=10.0 2023-06-22 14:56:48,101 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=22.5 2023-06-22 14:57:11,653 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.576e+02 3.589e+02 4.463e+02 5.427e+02 9.904e+02, threshold=8.927e+02, percent-clipped=3.0 2023-06-22 14:57:35,022 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 14:57:37,677 INFO [train.py:996] (2/4) Epoch 7, batch 26000, loss[loss=0.243, simple_loss=0.3204, pruned_loss=0.08281, over 21269.00 frames. ], tot_loss[loss=0.2596, simple_loss=0.3364, pruned_loss=0.09143, over 4267792.38 frames. ], batch size: 176, lr: 4.20e-03, grad_scale: 32.0 2023-06-22 14:58:02,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1253862.0, ans=0.125 2023-06-22 14:59:18,542 INFO [train.py:996] (2/4) Epoch 7, batch 26050, loss[loss=0.2622, simple_loss=0.3323, pruned_loss=0.09612, over 21896.00 frames. ], tot_loss[loss=0.2607, simple_loss=0.3356, pruned_loss=0.09288, over 4277269.57 frames. ], batch size: 124, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 14:59:37,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.11 vs. 
limit=15.0 2023-06-22 15:00:29,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1254342.0, ans=0.0 2023-06-22 15:00:37,205 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.714e+02 3.622e+02 4.374e+02 5.321e+02 8.486e+02, threshold=8.748e+02, percent-clipped=0.0 2023-06-22 15:00:56,279 INFO [train.py:996] (2/4) Epoch 7, batch 26100, loss[loss=0.2103, simple_loss=0.2838, pruned_loss=0.06839, over 21458.00 frames. ], tot_loss[loss=0.2578, simple_loss=0.331, pruned_loss=0.09228, over 4273594.61 frames. ], batch size: 131, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:01:06,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-22 15:01:14,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1254462.0, ans=0.2 2023-06-22 15:01:18,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.91 vs. limit=6.0 2023-06-22 15:01:34,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1254522.0, ans=0.1 2023-06-22 15:01:40,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1254522.0, ans=0.125 2023-06-22 15:01:40,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1254522.0, ans=0.125 2023-06-22 15:01:50,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.27 vs. limit=15.0 2023-06-22 15:02:32,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1254642.0, ans=0.0 2023-06-22 15:02:36,844 INFO [train.py:996] (2/4) Epoch 7, batch 26150, loss[loss=0.2758, simple_loss=0.3372, pruned_loss=0.1072, over 20948.00 frames. ], tot_loss[loss=0.2573, simple_loss=0.3288, pruned_loss=0.09289, over 4282618.81 frames. ], batch size: 608, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:03:00,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-22 15:03:02,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1254762.0, ans=0.125 2023-06-22 15:03:23,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1254822.0, ans=0.1 2023-06-22 15:04:01,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1254942.0, ans=0.125 2023-06-22 15:04:04,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.200e+02 3.625e+02 4.544e+02 6.697e+02, threshold=7.250e+02, percent-clipped=0.0 2023-06-22 15:04:19,708 INFO [train.py:996] (2/4) Epoch 7, batch 26200, loss[loss=0.2336, simple_loss=0.3351, pruned_loss=0.06603, over 21650.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3289, pruned_loss=0.08997, over 4282491.87 frames. 
], batch size: 263, lr: 4.20e-03, grad_scale: 16.0 2023-06-22 15:04:33,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1255002.0, ans=0.0 2023-06-22 15:04:59,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1255122.0, ans=0.1 2023-06-22 15:05:55,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1255242.0, ans=0.125 2023-06-22 15:05:58,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255302.0, ans=0.1 2023-06-22 15:05:59,766 INFO [train.py:996] (2/4) Epoch 7, batch 26250, loss[loss=0.2389, simple_loss=0.303, pruned_loss=0.08736, over 21472.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3319, pruned_loss=0.08901, over 4282280.60 frames. ], batch size: 194, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:06:04,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1255302.0, ans=0.1 2023-06-22 15:06:44,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1255422.0, ans=0.0 2023-06-22 15:06:53,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-22 15:07:21,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1255542.0, ans=0.125 2023-06-22 15:07:24,249 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.543e+02 3.568e+02 4.427e+02 6.029e+02 1.421e+03, threshold=8.855e+02, percent-clipped=13.0 2023-06-22 15:07:37,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1255602.0, ans=0.2 2023-06-22 15:07:38,964 INFO [train.py:996] (2/4) Epoch 7, batch 26300, loss[loss=0.2482, simple_loss=0.3135, pruned_loss=0.09149, over 21932.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3287, pruned_loss=0.08926, over 4284451.16 frames. ], batch size: 414, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:08:07,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1255662.0, ans=0.125 2023-06-22 15:08:31,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1255722.0, ans=0.125 2023-06-22 15:09:11,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1255842.0, ans=0.07 2023-06-22 15:09:19,814 INFO [train.py:996] (2/4) Epoch 7, batch 26350, loss[loss=0.2756, simple_loss=0.342, pruned_loss=0.1046, over 21308.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3274, pruned_loss=0.08973, over 4282174.78 frames. 
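Across this span the learning rate printed with each batch drifts from 4.22e-03 down to 4.19e-03, moving by one step roughly every thousand batches, which points to a schedule that decays smoothly with the number of batches seen rather than in discrete stages. The functional form and constants below are assumptions chosen only to illustrate that kind of slow decay:

    def scheduled_lr(base_lr: float, batch_idx: int, decay_batches: float = 300_000.0) -> float:
        # Hypothetical smooth decay on the total batch count; not read from the log.
        return base_lr * (1.0 + batch_idx / decay_batches) ** -0.5

    print(f"{scheduled_lr(1.0e-2, 1_250_000):.2e}")   # a slowly decaying value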
], batch size: 176, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:10:06,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1256022.0, ans=0.1 2023-06-22 15:10:24,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1256022.0, ans=0.125 2023-06-22 15:10:24,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1256022.0, ans=0.0 2023-06-22 15:10:45,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.781e+02 3.544e+02 4.038e+02 5.377e+02 1.137e+03, threshold=8.075e+02, percent-clipped=4.0 2023-06-22 15:11:00,707 INFO [train.py:996] (2/4) Epoch 7, batch 26400, loss[loss=0.2266, simple_loss=0.2778, pruned_loss=0.08773, over 21208.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3213, pruned_loss=0.08994, over 4270789.98 frames. ], batch size: 144, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:11:02,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1256202.0, ans=0.125 2023-06-22 15:11:40,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1256262.0, ans=0.125 2023-06-22 15:12:40,471 INFO [train.py:996] (2/4) Epoch 7, batch 26450, loss[loss=0.2651, simple_loss=0.3941, pruned_loss=0.06807, over 20767.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3224, pruned_loss=0.08956, over 4263023.07 frames. ], batch size: 607, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:12:58,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1256502.0, ans=0.1 2023-06-22 15:13:25,260 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:14:08,659 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.783e+02 3.688e+02 4.621e+02 8.318e+02 2.033e+03, threshold=9.242e+02, percent-clipped=27.0 2023-06-22 15:14:24,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1256742.0, ans=0.0 2023-06-22 15:14:37,752 INFO [train.py:996] (2/4) Epoch 7, batch 26500, loss[loss=0.1979, simple_loss=0.2543, pruned_loss=0.07071, over 21793.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3233, pruned_loss=0.08808, over 4261712.57 frames. ], batch size: 102, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:15:15,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1256922.0, ans=0.125 2023-06-22 15:15:38,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-22 15:16:27,013 INFO [train.py:996] (2/4) Epoch 7, batch 26550, loss[loss=0.1998, simple_loss=0.2939, pruned_loss=0.05281, over 21736.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.32, pruned_loss=0.08539, over 4261812.99 frames. 
], batch size: 332, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:16:47,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1257162.0, ans=0.1 2023-06-22 15:16:50,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1257162.0, ans=0.125 2023-06-22 15:16:53,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1257162.0, ans=0.125 2023-06-22 15:17:55,907 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.468e+02 3.702e+02 5.125e+02 7.307e+02 1.235e+03, threshold=1.025e+03, percent-clipped=15.0 2023-06-22 15:18:08,958 INFO [train.py:996] (2/4) Epoch 7, batch 26600, loss[loss=0.2235, simple_loss=0.2958, pruned_loss=0.07564, over 21588.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3195, pruned_loss=0.08185, over 4257771.35 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:18:21,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1257402.0, ans=0.2 2023-06-22 15:18:29,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1257462.0, ans=0.1 2023-06-22 15:19:18,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1257582.0, ans=0.1 2023-06-22 15:19:18,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1257582.0, ans=0.0 2023-06-22 15:19:28,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.75 vs. limit=15.0 2023-06-22 15:19:45,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1257702.0, ans=0.0 2023-06-22 15:19:46,580 INFO [train.py:996] (2/4) Epoch 7, batch 26650, loss[loss=0.1738, simple_loss=0.2657, pruned_loss=0.04098, over 21803.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3118, pruned_loss=0.08064, over 4256053.68 frames. ], batch size: 352, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:20:40,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1257822.0, ans=0.125 2023-06-22 15:20:42,164 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-22 15:21:12,601 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.157e+02 3.375e+02 4.059e+02 5.279e+02 9.271e+02, threshold=8.118e+02, percent-clipped=0.0 2023-06-22 15:21:14,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1257942.0, ans=0.125 2023-06-22 15:21:16,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1257942.0, ans=0.5 2023-06-22 15:21:25,458 INFO [train.py:996] (2/4) Epoch 7, batch 26700, loss[loss=0.2267, simple_loss=0.2908, pruned_loss=0.08133, over 21791.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3058, pruned_loss=0.07847, over 4256164.41 frames. 
], batch size: 247, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:22:02,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1258122.0, ans=0.125 2023-06-22 15:22:15,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-22 15:23:08,478 INFO [train.py:996] (2/4) Epoch 7, batch 26750, loss[loss=0.2424, simple_loss=0.3215, pruned_loss=0.08164, over 21677.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3056, pruned_loss=0.07762, over 4257446.78 frames. ], batch size: 298, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:23:22,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.00 vs. limit=10.0 2023-06-22 15:24:37,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.356e+02 3.345e+02 4.329e+02 5.688e+02 1.116e+03, threshold=8.657e+02, percent-clipped=5.0 2023-06-22 15:24:48,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1258602.0, ans=0.0 2023-06-22 15:24:48,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1258602.0, ans=0.2 2023-06-22 15:24:50,376 INFO [train.py:996] (2/4) Epoch 7, batch 26800, loss[loss=0.3076, simple_loss=0.369, pruned_loss=0.1231, over 21445.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3134, pruned_loss=0.08243, over 4261965.61 frames. ], batch size: 471, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:25:14,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1258662.0, ans=0.125 2023-06-22 15:26:00,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.43 vs. limit=6.0 2023-06-22 15:26:05,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1258782.0, ans=0.0 2023-06-22 15:26:18,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-06-22 15:26:27,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=12.0 2023-06-22 15:26:37,054 INFO [train.py:996] (2/4) Epoch 7, batch 26850, loss[loss=0.2215, simple_loss=0.2743, pruned_loss=0.08431, over 21249.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3141, pruned_loss=0.08462, over 4267261.31 frames. 
], batch size: 159, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:27:11,478 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:27:27,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1259022.0, ans=0.0 2023-06-22 15:27:37,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1259022.0, ans=0.125 2023-06-22 15:27:38,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1259082.0, ans=0.125 2023-06-22 15:27:47,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1259082.0, ans=0.125 2023-06-22 15:27:59,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 3.440e+02 3.815e+02 4.750e+02 7.886e+02, threshold=7.630e+02, percent-clipped=0.0 2023-06-22 15:28:17,071 INFO [train.py:996] (2/4) Epoch 7, batch 26900, loss[loss=0.2027, simple_loss=0.2644, pruned_loss=0.07046, over 21637.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3062, pruned_loss=0.08404, over 4263669.24 frames. ], batch size: 298, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:29:04,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1259322.0, ans=0.0 2023-06-22 15:29:25,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=12.0 2023-06-22 15:29:56,674 INFO [train.py:996] (2/4) Epoch 7, batch 26950, loss[loss=0.2703, simple_loss=0.3499, pruned_loss=0.09541, over 21604.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3062, pruned_loss=0.08443, over 4270747.11 frames. ], batch size: 414, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:30:03,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1259502.0, ans=0.1 2023-06-22 15:30:27,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1259562.0, ans=0.025 2023-06-22 15:30:31,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1259562.0, ans=0.5 2023-06-22 15:30:45,104 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.815e-03 2023-06-22 15:30:46,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1259622.0, ans=0.0 2023-06-22 15:31:12,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1259682.0, ans=0.125 2023-06-22 15:31:17,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.01 vs. 
limit=15.0 2023-06-22 15:31:18,838 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.384e+02 3.421e+02 4.204e+02 5.403e+02 1.152e+03, threshold=8.409e+02, percent-clipped=7.0 2023-06-22 15:31:33,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1259742.0, ans=0.0 2023-06-22 15:31:40,910 INFO [train.py:996] (2/4) Epoch 7, batch 27000, loss[loss=0.2277, simple_loss=0.3064, pruned_loss=0.07447, over 21625.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3088, pruned_loss=0.0828, over 4276743.99 frames. ], batch size: 263, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:31:40,910 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 15:31:59,803 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2427, simple_loss=0.3424, pruned_loss=0.07152, over 1796401.00 frames. 2023-06-22 15:31:59,804 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 15:32:03,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1259802.0, ans=0.125 2023-06-22 15:32:37,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1259922.0, ans=0.0 2023-06-22 15:33:24,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1260042.0, ans=0.2 2023-06-22 15:33:38,743 INFO [train.py:996] (2/4) Epoch 7, batch 27050, loss[loss=0.243, simple_loss=0.3244, pruned_loss=0.08079, over 21867.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3117, pruned_loss=0.07955, over 4283910.33 frames. ], batch size: 316, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:33:53,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1260102.0, ans=0.125 2023-06-22 15:34:01,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1260162.0, ans=0.2 2023-06-22 15:34:19,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1260222.0, ans=0.0 2023-06-22 15:35:07,260 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.166e+02 3.078e+02 3.767e+02 4.545e+02 7.806e+02, threshold=7.533e+02, percent-clipped=0.0 2023-06-22 15:35:23,460 INFO [train.py:996] (2/4) Epoch 7, batch 27100, loss[loss=0.2354, simple_loss=0.3223, pruned_loss=0.07423, over 21228.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3138, pruned_loss=0.08036, over 4285345.19 frames. ], batch size: 159, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:35:43,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1260462.0, ans=0.125 2023-06-22 15:36:10,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-22 15:36:50,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1260642.0, ans=0.125 2023-06-22 15:37:02,875 INFO [train.py:996] (2/4) Epoch 7, batch 27150, loss[loss=0.249, simple_loss=0.3336, pruned_loss=0.08217, over 21263.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3258, pruned_loss=0.08416, over 4284831.18 frames. 
], batch size: 176, lr: 4.19e-03, grad_scale: 16.0 2023-06-22 15:37:16,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-22 15:37:27,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1260762.0, ans=0.0 2023-06-22 15:38:31,261 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.782e+02 4.019e+02 5.172e+02 7.242e+02 1.500e+03, threshold=1.034e+03, percent-clipped=23.0 2023-06-22 15:38:35,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-22 15:38:42,499 INFO [train.py:996] (2/4) Epoch 7, batch 27200, loss[loss=0.2923, simple_loss=0.3706, pruned_loss=0.107, over 21748.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3343, pruned_loss=0.08727, over 4285365.90 frames. ], batch size: 351, lr: 4.19e-03, grad_scale: 32.0 2023-06-22 15:38:49,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1261002.0, ans=0.125 2023-06-22 15:38:52,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1261002.0, ans=0.125 2023-06-22 15:38:57,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1261062.0, ans=0.04949747468305833 2023-06-22 15:38:57,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1261062.0, ans=0.1 2023-06-22 15:38:58,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1261062.0, ans=0.2 2023-06-22 15:39:06,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1261062.0, ans=0.0 2023-06-22 15:39:33,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.37 vs. limit=10.0 2023-06-22 15:40:23,219 INFO [train.py:996] (2/4) Epoch 7, batch 27250, loss[loss=0.309, simple_loss=0.3661, pruned_loss=0.1259, over 21328.00 frames. ], tot_loss[loss=0.259, simple_loss=0.3363, pruned_loss=0.09083, over 4279139.36 frames. ], batch size: 159, lr: 4.18e-03, grad_scale: 32.0 2023-06-22 15:40:49,056 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:41:15,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1261422.0, ans=0.125 2023-06-22 15:41:52,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1261542.0, ans=0.125 2023-06-22 15:41:54,925 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.033e+02 3.741e+02 4.375e+02 5.438e+02 9.965e+02, threshold=8.750e+02, percent-clipped=0.0 2023-06-22 15:42:08,895 INFO [train.py:996] (2/4) Epoch 7, batch 27300, loss[loss=0.2552, simple_loss=0.345, pruned_loss=0.08267, over 21912.00 frames. ], tot_loss[loss=0.2604, simple_loss=0.3369, pruned_loss=0.092, over 4269375.96 frames. 
], batch size: 372, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:43:27,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1261842.0, ans=0.125 2023-06-22 15:43:48,243 INFO [train.py:996] (2/4) Epoch 7, batch 27350, loss[loss=0.2616, simple_loss=0.3388, pruned_loss=0.09224, over 21811.00 frames. ], tot_loss[loss=0.2617, simple_loss=0.3385, pruned_loss=0.09247, over 4269965.72 frames. ], batch size: 118, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:44:25,276 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.89 vs. limit=15.0 2023-06-22 15:45:15,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.698e+02 3.868e+02 4.861e+02 6.523e+02 1.170e+03, threshold=9.722e+02, percent-clipped=10.0 2023-06-22 15:45:25,598 INFO [train.py:996] (2/4) Epoch 7, batch 27400, loss[loss=0.2286, simple_loss=0.2956, pruned_loss=0.08086, over 21782.00 frames. ], tot_loss[loss=0.2601, simple_loss=0.3357, pruned_loss=0.09228, over 4274513.74 frames. ], batch size: 351, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:45:57,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.36 vs. limit=12.0 2023-06-22 15:46:12,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1262322.0, ans=0.0 2023-06-22 15:47:09,264 INFO [train.py:996] (2/4) Epoch 7, batch 27450, loss[loss=0.24, simple_loss=0.311, pruned_loss=0.08444, over 21317.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.33, pruned_loss=0.09059, over 4268823.63 frames. ], batch size: 211, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:47:36,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1262562.0, ans=0.0 2023-06-22 15:47:56,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1262622.0, ans=0.09899494936611666 2023-06-22 15:48:28,400 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.635e+02 3.611e+02 4.155e+02 4.904e+02 8.641e+02, threshold=8.310e+02, percent-clipped=0.0 2023-06-22 15:48:41,011 INFO [train.py:996] (2/4) Epoch 7, batch 27500, loss[loss=0.2289, simple_loss=0.299, pruned_loss=0.0794, over 21269.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3278, pruned_loss=0.09052, over 4275638.93 frames. ], batch size: 176, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 15:49:59,609 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-22 15:50:23,460 INFO [train.py:996] (2/4) Epoch 7, batch 27550, loss[loss=0.2089, simple_loss=0.2765, pruned_loss=0.0707, over 21754.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3214, pruned_loss=0.08715, over 4279271.79 frames. 
], batch size: 124, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 15:50:37,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1263162.0, ans=0.0 2023-06-22 15:50:44,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1263162.0, ans=0.0 2023-06-22 15:51:47,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.633e+02 3.463e+02 4.160e+02 5.154e+02 1.063e+03, threshold=8.319e+02, percent-clipped=3.0 2023-06-22 15:52:01,055 INFO [train.py:996] (2/4) Epoch 7, batch 27600, loss[loss=0.2756, simple_loss=0.3291, pruned_loss=0.111, over 14840.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3138, pruned_loss=0.08561, over 4277352.98 frames. ], batch size: 60, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:52:41,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1263522.0, ans=10.0 2023-06-22 15:53:06,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1263642.0, ans=0.125 2023-06-22 15:53:32,809 INFO [train.py:996] (2/4) Epoch 7, batch 27650, loss[loss=0.2169, simple_loss=0.2949, pruned_loss=0.06945, over 21858.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3084, pruned_loss=0.08532, over 4270232.34 frames. ], batch size: 316, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:53:42,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1263702.0, ans=0.125 2023-06-22 15:54:58,109 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.477e+02 3.224e+02 3.872e+02 5.377e+02 9.163e+02, threshold=7.744e+02, percent-clipped=1.0 2023-06-22 15:55:10,429 INFO [train.py:996] (2/4) Epoch 7, batch 27700, loss[loss=0.218, simple_loss=0.2987, pruned_loss=0.06866, over 21320.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3077, pruned_loss=0.08286, over 4276212.08 frames. ], batch size: 176, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:55:58,236 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.472e-03 2023-06-22 15:56:01,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1264122.0, ans=0.2 2023-06-22 15:56:07,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1264182.0, ans=0.2 2023-06-22 15:56:31,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.34 vs. limit=22.5 2023-06-22 15:56:53,348 INFO [train.py:996] (2/4) Epoch 7, batch 27750, loss[loss=0.2043, simple_loss=0.3019, pruned_loss=0.05334, over 20911.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3106, pruned_loss=0.08216, over 4278666.29 frames. 
], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:57:00,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1264302.0, ans=0.09899494936611666 2023-06-22 15:57:01,918 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 15:57:12,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1264362.0, ans=0.125 2023-06-22 15:57:28,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1264422.0, ans=0.0 2023-06-22 15:57:49,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1264482.0, ans=0.0 2023-06-22 15:58:05,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1264542.0, ans=0.09899494936611666 2023-06-22 15:58:17,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.542e+02 4.380e+02 5.826e+02 1.163e+03, threshold=8.759e+02, percent-clipped=13.0 2023-06-22 15:58:22,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1264542.0, ans=0.125 2023-06-22 15:58:26,075 INFO [train.py:996] (2/4) Epoch 7, batch 27800, loss[loss=0.2316, simple_loss=0.3002, pruned_loss=0.08148, over 21926.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3099, pruned_loss=0.083, over 4278375.08 frames. ], batch size: 316, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 15:58:56,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1264662.0, ans=0.2 2023-06-22 15:59:10,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1264722.0, ans=0.2 2023-06-22 15:59:24,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1264782.0, ans=0.0 2023-06-22 15:59:51,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1264842.0, ans=0.125 2023-06-22 16:00:09,290 INFO [train.py:996] (2/4) Epoch 7, batch 27850, loss[loss=0.2111, simple_loss=0.2795, pruned_loss=0.07133, over 21164.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.309, pruned_loss=0.08458, over 4284432.61 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:00:18,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-22 16:00:41,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.04 vs. limit=12.0 2023-06-22 16:01:44,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.686e+02 3.358e+02 4.006e+02 5.079e+02 1.358e+03, threshold=8.013e+02, percent-clipped=6.0 2023-06-22 16:01:50,945 INFO [train.py:996] (2/4) Epoch 7, batch 27900, loss[loss=0.2472, simple_loss=0.3218, pruned_loss=0.08635, over 21821.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3207, pruned_loss=0.08704, over 4291296.65 frames. 
], batch size: 112, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 16:02:10,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.72 vs. limit=10.0 2023-06-22 16:02:19,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-22 16:02:37,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1265322.0, ans=0.1 2023-06-22 16:03:30,619 INFO [train.py:996] (2/4) Epoch 7, batch 27950, loss[loss=0.2848, simple_loss=0.3666, pruned_loss=0.1016, over 21684.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3206, pruned_loss=0.08331, over 4283223.99 frames. ], batch size: 441, lr: 4.18e-03, grad_scale: 8.0 2023-06-22 16:03:37,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1265502.0, ans=0.125 2023-06-22 16:03:39,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_ff3.min_abs, batch_count=1265502.0, ans=0.2 2023-06-22 16:03:40,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1265502.0, ans=0.0 2023-06-22 16:04:19,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1265622.0, ans=0.2 2023-06-22 16:04:38,411 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:04:46,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1265682.0, ans=0.125 2023-06-22 16:05:01,854 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.261e+02 3.331e+02 4.215e+02 5.224e+02 1.025e+03, threshold=8.430e+02, percent-clipped=5.0 2023-06-22 16:05:13,096 INFO [train.py:996] (2/4) Epoch 7, batch 28000, loss[loss=0.2075, simple_loss=0.2776, pruned_loss=0.06866, over 21699.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3181, pruned_loss=0.08077, over 4282391.75 frames. ], batch size: 263, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:05:23,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-22 16:06:11,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1265922.0, ans=0.0 2023-06-22 16:06:24,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1265982.0, ans=0.125 2023-06-22 16:06:52,781 INFO [train.py:996] (2/4) Epoch 7, batch 28050, loss[loss=0.2842, simple_loss=0.3721, pruned_loss=0.09815, over 21246.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3154, pruned_loss=0.08197, over 4279640.61 frames. 
], batch size: 548, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:06:57,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1266102.0, ans=0.0 2023-06-22 16:07:06,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1266102.0, ans=0.125 2023-06-22 16:07:06,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1266102.0, ans=0.125 2023-06-22 16:07:29,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1266162.0, ans=0.125 2023-06-22 16:07:34,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=15.0 2023-06-22 16:07:53,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1266282.0, ans=0.125 2023-06-22 16:08:04,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1266282.0, ans=0.0 2023-06-22 16:08:17,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1266342.0, ans=0.125 2023-06-22 16:08:25,037 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 3.562e+02 4.496e+02 6.296e+02 1.484e+03, threshold=8.993e+02, percent-clipped=8.0 2023-06-22 16:08:25,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1266342.0, ans=0.125 2023-06-22 16:08:30,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.83 vs. limit=12.0 2023-06-22 16:08:31,023 INFO [train.py:996] (2/4) Epoch 7, batch 28100, loss[loss=0.2016, simple_loss=0.2573, pruned_loss=0.07297, over 20790.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3138, pruned_loss=0.08196, over 4271972.38 frames. ], batch size: 608, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:09:07,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. limit=10.0 2023-06-22 16:09:09,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1266522.0, ans=0.125 2023-06-22 16:09:33,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1266582.0, ans=0.125 2023-06-22 16:09:34,636 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:09:37,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=22.5 2023-06-22 16:09:57,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=22.5 2023-06-22 16:10:07,970 INFO [train.py:996] (2/4) Epoch 7, batch 28150, loss[loss=0.1857, simple_loss=0.2335, pruned_loss=0.06893, over 20792.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3058, pruned_loss=0.08212, over 4265562.56 frames. 
], batch size: 609, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:10:33,425 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:11:01,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1266822.0, ans=0.1 2023-06-22 16:11:30,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1266942.0, ans=0.0 2023-06-22 16:11:39,700 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.795e+02 3.363e+02 3.938e+02 4.881e+02 1.435e+03, threshold=7.877e+02, percent-clipped=4.0 2023-06-22 16:11:46,177 INFO [train.py:996] (2/4) Epoch 7, batch 28200, loss[loss=0.2465, simple_loss=0.3033, pruned_loss=0.09489, over 21844.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.305, pruned_loss=0.08311, over 4255289.29 frames. ], batch size: 98, lr: 4.18e-03, grad_scale: 16.0 2023-06-22 16:13:07,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1267242.0, ans=0.04949747468305833 2023-06-22 16:13:12,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1267242.0, ans=0.125 2023-06-22 16:13:23,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1267302.0, ans=0.0 2023-06-22 16:13:24,173 INFO [train.py:996] (2/4) Epoch 7, batch 28250, loss[loss=0.2745, simple_loss=0.3277, pruned_loss=0.1106, over 21571.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3103, pruned_loss=0.08661, over 4251732.32 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:13:39,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1267302.0, ans=0.125 2023-06-22 16:13:44,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1267362.0, ans=0.5 2023-06-22 16:14:02,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1267362.0, ans=0.0 2023-06-22 16:14:27,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1267482.0, ans=0.125 2023-06-22 16:14:56,636 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.948e+02 3.937e+02 4.825e+02 6.344e+02 1.441e+03, threshold=9.649e+02, percent-clipped=9.0 2023-06-22 16:15:07,524 INFO [train.py:996] (2/4) Epoch 7, batch 28300, loss[loss=0.2199, simple_loss=0.3091, pruned_loss=0.06534, over 21785.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3073, pruned_loss=0.08398, over 4253075.39 frames. ], batch size: 371, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:15:29,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1267662.0, ans=0.125 2023-06-22 16:15:34,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1267662.0, ans=0.0 2023-06-22 16:15:52,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.18 vs. 
limit=15.0 2023-06-22 16:16:01,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1267722.0, ans=0.0 2023-06-22 16:16:23,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1267782.0, ans=0.125 2023-06-22 16:16:51,633 INFO [train.py:996] (2/4) Epoch 7, batch 28350, loss[loss=0.1978, simple_loss=0.2675, pruned_loss=0.06404, over 21674.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3036, pruned_loss=0.07769, over 4258809.70 frames. ], batch size: 282, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:17:26,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1268022.0, ans=0.125 2023-06-22 16:18:07,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1268142.0, ans=0.1 2023-06-22 16:18:16,784 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.359e+02 3.369e+02 4.514e+02 6.821e+02 1.544e+03, threshold=9.028e+02, percent-clipped=6.0 2023-06-22 16:18:28,277 INFO [train.py:996] (2/4) Epoch 7, batch 28400, loss[loss=0.2647, simple_loss=0.3319, pruned_loss=0.09879, over 21703.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3006, pruned_loss=0.07686, over 4258612.71 frames. ], batch size: 351, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:19:41,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1268382.0, ans=0.125 2023-06-22 16:20:10,440 INFO [train.py:996] (2/4) Epoch 7, batch 28450, loss[loss=0.2739, simple_loss=0.337, pruned_loss=0.1055, over 21830.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.306, pruned_loss=0.0813, over 4258221.14 frames. ], batch size: 414, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:20:45,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1268622.0, ans=0.0 2023-06-22 16:20:51,727 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:21:44,196 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 3.754e+02 4.591e+02 6.046e+02 1.139e+03, threshold=9.182e+02, percent-clipped=3.0 2023-06-22 16:21:49,024 INFO [train.py:996] (2/4) Epoch 7, batch 28500, loss[loss=0.2286, simple_loss=0.2914, pruned_loss=0.0829, over 21199.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3091, pruned_loss=0.08448, over 4270789.18 frames. 
], batch size: 608, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:21:54,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1268802.0, ans=0.0 2023-06-22 16:21:55,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1268802.0, ans=0.125 2023-06-22 16:22:10,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1268862.0, ans=0.1 2023-06-22 16:22:13,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1268862.0, ans=0.05 2023-06-22 16:22:49,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1268982.0, ans=0.0 2023-06-22 16:22:52,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1268982.0, ans=0.0 2023-06-22 16:23:33,163 INFO [train.py:996] (2/4) Epoch 7, batch 28550, loss[loss=0.2725, simple_loss=0.3592, pruned_loss=0.09293, over 21444.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3183, pruned_loss=0.08793, over 4270541.60 frames. ], batch size: 211, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:23:44,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1269102.0, ans=0.1 2023-06-22 16:23:46,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1269102.0, ans=0.125 2023-06-22 16:24:23,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1269222.0, ans=0.125 2023-06-22 16:24:23,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1269222.0, ans=0.2 2023-06-22 16:25:07,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.907e+02 3.553e+02 4.352e+02 5.602e+02 1.151e+03, threshold=8.703e+02, percent-clipped=1.0 2023-06-22 16:25:10,382 INFO [train.py:996] (2/4) Epoch 7, batch 28600, loss[loss=0.3215, simple_loss=0.376, pruned_loss=0.1335, over 21416.00 frames. ], tot_loss[loss=0.252, simple_loss=0.324, pruned_loss=0.09005, over 4271099.67 frames. ], batch size: 509, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:25:28,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1269402.0, ans=0.125 2023-06-22 16:25:55,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.12 vs. 
limit=10.0 2023-06-22 16:25:58,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1269522.0, ans=0.2 2023-06-22 16:26:18,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1269582.0, ans=0.125 2023-06-22 16:26:30,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1269642.0, ans=0.0 2023-06-22 16:26:36,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1269642.0, ans=0.1 2023-06-22 16:26:48,951 INFO [train.py:996] (2/4) Epoch 7, batch 28650, loss[loss=0.2338, simple_loss=0.2935, pruned_loss=0.08702, over 21996.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3179, pruned_loss=0.08863, over 4272597.25 frames. ], batch size: 375, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:27:24,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1269762.0, ans=0.125 2023-06-22 16:28:24,456 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.670e+02 4.036e+02 5.079e+02 7.151e+02 1.408e+03, threshold=1.016e+03, percent-clipped=11.0 2023-06-22 16:28:24,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1269942.0, ans=0.1 2023-06-22 16:28:32,074 INFO [train.py:996] (2/4) Epoch 7, batch 28700, loss[loss=0.225, simple_loss=0.2938, pruned_loss=0.07809, over 20721.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3172, pruned_loss=0.0893, over 4278008.70 frames. ], batch size: 607, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:29:46,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1270182.0, ans=0.1 2023-06-22 16:30:09,815 INFO [train.py:996] (2/4) Epoch 7, batch 28750, loss[loss=0.2392, simple_loss=0.3209, pruned_loss=0.07878, over 21784.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3171, pruned_loss=0.08983, over 4286091.43 frames. ], batch size: 414, lr: 4.17e-03, grad_scale: 8.0 2023-06-22 16:30:25,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1270302.0, ans=0.125 2023-06-22 16:31:06,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1270482.0, ans=0.5 2023-06-22 16:31:20,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.whiten.whitening_limit, batch_count=1270482.0, ans=12.0 2023-06-22 16:31:28,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.69 vs. limit=15.0 2023-06-22 16:31:45,880 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.805e+02 3.311e+02 3.906e+02 4.905e+02 1.219e+03, threshold=7.811e+02, percent-clipped=5.0 2023-06-22 16:31:49,098 INFO [train.py:996] (2/4) Epoch 7, batch 28800, loss[loss=0.3516, simple_loss=0.398, pruned_loss=0.1526, over 21451.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3194, pruned_loss=0.08971, over 4282221.94 frames. 
], batch size: 471, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:31:57,142 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 16:32:04,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1270602.0, ans=0.2 2023-06-22 16:32:34,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1270722.0, ans=0.0 2023-06-22 16:32:58,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1270782.0, ans=0.125 2023-06-22 16:33:28,958 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=15.0 2023-06-22 16:33:30,948 INFO [train.py:996] (2/4) Epoch 7, batch 28850, loss[loss=0.2493, simple_loss=0.3092, pruned_loss=0.09474, over 21311.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3198, pruned_loss=0.09058, over 4284671.84 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:33:34,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1270902.0, ans=0.0 2023-06-22 16:33:37,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1270902.0, ans=0.125 2023-06-22 16:33:37,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1270902.0, ans=0.125 2023-06-22 16:33:44,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1270902.0, ans=0.125 2023-06-22 16:34:39,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1271082.0, ans=0.125 2023-06-22 16:35:03,020 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.913e+02 3.986e+02 4.811e+02 7.732e+02 1.672e+03, threshold=9.622e+02, percent-clipped=22.0 2023-06-22 16:35:06,507 INFO [train.py:996] (2/4) Epoch 7, batch 28900, loss[loss=0.2308, simple_loss=0.302, pruned_loss=0.07983, over 21362.00 frames. ], tot_loss[loss=0.2572, simple_loss=0.3266, pruned_loss=0.0939, over 4284317.33 frames. ], batch size: 159, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:35:17,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1271202.0, ans=0.2 2023-06-22 16:35:19,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1271202.0, ans=0.125 2023-06-22 16:36:29,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=14.26 vs. limit=15.0 2023-06-22 16:36:42,215 INFO [train.py:996] (2/4) Epoch 7, batch 28950, loss[loss=0.2443, simple_loss=0.3479, pruned_loss=0.0704, over 21690.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3277, pruned_loss=0.09303, over 4280930.66 frames. 
], batch size: 414, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:37:46,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1271622.0, ans=0.125 2023-06-22 16:38:07,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1271742.0, ans=0.0 2023-06-22 16:38:23,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.694e+02 3.738e+02 4.662e+02 6.144e+02 1.231e+03, threshold=9.324e+02, percent-clipped=1.0 2023-06-22 16:38:31,537 INFO [train.py:996] (2/4) Epoch 7, batch 29000, loss[loss=0.2847, simple_loss=0.3441, pruned_loss=0.1126, over 20149.00 frames. ], tot_loss[loss=0.2571, simple_loss=0.3302, pruned_loss=0.09203, over 4278266.06 frames. ], batch size: 707, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:39:16,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1271922.0, ans=0.125 2023-06-22 16:39:17,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1271922.0, ans=0.015 2023-06-22 16:39:54,435 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-22 16:40:05,711 INFO [train.py:996] (2/4) Epoch 7, batch 29050, loss[loss=0.244, simple_loss=0.3145, pruned_loss=0.08678, over 21872.00 frames. ], tot_loss[loss=0.2598, simple_loss=0.3306, pruned_loss=0.09445, over 4288467.54 frames. ], batch size: 107, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:40:29,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1272162.0, ans=0.125 2023-06-22 16:40:55,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1272222.0, ans=0.0 2023-06-22 16:41:02,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.06 vs. limit=15.0 2023-06-22 16:41:20,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1272342.0, ans=0.0 2023-06-22 16:41:39,489 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.766e+02 3.571e+02 4.314e+02 5.526e+02 9.643e+02, threshold=8.628e+02, percent-clipped=2.0 2023-06-22 16:41:42,553 INFO [train.py:996] (2/4) Epoch 7, batch 29100, loss[loss=0.2206, simple_loss=0.2798, pruned_loss=0.08072, over 21782.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3211, pruned_loss=0.09113, over 4292622.66 frames. ], batch size: 124, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:41:44,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1272402.0, ans=0.0 2023-06-22 16:41:59,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=15.0 2023-06-22 16:43:19,106 INFO [train.py:996] (2/4) Epoch 7, batch 29150, loss[loss=0.2761, simple_loss=0.3502, pruned_loss=0.101, over 21518.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3206, pruned_loss=0.08929, over 4275724.83 frames. 
], batch size: 389, lr: 4.17e-03, grad_scale: 16.0 2023-06-22 16:43:19,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1272702.0, ans=0.0 2023-06-22 16:43:37,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1272762.0, ans=0.1 2023-06-22 16:44:04,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-22 16:44:06,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.79 vs. limit=15.0 2023-06-22 16:44:16,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1272882.0, ans=0.05 2023-06-22 16:44:51,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1272942.0, ans=0.0 2023-06-22 16:44:52,748 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.751e+02 3.598e+02 4.101e+02 5.513e+02 1.275e+03, threshold=8.201e+02, percent-clipped=6.0 2023-06-22 16:44:55,700 INFO [train.py:996] (2/4) Epoch 7, batch 29200, loss[loss=0.2284, simple_loss=0.2826, pruned_loss=0.08716, over 21506.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3154, pruned_loss=0.08816, over 4269942.51 frames. ], batch size: 441, lr: 4.17e-03, grad_scale: 32.0 2023-06-22 16:45:39,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1273122.0, ans=0.125 2023-06-22 16:45:46,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1273122.0, ans=0.5 2023-06-22 16:45:48,591 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-22 16:46:13,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1273242.0, ans=0.125 2023-06-22 16:46:31,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1273242.0, ans=0.1 2023-06-22 16:46:35,591 INFO [train.py:996] (2/4) Epoch 7, batch 29250, loss[loss=0.2556, simple_loss=0.3482, pruned_loss=0.08148, over 21620.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.313, pruned_loss=0.0853, over 4263114.57 frames. 
], batch size: 442, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:47:23,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1273422.0, ans=0.125 2023-06-22 16:47:29,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1273422.0, ans=0.0 2023-06-22 16:48:10,631 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.544e+02 3.452e+02 4.218e+02 5.407e+02 1.140e+03, threshold=8.437e+02, percent-clipped=6.0 2023-06-22 16:48:11,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1273542.0, ans=0.125 2023-06-22 16:48:13,949 INFO [train.py:996] (2/4) Epoch 7, batch 29300, loss[loss=0.219, simple_loss=0.2911, pruned_loss=0.07341, over 21488.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3155, pruned_loss=0.0849, over 4264975.66 frames. ], batch size: 389, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:48:48,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1273662.0, ans=0.125 2023-06-22 16:49:07,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1273722.0, ans=0.2 2023-06-22 16:49:36,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.78 vs. limit=6.0 2023-06-22 16:49:46,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1273842.0, ans=0.2 2023-06-22 16:49:52,297 INFO [train.py:996] (2/4) Epoch 7, batch 29350, loss[loss=0.1853, simple_loss=0.2528, pruned_loss=0.05888, over 21574.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3103, pruned_loss=0.08381, over 4252937.17 frames. ], batch size: 263, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:50:31,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1274022.0, ans=0.02 2023-06-22 16:50:43,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1274022.0, ans=0.125 2023-06-22 16:51:29,115 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.749e+02 3.690e+02 4.735e+02 5.967e+02 1.066e+03, threshold=9.470e+02, percent-clipped=8.0 2023-06-22 16:51:30,498 INFO [train.py:996] (2/4) Epoch 7, batch 29400, loss[loss=0.2369, simple_loss=0.3377, pruned_loss=0.06802, over 20782.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3081, pruned_loss=0.08121, over 4249964.12 frames. ], batch size: 609, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:51:38,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1274202.0, ans=0.125 2023-06-22 16:51:44,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-22 16:52:34,119 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-22 16:53:09,780 INFO [train.py:996] (2/4) Epoch 7, batch 29450, loss[loss=0.2659, simple_loss=0.3413, pruned_loss=0.09521, over 21907.00 frames. 
], tot_loss[loss=0.2318, simple_loss=0.3049, pruned_loss=0.07937, over 4251173.28 frames. ], batch size: 372, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:53:18,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1274502.0, ans=0.0 2023-06-22 16:53:21,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1274502.0, ans=0.125 2023-06-22 16:53:27,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1274562.0, ans=0.125 2023-06-22 16:54:36,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.51 vs. limit=15.0 2023-06-22 16:54:41,847 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.353e+02 4.158e+02 5.429e+02 7.361e+02 1.574e+03, threshold=1.086e+03, percent-clipped=7.0 2023-06-22 16:54:42,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1274802.0, ans=0.1 2023-06-22 16:54:43,558 INFO [train.py:996] (2/4) Epoch 7, batch 29500, loss[loss=0.2394, simple_loss=0.3526, pruned_loss=0.06309, over 19758.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3116, pruned_loss=0.0833, over 4259642.56 frames. ], batch size: 703, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:54:54,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1274802.0, ans=0.0 2023-06-22 16:55:55,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.73 vs. limit=12.0 2023-06-22 16:56:19,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=15.0 2023-06-22 16:56:21,906 INFO [train.py:996] (2/4) Epoch 7, batch 29550, loss[loss=0.226, simple_loss=0.2857, pruned_loss=0.08312, over 21602.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3103, pruned_loss=0.08482, over 4267790.90 frames. ], batch size: 548, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 16:56:28,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1275102.0, ans=0.2 2023-06-22 16:56:33,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1275102.0, ans=0.0 2023-06-22 16:57:44,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1275342.0, ans=0.125 2023-06-22 16:57:46,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1275342.0, ans=0.2 2023-06-22 16:57:48,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1275342.0, ans=0.0 2023-06-22 16:57:59,476 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.645e+02 3.724e+02 4.638e+02 7.025e+02 1.809e+03, threshold=9.276e+02, percent-clipped=5.0 2023-06-22 16:58:01,072 INFO [train.py:996] (2/4) Epoch 7, batch 29600, loss[loss=0.2555, simple_loss=0.3428, pruned_loss=0.08412, over 21788.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3171, pruned_loss=0.08717, over 4275244.05 frames. 
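The optim.py:471 lines report grad-norm quartiles, a clipping threshold and a percent-clipped figure. In the entries above the threshold is consistently 2.0 times the middle quartile (the logged Clipping_scale), so a plausible reading is: keep a window of recent gradient norms, report its quartiles, and clip whenever the current norm exceeds clipping_scale times the median. A sketch under that assumption; the class and attribute names are invented for illustration:

from collections import deque
import torch

class GradNormClipper:
    def __init__(self, clipping_scale: float = 2.0, window: int = 128):
        self.clipping_scale = clipping_scale
        self.norms = deque(maxlen=window)   # recent total gradient norms
        self.num_clipped = 0
        self.num_steps = 0

    def __call__(self, parameters) -> torch.Tensor:
        params = [p for p in parameters if p.grad is not None]
        total_norm = torch.norm(
            torch.stack([p.grad.detach().norm(2) for p in params]), 2)
        self.norms.append(total_norm.item())
        q = torch.quantile(torch.tensor(list(self.norms)),
                           torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        threshold = self.clipping_scale * q[2].item()   # scale x median
        self.num_steps += 1
        if total_norm.item() > threshold:
            self.num_clipped += 1
            for p in params:
                p.grad.detach().mul_(threshold / total_norm.item())
        pct = 100.0 * self.num_clipped / self.num_steps
        print(f"grad-norm quartiles {q.tolist()}, threshold={threshold:.3e}, "
              f"percent-clipped={pct:.1f}")
        return total_norm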
], batch size: 298, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 16:58:04,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1275402.0, ans=0.1 2023-06-22 16:59:30,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1275642.0, ans=0.125 2023-06-22 16:59:37,930 INFO [train.py:996] (2/4) Epoch 7, batch 29650, loss[loss=0.2158, simple_loss=0.2799, pruned_loss=0.07582, over 21676.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3162, pruned_loss=0.08439, over 4280026.96 frames. ], batch size: 230, lr: 4.16e-03, grad_scale: 32.0 2023-06-22 17:00:25,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1275822.0, ans=0.025 2023-06-22 17:00:25,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1275822.0, ans=0.05 2023-06-22 17:00:31,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1275822.0, ans=0.07 2023-06-22 17:00:48,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1275882.0, ans=0.0 2023-06-22 17:01:16,734 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.600e+02 3.826e+02 4.944e+02 6.202e+02 1.000e+03, threshold=9.888e+02, percent-clipped=1.0 2023-06-22 17:01:16,755 INFO [train.py:996] (2/4) Epoch 7, batch 29700, loss[loss=0.2914, simple_loss=0.3879, pruned_loss=0.09746, over 21792.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3169, pruned_loss=0.08476, over 4287836.37 frames. ], batch size: 282, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:01:45,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1276062.0, ans=0.125 2023-06-22 17:01:45,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1276062.0, ans=0.125 2023-06-22 17:02:35,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1276182.0, ans=0.0 2023-06-22 17:02:36,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1276182.0, ans=0.0 2023-06-22 17:02:38,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1276242.0, ans=0.125 2023-06-22 17:02:55,335 INFO [train.py:996] (2/4) Epoch 7, batch 29750, loss[loss=0.2834, simple_loss=0.3675, pruned_loss=0.09967, over 21727.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3223, pruned_loss=0.08379, over 4282213.30 frames. 
], batch size: 441, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:04:13,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1276482.0, ans=0.125 2023-06-22 17:04:18,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1276542.0, ans=0.125 2023-06-22 17:04:25,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1276542.0, ans=0.125 2023-06-22 17:04:32,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.419e+02 3.988e+02 5.118e+02 1.049e+03, threshold=7.976e+02, percent-clipped=2.0 2023-06-22 17:04:32,189 INFO [train.py:996] (2/4) Epoch 7, batch 29800, loss[loss=0.2595, simple_loss=0.3241, pruned_loss=0.09746, over 21227.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3229, pruned_loss=0.0842, over 4286132.42 frames. ], batch size: 143, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:04:38,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0 2023-06-22 17:04:55,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1276662.0, ans=0.0 2023-06-22 17:05:20,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.54 vs. limit=22.5 2023-06-22 17:05:47,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1276782.0, ans=0.1 2023-06-22 17:06:03,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.06 vs. limit=22.5 2023-06-22 17:06:06,776 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.21 vs. limit=15.0 2023-06-22 17:06:10,672 INFO [train.py:996] (2/4) Epoch 7, batch 29850, loss[loss=0.1771, simple_loss=0.2589, pruned_loss=0.04766, over 15934.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3187, pruned_loss=0.08314, over 4276276.77 frames. ], batch size: 60, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:06:26,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1276962.0, ans=0.2 2023-06-22 17:06:37,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=15.0 2023-06-22 17:07:27,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff2.min_abs, batch_count=1277082.0, ans=0.1 2023-06-22 17:07:48,854 INFO [train.py:996] (2/4) Epoch 7, batch 29900, loss[loss=0.2403, simple_loss=0.3075, pruned_loss=0.08649, over 21846.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3171, pruned_loss=0.08479, over 4288281.25 frames. 
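The scaling.py:962 entries compare a whitening metric against a limit for various activations, with num_groups and num_channels attached. One plausible whiteness measure, sketched below, is how far the per-group channel covariance is from a multiple of the identity: it equals 1.0 for perfectly decorrelated, equal-variance channels and grows as channels become correlated. This definition is an assumption for illustration; the recipe's own metric may be computed differently:

import torch

def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> torch.Tensor:
    """x: (num_frames, num_channels).  Returns a scalar >= 1."""
    n, c = x.shape
    assert c % num_groups == 0
    x = x.reshape(n, num_groups, c // num_groups)
    x = x - x.mean(dim=0, keepdim=True)
    metrics = []
    for g in range(num_groups):
        cov = x[:, g, :].t() @ x[:, g, :] / n            # channel covariance
        num = (c // num_groups) * (cov @ cov).diagonal().sum()
        den = cov.diagonal().sum() ** 2 + 1e-20
        metrics.append(num / den)
    return torch.stack(metrics).mean()

# Nearly-white input gives a metric close to 1; correlated channels push it up.
print(whitening_metric(torch.randn(1000, 256)).item())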
], batch size: 247, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:07:50,463 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.619e+02 3.325e+02 3.983e+02 5.006e+02 1.426e+03, threshold=7.966e+02, percent-clipped=5.0 2023-06-22 17:07:59,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1277202.0, ans=0.0 2023-06-22 17:08:29,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.79 vs. limit=15.0 2023-06-22 17:08:58,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1277382.0, ans=0.125 2023-06-22 17:09:33,572 INFO [train.py:996] (2/4) Epoch 7, batch 29950, loss[loss=0.2921, simple_loss=0.3577, pruned_loss=0.1133, over 21275.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3217, pruned_loss=0.08889, over 4288024.59 frames. ], batch size: 159, lr: 4.16e-03, grad_scale: 8.0 2023-06-22 17:10:01,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1277562.0, ans=0.125 2023-06-22 17:10:41,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1277682.0, ans=0.2 2023-06-22 17:11:13,196 INFO [train.py:996] (2/4) Epoch 7, batch 30000, loss[loss=0.2172, simple_loss=0.3174, pruned_loss=0.05844, over 21662.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3224, pruned_loss=0.08815, over 4286655.49 frames. ], batch size: 414, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:11:13,196 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 17:11:34,233 INFO [train.py:1028] (2/4) Epoch 7, validation: loss=0.2473, simple_loss=0.3461, pruned_loss=0.0743, over 1796401.00 frames. 2023-06-22 17:11:34,234 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 17:11:36,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 3.810e+02 4.424e+02 5.666e+02 1.321e+03, threshold=8.847e+02, percent-clipped=8.0 2023-06-22 17:11:38,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1277802.0, ans=0.125 2023-06-22 17:11:41,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1277802.0, ans=0.125 2023-06-22 17:12:01,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1277862.0, ans=0.125 2023-06-22 17:12:03,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1277862.0, ans=0.125 2023-06-22 17:12:04,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.42 vs. limit=22.5 2023-06-22 17:12:05,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1277862.0, ans=0.125 2023-06-22 17:12:09,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.99 vs. 
limit=15.0 2023-06-22 17:12:16,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1277922.0, ans=0.0 2023-06-22 17:12:27,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1277922.0, ans=0.125 2023-06-22 17:12:47,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1277982.0, ans=0.125 2023-06-22 17:12:58,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1278042.0, ans=0.02 2023-06-22 17:13:05,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1278042.0, ans=0.2 2023-06-22 17:13:26,176 INFO [train.py:996] (2/4) Epoch 7, batch 30050, loss[loss=0.2989, simple_loss=0.4142, pruned_loss=0.0918, over 21185.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3268, pruned_loss=0.08605, over 4282640.55 frames. ], batch size: 548, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:13:29,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-22 17:13:31,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1278102.0, ans=0.2 2023-06-22 17:13:37,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1278102.0, ans=0.1 2023-06-22 17:14:24,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1278282.0, ans=0.95 2023-06-22 17:14:26,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1278282.0, ans=0.2 2023-06-22 17:14:45,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1278342.0, ans=0.125 2023-06-22 17:15:03,300 INFO [train.py:996] (2/4) Epoch 7, batch 30100, loss[loss=0.2302, simple_loss=0.2834, pruned_loss=0.08855, over 21223.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3267, pruned_loss=0.08634, over 4271734.64 frames. ], batch size: 159, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:15:04,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 3.663e+02 4.877e+02 6.196e+02 1.469e+03, threshold=9.754e+02, percent-clipped=9.0 2023-06-22 17:15:24,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1278462.0, ans=0.0 2023-06-22 17:16:41,779 INFO [train.py:996] (2/4) Epoch 7, batch 30150, loss[loss=0.2581, simple_loss=0.3283, pruned_loss=0.09394, over 21795.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3221, pruned_loss=0.08741, over 4266178.67 frames. 
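A little above, at batch 30000, the log switches to "Computing validation loss", reports an averaged validation loss over ~1.8M frames and prints the peak CUDA memory. A hedged sketch of such a periodic validation pass; compute_loss is a stand-in for the actual criterion and not a function from the training script:

import torch

def run_validation(model, dev_loader, compute_loss, device) -> float:
    model.eval()
    tot_loss, tot_frames = 0.0, 0.0
    with torch.no_grad():
        for batch in dev_loader:
            loss, num_frames = compute_loss(model, batch, device)
            tot_loss += loss.item() * num_frames
            tot_frames += num_frames
    model.train()
    avg = tot_loss / max(tot_frames, 1.0)
    max_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    print(f"validation: loss={avg:.4f}, over {tot_frames:.2f} frames.")
    print(f"Maximum memory allocated so far is {max_mb}MB")
    return avg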
], batch size: 333, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:16:48,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1278702.0, ans=0.125 2023-06-22 17:16:57,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1278762.0, ans=0.0 2023-06-22 17:17:01,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1278762.0, ans=0.125 2023-06-22 17:17:36,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1278822.0, ans=0.125 2023-06-22 17:17:50,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1278882.0, ans=0.0 2023-06-22 17:17:57,000 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 17:18:17,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1278942.0, ans=0.125 2023-06-22 17:18:24,037 INFO [train.py:996] (2/4) Epoch 7, batch 30200, loss[loss=0.2079, simple_loss=0.2955, pruned_loss=0.0601, over 21421.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3241, pruned_loss=0.086, over 4268580.69 frames. ], batch size: 211, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:18:25,699 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.591e+02 3.500e+02 4.327e+02 6.195e+02 1.104e+03, threshold=8.654e+02, percent-clipped=5.0 2023-06-22 17:18:42,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1279002.0, ans=0.1 2023-06-22 17:19:01,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1279062.0, ans=0.125 2023-06-22 17:19:07,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1279062.0, ans=0.125 2023-06-22 17:19:38,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1279182.0, ans=0.0 2023-06-22 17:19:42,652 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-22 17:19:44,133 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-22 17:20:09,052 INFO [train.py:996] (2/4) Epoch 7, batch 30250, loss[loss=0.2486, simple_loss=0.3307, pruned_loss=0.08323, over 21255.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3328, pruned_loss=0.08977, over 4271451.64 frames. 
], batch size: 143, lr: 4.16e-03, grad_scale: 16.0 2023-06-22 17:20:39,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1279362.0, ans=0.125 2023-06-22 17:20:47,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1279362.0, ans=0.125 2023-06-22 17:21:24,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1279482.0, ans=0.0 2023-06-22 17:21:39,581 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.22 vs. limit=15.0 2023-06-22 17:21:41,074 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-22 17:21:45,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-22 17:21:48,211 INFO [train.py:996] (2/4) Epoch 7, batch 30300, loss[loss=0.2015, simple_loss=0.2647, pruned_loss=0.0692, over 21513.00 frames. ], tot_loss[loss=0.2537, simple_loss=0.3293, pruned_loss=0.08909, over 4267385.04 frames. ], batch size: 263, lr: 4.15e-03, grad_scale: 16.0 2023-06-22 17:21:49,809 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.688e+02 4.197e+02 5.234e+02 7.319e+02 1.495e+03, threshold=1.047e+03, percent-clipped=13.0 2023-06-22 17:21:50,860 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.35 vs. limit=15.0 2023-06-22 17:21:58,976 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.63 vs. limit=15.0 2023-06-22 17:22:23,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.07 vs. limit=15.0 2023-06-22 17:22:32,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-22 17:22:33,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1279722.0, ans=0.07 2023-06-22 17:22:44,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1279722.0, ans=0.125 2023-06-22 17:23:11,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1279842.0, ans=0.2 2023-06-22 17:23:33,742 INFO [train.py:996] (2/4) Epoch 7, batch 30350, loss[loss=0.2487, simple_loss=0.3359, pruned_loss=0.08082, over 21816.00 frames. ], tot_loss[loss=0.257, simple_loss=0.3319, pruned_loss=0.09104, over 4274762.82 frames. ], batch size: 352, lr: 4.15e-03, grad_scale: 16.0 2023-06-22 17:24:02,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1279962.0, ans=0.0 2023-06-22 17:24:05,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.30 vs. 
limit=15.0 2023-06-22 17:24:18,527 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-22 17:24:46,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1280142.0, ans=0.0 2023-06-22 17:24:52,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1280142.0, ans=0.125 2023-06-22 17:24:56,705 INFO [train.py:996] (2/4) Epoch 7, batch 30400, loss[loss=0.2309, simple_loss=0.276, pruned_loss=0.0929, over 20268.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3252, pruned_loss=0.0893, over 4261811.51 frames. ], batch size: 703, lr: 4.15e-03, grad_scale: 32.0 2023-06-22 17:24:58,178 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.840e+02 4.001e+02 6.030e+02 8.810e+02 1.556e+03, threshold=1.206e+03, percent-clipped=18.0 2023-06-22 17:25:17,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1280262.0, ans=0.0 2023-06-22 17:25:25,257 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-22 17:25:31,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1280322.0, ans=0.0 2023-06-22 17:25:59,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1280382.0, ans=0.1 2023-06-22 17:26:14,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1280442.0, ans=0.1 2023-06-22 17:26:21,152 INFO [train.py:996] (2/4) Epoch 7, batch 30450, loss[loss=0.2881, simple_loss=0.4069, pruned_loss=0.08462, over 19877.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3268, pruned_loss=0.08871, over 4202104.13 frames. ], batch size: 702, lr: 4.15e-03, grad_scale: 32.0 2023-06-22 17:26:29,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1280502.0, ans=0.125 2023-06-22 17:26:48,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1280562.0, ans=0.0 2023-06-22 17:26:57,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1280622.0, ans=0.125 2023-06-22 17:27:18,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1280682.0, ans=0.0 2023-06-22 17:27:21,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1280682.0, ans=0.07 2023-06-22 17:27:26,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1280742.0, ans=0.125 2023-06-22 17:29:06,163 INFO [train.py:996] (2/4) Epoch 8, batch 0, loss[loss=0.2267, simple_loss=0.2941, pruned_loss=0.07963, over 21594.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2941, pruned_loss=0.07963, over 21594.00 frames. 
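The grad_scale value logged with each batch (8.0, 16.0, 32.0 in this stretch) behaves like a dynamic mixed-precision loss scale that is halved on overflow and grown after a run of stable steps. Whether the recipe uses PyTorch's GradScaler or its own scaler is not visible in these lines; the sketch below only shows the standard pattern that produces such values, with batch_to_loss as a hypothetical helper:

import torch

def train_step(model, batch_to_loss, batch, optimizer, scaler):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # fp16 forward pass
        loss = batch_to_loss(model, batch)
    scaler.scale(loss).backward()            # scaled backward pass
    scaler.step(optimizer)                   # skipped internally on overflow
    scaler.update()                          # grow/shrink the scale dynamically
    return loss.detach(), scaler.get_scale() # the value logged as grad_scale

scaler = torch.cuda.amp.GradScaler(init_scale=16.0, growth_interval=2000)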
], batch size: 298, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:29:06,163 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 17:29:21,699 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2437, simple_loss=0.3524, pruned_loss=0.06749, over 1796401.00 frames. 2023-06-22 17:29:21,700 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 17:29:27,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1280772.0, ans=0.125 2023-06-22 17:29:30,708 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.318e+02 7.236e+02 1.078e+03 1.767e+03 4.535e+03, threshold=2.157e+03, percent-clipped=44.0 2023-06-22 17:31:00,093 INFO [train.py:996] (2/4) Epoch 8, batch 50, loss[loss=0.3548, simple_loss=0.4149, pruned_loss=0.1473, over 21479.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3347, pruned_loss=0.08934, over 972386.79 frames. ], batch size: 471, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:31:05,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1281072.0, ans=0.125 2023-06-22 17:31:15,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.21 vs. limit=12.0 2023-06-22 17:31:49,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1281192.0, ans=0.125 2023-06-22 17:32:04,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1281252.0, ans=0.1 2023-06-22 17:32:07,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1281252.0, ans=0.125 2023-06-22 17:32:33,421 INFO [train.py:996] (2/4) Epoch 8, batch 100, loss[loss=0.2531, simple_loss=0.3565, pruned_loss=0.07479, over 21257.00 frames. ], tot_loss[loss=0.2663, simple_loss=0.3486, pruned_loss=0.09194, over 1701285.24 frames. ], batch size: 176, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:32:44,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=12.0 2023-06-22 17:32:44,635 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.771e+02 3.609e+02 4.818e+02 6.662e+02 2.202e+03, threshold=9.637e+02, percent-clipped=1.0 2023-06-22 17:32:56,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1281432.0, ans=15.0 2023-06-22 17:34:06,762 INFO [train.py:996] (2/4) Epoch 8, batch 150, loss[loss=0.2517, simple_loss=0.3184, pruned_loss=0.0925, over 21876.00 frames. ], tot_loss[loss=0.2666, simple_loss=0.3505, pruned_loss=0.09129, over 2273954.27 frames. ], batch size: 118, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:34:10,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1281672.0, ans=0.125 2023-06-22 17:34:17,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. 
limit=22.5 2023-06-22 17:34:22,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1281732.0, ans=0.125 2023-06-22 17:34:27,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1281732.0, ans=0.0 2023-06-22 17:35:14,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1281852.0, ans=0.125 2023-06-22 17:35:33,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1281912.0, ans=0.125 2023-06-22 17:35:39,222 INFO [train.py:996] (2/4) Epoch 8, batch 200, loss[loss=0.2509, simple_loss=0.3684, pruned_loss=0.06666, over 20736.00 frames. ], tot_loss[loss=0.2582, simple_loss=0.3412, pruned_loss=0.08762, over 2718984.31 frames. ], batch size: 607, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:35:49,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.926e+02 4.076e+02 5.203e+02 6.716e+02 1.490e+03, threshold=1.041e+03, percent-clipped=7.0 2023-06-22 17:35:53,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1282032.0, ans=0.0 2023-06-22 17:36:33,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-22 17:37:18,873 INFO [train.py:996] (2/4) Epoch 8, batch 250, loss[loss=0.2532, simple_loss=0.3201, pruned_loss=0.09315, over 21799.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3363, pruned_loss=0.08746, over 3069743.35 frames. ], batch size: 298, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:37:46,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-22 17:38:14,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1282392.0, ans=0.0 2023-06-22 17:38:42,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1282512.0, ans=0.2 2023-06-22 17:38:54,764 INFO [train.py:996] (2/4) Epoch 8, batch 300, loss[loss=0.2565, simple_loss=0.3626, pruned_loss=0.07523, over 19826.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3287, pruned_loss=0.08683, over 3332820.83 frames. 
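Within epoch 7 the learning rate creeps down with the global batch count (4.17e-03 to 4.15e-03), and it steps down again to 3.86e-03 as epoch 8 begins. One schedule with that shape decays with both the batch index and the epoch index; the formula and the constants below (base_lr, lr_batches, lr_epochs) are plausible assumptions for illustration, not values read off this section of the log:

def eden_like_lr(batch: int, epoch: float,
                 base_lr: float = 0.045,
                 lr_batches: float = 7500.0,
                 lr_epochs: float = 1.5) -> float:
    """Decays smoothly in batch count and steps down with each epoch."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

for epoch in (7, 8):
    print(epoch, eden_like_lr(batch=1_280_000, epoch=epoch))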
], batch size: 703, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:38:58,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1282572.0, ans=0.07 2023-06-22 17:39:00,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1282572.0, ans=0.125 2023-06-22 17:39:06,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.679e+02 4.058e+02 5.447e+02 7.307e+02 1.512e+03, threshold=1.089e+03, percent-clipped=7.0 2023-06-22 17:39:06,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1282572.0, ans=0.0 2023-06-22 17:39:58,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1282752.0, ans=0.0 2023-06-22 17:40:31,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1282812.0, ans=0.125 2023-06-22 17:40:35,695 INFO [train.py:996] (2/4) Epoch 8, batch 350, loss[loss=0.2283, simple_loss=0.2827, pruned_loss=0.08696, over 21348.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3225, pruned_loss=0.08652, over 3537761.38 frames. ], batch size: 160, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:40:50,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1282932.0, ans=0.125 2023-06-22 17:41:33,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1282992.0, ans=0.2 2023-06-22 17:41:54,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1283052.0, ans=0.015 2023-06-22 17:42:14,804 INFO [train.py:996] (2/4) Epoch 8, batch 400, loss[loss=0.259, simple_loss=0.3692, pruned_loss=0.07444, over 20856.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3178, pruned_loss=0.08616, over 3695441.38 frames. 
], batch size: 608, lr: 3.86e-03, grad_scale: 32.0 2023-06-22 17:42:15,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1283172.0, ans=0.0 2023-06-22 17:42:16,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1283172.0, ans=0.0 2023-06-22 17:42:25,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.922e+02 3.737e+02 4.920e+02 6.486e+02 1.177e+03, threshold=9.840e+02, percent-clipped=3.0 2023-06-22 17:43:08,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1283292.0, ans=0.1 2023-06-22 17:43:19,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1283352.0, ans=0.125 2023-06-22 17:43:29,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1283352.0, ans=0.0 2023-06-22 17:43:36,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1283412.0, ans=0.125 2023-06-22 17:43:48,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1283412.0, ans=0.125 2023-06-22 17:43:54,175 INFO [train.py:996] (2/4) Epoch 8, batch 450, loss[loss=0.1805, simple_loss=0.2463, pruned_loss=0.05738, over 21128.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3145, pruned_loss=0.08303, over 3826731.22 frames. ], batch size: 176, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:45:14,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-22 17:45:32,285 INFO [train.py:996] (2/4) Epoch 8, batch 500, loss[loss=0.2154, simple_loss=0.3106, pruned_loss=0.06005, over 21276.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3161, pruned_loss=0.08214, over 3914297.17 frames. ], batch size: 131, lr: 3.86e-03, grad_scale: 16.0 2023-06-22 17:45:59,920 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.804e+02 3.923e+02 5.554e+02 7.720e+02 1.831e+03, threshold=1.111e+03, percent-clipped=13.0 2023-06-22 17:46:04,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1283832.0, ans=0.2 2023-06-22 17:46:10,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-22 17:46:11,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1283832.0, ans=0.0 2023-06-22 17:46:13,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.93 vs. limit=15.0 2023-06-22 17:46:57,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1284012.0, ans=0.0 2023-06-22 17:47:15,026 INFO [train.py:996] (2/4) Epoch 8, batch 550, loss[loss=0.2308, simple_loss=0.2836, pruned_loss=0.08904, over 21309.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3172, pruned_loss=0.08182, over 4001350.58 frames. 
], batch size: 144, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:48:11,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1284252.0, ans=0.0 2023-06-22 17:48:11,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.60 vs. limit=12.0 2023-06-22 17:48:23,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1284252.0, ans=0.0 2023-06-22 17:48:46,261 INFO [train.py:996] (2/4) Epoch 8, batch 600, loss[loss=0.2095, simple_loss=0.2645, pruned_loss=0.07725, over 20000.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3185, pruned_loss=0.08261, over 4063413.84 frames. ], batch size: 704, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:48:48,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1284372.0, ans=0.125 2023-06-22 17:48:50,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1284372.0, ans=0.125 2023-06-22 17:49:08,676 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.731e+02 3.844e+02 4.934e+02 7.871e+02 2.167e+03, threshold=9.868e+02, percent-clipped=19.0 2023-06-22 17:49:28,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-22 17:49:30,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1284492.0, ans=0.125 2023-06-22 17:50:23,342 INFO [train.py:996] (2/4) Epoch 8, batch 650, loss[loss=0.2028, simple_loss=0.273, pruned_loss=0.06636, over 21538.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3206, pruned_loss=0.083, over 4120814.20 frames. ], batch size: 230, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:50:25,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1284672.0, ans=0.0 2023-06-22 17:50:40,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1284672.0, ans=0.09899494936611666 2023-06-22 17:50:42,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1284672.0, ans=0.125 2023-06-22 17:50:48,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1284732.0, ans=0.0 2023-06-22 17:50:49,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1284732.0, ans=0.2 2023-06-22 17:51:19,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1284852.0, ans=0.125 2023-06-22 17:51:49,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1284912.0, ans=0.1 2023-06-22 17:51:56,384 INFO [train.py:996] (2/4) Epoch 8, batch 700, loss[loss=0.2539, simple_loss=0.3256, pruned_loss=0.09107, over 21768.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3223, pruned_loss=0.08278, over 4152273.14 frames. 
], batch size: 351, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:52:20,172 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.941e+02 4.186e+02 5.348e+02 7.319e+02 1.415e+03, threshold=1.070e+03, percent-clipped=6.0 2023-06-22 17:52:23,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1285032.0, ans=0.1 2023-06-22 17:52:58,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1285152.0, ans=0.1 2023-06-22 17:53:17,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1285212.0, ans=0.07 2023-06-22 17:53:29,828 INFO [train.py:996] (2/4) Epoch 8, batch 750, loss[loss=0.2539, simple_loss=0.3058, pruned_loss=0.101, over 21890.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3206, pruned_loss=0.08371, over 4187697.81 frames. ], batch size: 98, lr: 3.85e-03, grad_scale: 8.0 2023-06-22 17:53:31,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1285272.0, ans=0.0 2023-06-22 17:54:09,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1285332.0, ans=0.1 2023-06-22 17:55:07,716 INFO [train.py:996] (2/4) Epoch 8, batch 800, loss[loss=0.2132, simple_loss=0.2782, pruned_loss=0.07407, over 21777.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3177, pruned_loss=0.08439, over 4203316.89 frames. ], batch size: 351, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:55:35,847 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.791e+02 4.052e+02 4.658e+02 6.687e+02 1.387e+03, threshold=9.317e+02, percent-clipped=3.0 2023-06-22 17:55:49,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-22 17:56:27,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.30 vs. limit=10.0 2023-06-22 17:56:40,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1285812.0, ans=0.0 2023-06-22 17:56:54,333 INFO [train.py:996] (2/4) Epoch 8, batch 850, loss[loss=0.2118, simple_loss=0.282, pruned_loss=0.07086, over 21517.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3155, pruned_loss=0.08436, over 4224299.33 frames. ], batch size: 212, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:57:04,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1285872.0, ans=0.1 2023-06-22 17:57:31,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.18 vs. 
limit=15.0 2023-06-22 17:58:07,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1286112.0, ans=0.125 2023-06-22 17:58:20,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1286112.0, ans=0.0 2023-06-22 17:58:32,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1286172.0, ans=0.1 2023-06-22 17:58:33,077 INFO [train.py:996] (2/4) Epoch 8, batch 900, loss[loss=0.205, simple_loss=0.2766, pruned_loss=0.06671, over 21491.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3125, pruned_loss=0.08396, over 4237774.54 frames. ], batch size: 194, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 17:58:47,802 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.900e+02 3.949e+02 5.086e+02 6.787e+02 1.769e+03, threshold=1.017e+03, percent-clipped=9.0 2023-06-22 17:58:50,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-22 17:58:52,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.28 vs. limit=15.0 2023-06-22 18:00:12,904 INFO [train.py:996] (2/4) Epoch 8, batch 950, loss[loss=0.2387, simple_loss=0.3085, pruned_loss=0.08446, over 21713.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3096, pruned_loss=0.08312, over 4244046.16 frames. ], batch size: 230, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:00:14,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1286472.0, ans=0.125 2023-06-22 18:01:19,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1286712.0, ans=0.125 2023-06-22 18:01:20,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1286712.0, ans=0.1 2023-06-22 18:01:48,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=22.5 2023-06-22 18:01:51,046 INFO [train.py:996] (2/4) Epoch 8, batch 1000, loss[loss=0.2582, simple_loss=0.3265, pruned_loss=0.09498, over 21674.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3103, pruned_loss=0.08344, over 4254403.81 frames. ], batch size: 230, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:02:05,739 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 3.565e+02 4.375e+02 6.228e+02 1.305e+03, threshold=8.750e+02, percent-clipped=2.0 2023-06-22 18:03:04,387 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:03:22,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1287012.0, ans=15.0 2023-06-22 18:03:32,510 INFO [train.py:996] (2/4) Epoch 8, batch 1050, loss[loss=0.2142, simple_loss=0.306, pruned_loss=0.06123, over 21759.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3101, pruned_loss=0.08292, over 4267316.91 frames. 
], batch size: 332, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:03:39,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. limit=6.0 2023-06-22 18:03:47,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1287132.0, ans=0.2 2023-06-22 18:04:16,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1287192.0, ans=0.125 2023-06-22 18:04:26,135 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.97 vs. limit=6.0 2023-06-22 18:04:57,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1287312.0, ans=0.2 2023-06-22 18:05:07,892 INFO [train.py:996] (2/4) Epoch 8, batch 1100, loss[loss=0.2626, simple_loss=0.3377, pruned_loss=0.09377, over 21472.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3092, pruned_loss=0.08218, over 4263050.05 frames. ], batch size: 194, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:05:21,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.840e+02 4.476e+02 5.815e+02 7.362e+02 1.371e+03, threshold=1.163e+03, percent-clipped=15.0 2023-06-22 18:06:43,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1287612.0, ans=0.0 2023-06-22 18:06:48,155 INFO [train.py:996] (2/4) Epoch 8, batch 1150, loss[loss=0.2383, simple_loss=0.3048, pruned_loss=0.08589, over 21823.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3095, pruned_loss=0.0824, over 4267587.35 frames. ], batch size: 441, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:07:03,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1287732.0, ans=0.0 2023-06-22 18:07:47,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1287792.0, ans=0.1 2023-06-22 18:07:57,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1287852.0, ans=0.1 2023-06-22 18:08:02,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1287852.0, ans=0.125 2023-06-22 18:08:14,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1287912.0, ans=0.2 2023-06-22 18:08:17,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1287912.0, ans=0.125 2023-06-22 18:08:24,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-22 18:08:24,608 INFO [train.py:996] (2/4) Epoch 8, batch 1200, loss[loss=0.2228, simple_loss=0.312, pruned_loss=0.06677, over 21830.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3111, pruned_loss=0.0823, over 4273000.54 frames. 
], batch size: 282, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:08:43,479 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 3.897e+02 4.987e+02 7.014e+02 1.089e+03, threshold=9.974e+02, percent-clipped=0.0 2023-06-22 18:08:43,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1288032.0, ans=0.2 2023-06-22 18:08:59,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1288092.0, ans=0.0 2023-06-22 18:09:05,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1288092.0, ans=10.0 2023-06-22 18:09:47,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1288212.0, ans=0.125 2023-06-22 18:09:52,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1288212.0, ans=0.125 2023-06-22 18:10:03,629 INFO [train.py:996] (2/4) Epoch 8, batch 1250, loss[loss=0.2243, simple_loss=0.3023, pruned_loss=0.07321, over 21858.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3133, pruned_loss=0.08319, over 4282891.05 frames. ], batch size: 351, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:10:10,930 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-22 18:10:23,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-22 18:10:37,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.65 vs. limit=15.0 2023-06-22 18:10:51,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1288392.0, ans=0.0 2023-06-22 18:11:01,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1288392.0, ans=15.0 2023-06-22 18:11:23,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1288452.0, ans=0.0 2023-06-22 18:11:25,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1288452.0, ans=0.125 2023-06-22 18:11:29,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1288512.0, ans=0.125 2023-06-22 18:11:33,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1288512.0, ans=0.125 2023-06-22 18:11:38,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1288512.0, ans=0.125 2023-06-22 18:11:44,063 INFO [train.py:996] (2/4) Epoch 8, batch 1300, loss[loss=0.2444, simple_loss=0.3085, pruned_loss=0.09021, over 21340.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3154, pruned_loss=0.08395, over 4287303.52 frames. 
], batch size: 159, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:12:01,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1288572.0, ans=0.0 2023-06-22 18:12:04,927 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 4.208e+02 5.615e+02 7.044e+02 1.517e+03, threshold=1.123e+03, percent-clipped=9.0 2023-06-22 18:12:20,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=16.36 vs. limit=15.0 2023-06-22 18:12:52,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1288752.0, ans=0.0 2023-06-22 18:13:04,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.88 vs. limit=10.0 2023-06-22 18:13:24,627 INFO [train.py:996] (2/4) Epoch 8, batch 1350, loss[loss=0.3103, simple_loss=0.3691, pruned_loss=0.1258, over 21382.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3169, pruned_loss=0.08485, over 4287704.47 frames. ], batch size: 509, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:13:37,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1288872.0, ans=0.125 2023-06-22 18:14:18,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1288992.0, ans=0.1 2023-06-22 18:14:32,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289052.0, ans=0.1 2023-06-22 18:14:38,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1289052.0, ans=0.0 2023-06-22 18:14:48,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1289112.0, ans=0.125 2023-06-22 18:14:59,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1289112.0, ans=0.0 2023-06-22 18:15:05,851 INFO [train.py:996] (2/4) Epoch 8, batch 1400, loss[loss=0.2078, simple_loss=0.274, pruned_loss=0.07075, over 21643.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3146, pruned_loss=0.08426, over 4279215.08 frames. ], batch size: 298, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:15:18,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.65 vs. limit=22.5 2023-06-22 18:15:26,926 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.742e+02 3.782e+02 4.959e+02 6.793e+02 1.586e+03, threshold=9.917e+02, percent-clipped=6.0 2023-06-22 18:16:20,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.75 vs. 
limit=15.0 2023-06-22 18:16:21,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1289352.0, ans=0.0 2023-06-22 18:16:33,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1289412.0, ans=0.125 2023-06-22 18:16:41,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1289412.0, ans=0.125 2023-06-22 18:16:45,933 INFO [train.py:996] (2/4) Epoch 8, batch 1450, loss[loss=0.2234, simple_loss=0.2786, pruned_loss=0.08408, over 21467.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3153, pruned_loss=0.08518, over 4285054.04 frames. ], batch size: 195, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:16:51,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1289472.0, ans=0.0 2023-06-22 18:17:05,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1289532.0, ans=0.0 2023-06-22 18:17:06,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289532.0, ans=0.1 2023-06-22 18:17:36,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-22 18:17:43,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1289592.0, ans=0.0 2023-06-22 18:17:52,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1289652.0, ans=0.1 2023-06-22 18:18:25,450 INFO [train.py:996] (2/4) Epoch 8, batch 1500, loss[loss=0.2255, simple_loss=0.3013, pruned_loss=0.07483, over 21337.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3171, pruned_loss=0.08662, over 4289326.85 frames. ], batch size: 176, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:18:25,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1289772.0, ans=0.0 2023-06-22 18:18:29,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-22 18:18:46,210 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 3.712e+02 4.836e+02 6.899e+02 1.421e+03, threshold=9.672e+02, percent-clipped=7.0 2023-06-22 18:18:55,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-22 18:18:58,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1289832.0, ans=0.125 2023-06-22 18:19:11,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1289892.0, ans=0.0 2023-06-22 18:19:32,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1289952.0, ans=0.125 2023-06-22 18:20:07,054 INFO [train.py:996] (2/4) Epoch 8, batch 1550, loss[loss=0.1813, simple_loss=0.2505, pruned_loss=0.05603, over 21804.00 frames. 
], tot_loss[loss=0.2417, simple_loss=0.3143, pruned_loss=0.08459, over 4283527.93 frames. ], batch size: 124, lr: 3.85e-03, grad_scale: 16.0 2023-06-22 18:21:48,295 INFO [train.py:996] (2/4) Epoch 8, batch 1600, loss[loss=0.1814, simple_loss=0.2349, pruned_loss=0.06394, over 16246.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3155, pruned_loss=0.08496, over 4280160.13 frames. ], batch size: 62, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:22:16,349 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.959e+02 3.911e+02 5.598e+02 7.259e+02 1.641e+03, threshold=1.120e+03, percent-clipped=8.0 2023-06-22 18:22:33,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1290432.0, ans=0.125 2023-06-22 18:23:05,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1290552.0, ans=0.025 2023-06-22 18:23:13,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1290612.0, ans=0.1 2023-06-22 18:23:15,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1290612.0, ans=0.125 2023-06-22 18:23:36,622 INFO [train.py:996] (2/4) Epoch 8, batch 1650, loss[loss=0.2707, simple_loss=0.3353, pruned_loss=0.1031, over 21247.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3173, pruned_loss=0.08528, over 4282373.06 frames. ], batch size: 176, lr: 3.85e-03, grad_scale: 32.0 2023-06-22 18:23:58,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.56 vs. limit=15.0 2023-06-22 18:24:35,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1290852.0, ans=0.1 2023-06-22 18:24:40,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1290852.0, ans=0.0 2023-06-22 18:24:55,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1290912.0, ans=0.1 2023-06-22 18:25:12,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1290912.0, ans=0.0 2023-06-22 18:25:17,557 INFO [train.py:996] (2/4) Epoch 8, batch 1700, loss[loss=0.2326, simple_loss=0.3267, pruned_loss=0.06925, over 21726.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3203, pruned_loss=0.08626, over 4280254.41 frames. ], batch size: 298, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:25:39,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.23 vs. limit=10.0 2023-06-22 18:25:45,114 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.906e+02 3.944e+02 4.852e+02 6.481e+02 1.409e+03, threshold=9.704e+02, percent-clipped=2.0 2023-06-22 18:26:01,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1291092.0, ans=0.2 2023-06-22 18:26:57,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.13 vs. 
limit=15.0 2023-06-22 18:27:04,032 INFO [train.py:996] (2/4) Epoch 8, batch 1750, loss[loss=0.182, simple_loss=0.2828, pruned_loss=0.04063, over 21229.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3168, pruned_loss=0.08329, over 4281837.22 frames. ], batch size: 548, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:27:09,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1291272.0, ans=0.2 2023-06-22 18:27:16,145 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:27:18,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-22 18:27:41,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1291392.0, ans=0.2 2023-06-22 18:28:05,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1291452.0, ans=0.0 2023-06-22 18:28:45,994 INFO [train.py:996] (2/4) Epoch 8, batch 1800, loss[loss=0.2418, simple_loss=0.3293, pruned_loss=0.07719, over 19942.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3171, pruned_loss=0.08221, over 4279954.89 frames. ], batch size: 702, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:29:04,663 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.938e+02 3.886e+02 4.989e+02 8.763e+02 2.376e+03, threshold=9.977e+02, percent-clipped=20.0 2023-06-22 18:29:05,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-22 18:29:10,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1291632.0, ans=0.0 2023-06-22 18:29:19,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1291632.0, ans=0.125 2023-06-22 18:29:23,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1291692.0, ans=0.2 2023-06-22 18:30:05,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1291752.0, ans=0.07 2023-06-22 18:30:26,407 INFO [train.py:996] (2/4) Epoch 8, batch 1850, loss[loss=0.1964, simple_loss=0.2888, pruned_loss=0.05204, over 21370.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3156, pruned_loss=0.07978, over 4285108.83 frames. 
], batch size: 194, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:30:29,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1291872.0, ans=0.0 2023-06-22 18:30:34,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1291872.0, ans=0.125 2023-06-22 18:30:56,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1291932.0, ans=0.125 2023-06-22 18:31:21,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1291992.0, ans=0.125 2023-06-22 18:31:57,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1292112.0, ans=0.125 2023-06-22 18:32:06,596 INFO [train.py:996] (2/4) Epoch 8, batch 1900, loss[loss=0.219, simple_loss=0.298, pruned_loss=0.07005, over 21799.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3144, pruned_loss=0.08068, over 4280875.38 frames. ], batch size: 282, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:32:12,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.22 vs. limit=15.0 2023-06-22 18:32:16,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1292172.0, ans=0.125 2023-06-22 18:32:25,781 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.699e+02 3.836e+02 4.981e+02 6.397e+02 1.530e+03, threshold=9.962e+02, percent-clipped=6.0 2023-06-22 18:32:32,995 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=15.0 2023-06-22 18:32:43,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1292292.0, ans=0.0 2023-06-22 18:33:19,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1292352.0, ans=0.125 2023-06-22 18:33:38,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1292412.0, ans=0.125 2023-06-22 18:33:48,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1292472.0, ans=0.1 2023-06-22 18:33:49,931 INFO [train.py:996] (2/4) Epoch 8, batch 1950, loss[loss=0.2298, simple_loss=0.2911, pruned_loss=0.0842, over 15246.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3121, pruned_loss=0.08132, over 4268083.55 frames. 
], batch size: 60, lr: 3.84e-03, grad_scale: 8.0 2023-06-22 18:33:57,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1292472.0, ans=0.125 2023-06-22 18:34:29,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1292532.0, ans=0.125 2023-06-22 18:34:41,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1292592.0, ans=0.0 2023-06-22 18:35:07,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1292652.0, ans=0.125 2023-06-22 18:35:31,189 INFO [train.py:996] (2/4) Epoch 8, batch 2000, loss[loss=0.2181, simple_loss=0.2807, pruned_loss=0.07774, over 21470.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3092, pruned_loss=0.07964, over 4273949.11 frames. ], batch size: 195, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:35:39,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1292772.0, ans=0.125 2023-06-22 18:35:54,809 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.454e+02 4.321e+02 6.258e+02 9.587e+02 1.701e+03, threshold=1.252e+03, percent-clipped=22.0 2023-06-22 18:36:01,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1292832.0, ans=0.125 2023-06-22 18:36:19,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1292892.0, ans=0.0 2023-06-22 18:36:39,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1292952.0, ans=0.5 2023-06-22 18:36:54,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1292952.0, ans=0.125 2023-06-22 18:37:01,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.20 vs. limit=15.0 2023-06-22 18:37:10,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1293012.0, ans=0.1 2023-06-22 18:37:13,648 INFO [train.py:996] (2/4) Epoch 8, batch 2050, loss[loss=0.2686, simple_loss=0.379, pruned_loss=0.07906, over 19864.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3121, pruned_loss=0.07966, over 4279034.54 frames. ], batch size: 702, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:37:14,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1293072.0, ans=0.0 2023-06-22 18:37:29,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1293072.0, ans=0.125 2023-06-22 18:37:48,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293192.0, ans=0.1 2023-06-22 18:37:57,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.61 vs. 
limit=22.5 2023-06-22 18:38:24,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1293252.0, ans=0.125 2023-06-22 18:38:28,480 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-22 18:38:31,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-22 18:38:35,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1293312.0, ans=0.125 2023-06-22 18:38:43,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1293312.0, ans=0.2 2023-06-22 18:38:52,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1293372.0, ans=0.125 2023-06-22 18:38:53,102 INFO [train.py:996] (2/4) Epoch 8, batch 2100, loss[loss=0.2075, simple_loss=0.2933, pruned_loss=0.06083, over 21783.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3154, pruned_loss=0.08142, over 4274635.98 frames. ], batch size: 351, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:38:58,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1293372.0, ans=0.125 2023-06-22 18:39:08,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1293372.0, ans=0.2 2023-06-22 18:39:15,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1293432.0, ans=0.1 2023-06-22 18:39:17,073 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.801e+02 4.056e+02 5.317e+02 7.512e+02 1.644e+03, threshold=1.063e+03, percent-clipped=5.0 2023-06-22 18:39:44,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.50 vs. limit=12.0 2023-06-22 18:40:01,274 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 18:40:13,352 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-22 18:40:33,826 INFO [train.py:996] (2/4) Epoch 8, batch 2150, loss[loss=0.2613, simple_loss=0.3541, pruned_loss=0.08425, over 21760.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3145, pruned_loss=0.08241, over 4275898.45 frames. 
], batch size: 282, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:40:39,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1293672.0, ans=0.2 2023-06-22 18:40:58,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1293732.0, ans=0.0 2023-06-22 18:41:06,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1293732.0, ans=0.125 2023-06-22 18:41:18,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1293792.0, ans=0.95 2023-06-22 18:41:32,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.64 vs. limit=22.5 2023-06-22 18:41:35,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1293792.0, ans=0.1 2023-06-22 18:42:01,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1293912.0, ans=0.0 2023-06-22 18:42:09,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1293972.0, ans=0.125 2023-06-22 18:42:10,347 INFO [train.py:996] (2/4) Epoch 8, batch 2200, loss[loss=0.2696, simple_loss=0.3425, pruned_loss=0.0983, over 21888.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3161, pruned_loss=0.08195, over 4276305.77 frames. ], batch size: 351, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:42:21,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1293972.0, ans=0.0 2023-06-22 18:42:33,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.009e+02 3.898e+02 4.994e+02 6.578e+02 1.550e+03, threshold=9.987e+02, percent-clipped=10.0 2023-06-22 18:42:58,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1294092.0, ans=0.0 2023-06-22 18:43:29,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1294152.0, ans=0.2 2023-06-22 18:43:50,426 INFO [train.py:996] (2/4) Epoch 8, batch 2250, loss[loss=0.2189, simple_loss=0.3117, pruned_loss=0.06312, over 21440.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3122, pruned_loss=0.08071, over 4273422.56 frames. ], batch size: 211, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:44:54,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1294392.0, ans=0.125 2023-06-22 18:44:59,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1294452.0, ans=0.125 2023-06-22 18:45:18,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1294512.0, ans=0.02 2023-06-22 18:45:29,764 INFO [train.py:996] (2/4) Epoch 8, batch 2300, loss[loss=0.2446, simple_loss=0.3418, pruned_loss=0.07374, over 21640.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3079, pruned_loss=0.08102, over 4265525.33 frames. 
], batch size: 263, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:45:44,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1294572.0, ans=0.0 2023-06-22 18:45:53,598 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.836e+02 4.015e+02 5.277e+02 7.353e+02 1.540e+03, threshold=1.055e+03, percent-clipped=5.0 2023-06-22 18:46:49,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1294752.0, ans=0.0 2023-06-22 18:47:09,592 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=22.5 2023-06-22 18:47:11,793 INFO [train.py:996] (2/4) Epoch 8, batch 2350, loss[loss=0.3017, simple_loss=0.3713, pruned_loss=0.116, over 21768.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3073, pruned_loss=0.08212, over 4273221.91 frames. ], batch size: 441, lr: 3.84e-03, grad_scale: 16.0 2023-06-22 18:47:26,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1294872.0, ans=0.1 2023-06-22 18:47:49,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1294932.0, ans=0.125 2023-06-22 18:48:09,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1294992.0, ans=0.0 2023-06-22 18:48:13,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1294992.0, ans=0.0 2023-06-22 18:48:53,758 INFO [train.py:996] (2/4) Epoch 8, batch 2400, loss[loss=0.2396, simple_loss=0.3139, pruned_loss=0.08268, over 21228.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3106, pruned_loss=0.08417, over 4272796.57 frames. ], batch size: 548, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:49:18,980 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.938e+02 4.939e+02 6.884e+02 8.991e+02 1.831e+03, threshold=1.377e+03, percent-clipped=16.0 2023-06-22 18:49:59,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1295352.0, ans=0.2 2023-06-22 18:50:10,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1295352.0, ans=0.0 2023-06-22 18:50:35,303 INFO [train.py:996] (2/4) Epoch 8, batch 2450, loss[loss=0.217, simple_loss=0.2778, pruned_loss=0.07807, over 21874.00 frames. ], tot_loss[loss=0.242, simple_loss=0.314, pruned_loss=0.085, over 4268153.56 frames. 
], batch size: 107, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:50:50,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1295532.0, ans=0.125 2023-06-22 18:51:52,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1295652.0, ans=0.1 2023-06-22 18:51:58,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1295712.0, ans=0.035 2023-06-22 18:52:11,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1295712.0, ans=0.125 2023-06-22 18:52:14,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=13.38 vs. limit=15.0 2023-06-22 18:52:15,530 INFO [train.py:996] (2/4) Epoch 8, batch 2500, loss[loss=0.2141, simple_loss=0.2902, pruned_loss=0.06904, over 21726.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3126, pruned_loss=0.08367, over 4272161.44 frames. ], batch size: 124, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:52:24,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1295772.0, ans=0.1 2023-06-22 18:52:34,798 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.052e+02 4.409e+02 5.815e+02 8.522e+02 2.143e+03, threshold=1.163e+03, percent-clipped=4.0 2023-06-22 18:53:21,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.47 vs. limit=10.0 2023-06-22 18:53:57,198 INFO [train.py:996] (2/4) Epoch 8, batch 2550, loss[loss=0.227, simple_loss=0.2994, pruned_loss=0.0773, over 14936.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3094, pruned_loss=0.08287, over 4261694.72 frames. ], batch size: 61, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:53:57,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1296072.0, ans=0.1 2023-06-22 18:54:09,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1296072.0, ans=0.2 2023-06-22 18:54:58,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1296192.0, ans=0.02 2023-06-22 18:55:01,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1296252.0, ans=0.0 2023-06-22 18:55:06,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1296252.0, ans=0.1 2023-06-22 18:55:07,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1296252.0, ans=0.0 2023-06-22 18:55:37,957 INFO [train.py:996] (2/4) Epoch 8, batch 2600, loss[loss=0.2692, simple_loss=0.3324, pruned_loss=0.103, over 21309.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3104, pruned_loss=0.08395, over 4263633.62 frames. 
], batch size: 143, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:55:41,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1296372.0, ans=0.0 2023-06-22 18:55:43,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1296372.0, ans=0.125 2023-06-22 18:55:57,331 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.832e+02 4.000e+02 4.999e+02 6.903e+02 1.017e+03, threshold=9.998e+02, percent-clipped=0.0 2023-06-22 18:56:26,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.45 vs. limit=15.0 2023-06-22 18:56:45,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1296552.0, ans=0.1 2023-06-22 18:57:12,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-22 18:57:14,597 INFO [train.py:996] (2/4) Epoch 8, batch 2650, loss[loss=0.2459, simple_loss=0.3266, pruned_loss=0.08259, over 21807.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3122, pruned_loss=0.08577, over 4275901.63 frames. ], batch size: 282, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:57:27,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1296672.0, ans=0.125 2023-06-22 18:57:31,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1296732.0, ans=0.125 2023-06-22 18:58:55,286 INFO [train.py:996] (2/4) Epoch 8, batch 2700, loss[loss=0.2506, simple_loss=0.3125, pruned_loss=0.0944, over 21794.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3117, pruned_loss=0.08478, over 4280397.38 frames. ], batch size: 124, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 18:59:13,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.77 vs. limit=15.0 2023-06-22 18:59:14,442 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.148e+02 4.384e+02 5.256e+02 7.143e+02 1.333e+03, threshold=1.051e+03, percent-clipped=8.0 2023-06-22 19:00:31,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1297212.0, ans=0.125 2023-06-22 19:00:37,135 INFO [train.py:996] (2/4) Epoch 8, batch 2750, loss[loss=0.2511, simple_loss=0.331, pruned_loss=0.08556, over 21354.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3116, pruned_loss=0.08381, over 4285023.33 frames. ], batch size: 159, lr: 3.84e-03, grad_scale: 32.0 2023-06-22 19:01:35,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1297392.0, ans=0.0 2023-06-22 19:02:18,174 INFO [train.py:996] (2/4) Epoch 8, batch 2800, loss[loss=0.2892, simple_loss=0.348, pruned_loss=0.1152, over 21406.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3168, pruned_loss=0.08556, over 4284223.23 frames. 
], batch size: 211, lr: 3.83e-03, grad_scale: 32.0 2023-06-22 19:02:56,030 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.531e+02 5.975e+02 9.124e+02 1.757e+03, threshold=1.195e+03, percent-clipped=17.0 2023-06-22 19:04:01,694 INFO [train.py:996] (2/4) Epoch 8, batch 2850, loss[loss=0.2636, simple_loss=0.3255, pruned_loss=0.1008, over 21330.00 frames. ], tot_loss[loss=0.2478, simple_loss=0.3202, pruned_loss=0.08771, over 4286194.47 frames. ], batch size: 176, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:04:43,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=22.5 2023-06-22 19:05:17,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1298052.0, ans=0.0 2023-06-22 19:05:36,698 INFO [train.py:996] (2/4) Epoch 8, batch 2900, loss[loss=0.2266, simple_loss=0.2924, pruned_loss=0.08043, over 21806.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.316, pruned_loss=0.08689, over 4285286.96 frames. ], batch size: 282, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:06:12,619 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 4.241e+02 6.077e+02 8.455e+02 1.821e+03, threshold=1.215e+03, percent-clipped=6.0 2023-06-22 19:06:22,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-22 19:07:02,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1298412.0, ans=0.125 2023-06-22 19:07:15,530 INFO [train.py:996] (2/4) Epoch 8, batch 2950, loss[loss=0.2696, simple_loss=0.3729, pruned_loss=0.08317, over 20866.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.319, pruned_loss=0.08714, over 4287284.71 frames. ], batch size: 607, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:07:52,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-22 19:08:16,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=22.5 2023-06-22 19:08:16,966 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-22 19:08:56,478 INFO [train.py:996] (2/4) Epoch 8, batch 3000, loss[loss=0.2578, simple_loss=0.3325, pruned_loss=0.09156, over 21796.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3226, pruned_loss=0.08806, over 4290677.02 frames. ], batch size: 247, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:08:56,479 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 19:09:14,197 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.5164, 3.8665, 4.0163, 4.2307], device='cuda:2') 2023-06-22 19:09:17,918 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2518, simple_loss=0.3464, pruned_loss=0.0786, over 1796401.00 frames. 
2023-06-22 19:09:17,919 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 19:09:40,430 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.798e+02 4.528e+02 5.625e+02 8.163e+02 1.642e+03, threshold=1.125e+03, percent-clipped=6.0 2023-06-22 19:09:48,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1298892.0, ans=0.0 2023-06-22 19:09:51,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1298892.0, ans=0.0 2023-06-22 19:10:00,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.04 vs. limit=15.0 2023-06-22 19:10:18,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1298952.0, ans=0.1 2023-06-22 19:10:24,470 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-06-22 19:10:38,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299012.0, ans=0.1 2023-06-22 19:10:58,776 INFO [train.py:996] (2/4) Epoch 8, batch 3050, loss[loss=0.2102, simple_loss=0.3056, pruned_loss=0.05743, over 21769.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3194, pruned_loss=0.08557, over 4286397.75 frames. ], batch size: 351, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:11:51,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299252.0, ans=0.1 2023-06-22 19:12:17,931 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=22.5 2023-06-22 19:12:18,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1299312.0, ans=0.1 2023-06-22 19:12:38,538 INFO [train.py:996] (2/4) Epoch 8, batch 3100, loss[loss=0.2478, simple_loss=0.3361, pruned_loss=0.07969, over 21687.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3205, pruned_loss=0.08559, over 4282172.32 frames. ], batch size: 298, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:12:45,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1299372.0, ans=0.125 2023-06-22 19:12:51,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1299372.0, ans=0.0 2023-06-22 19:13:02,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1299432.0, ans=0.0 2023-06-22 19:13:05,464 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.778e+02 3.934e+02 5.608e+02 7.913e+02 1.726e+03, threshold=1.122e+03, percent-clipped=9.0 2023-06-22 19:13:36,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1299552.0, ans=0.2 2023-06-22 19:14:11,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1299612.0, ans=0.5 2023-06-22 19:14:18,774 INFO [train.py:996] (2/4) Epoch 8, batch 3150, loss[loss=0.2497, simple_loss=0.3239, pruned_loss=0.08771, over 21569.00 frames. 
], tot_loss[loss=0.247, simple_loss=0.3219, pruned_loss=0.08602, over 4274779.79 frames. ], batch size: 230, lr: 3.83e-03, grad_scale: 8.0 2023-06-22 19:14:23,252 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-22 19:14:37,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1299672.0, ans=0.0 2023-06-22 19:14:44,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1299732.0, ans=0.2 2023-06-22 19:15:41,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1299852.0, ans=0.0 2023-06-22 19:16:06,902 INFO [train.py:996] (2/4) Epoch 8, batch 3200, loss[loss=0.2285, simple_loss=0.312, pruned_loss=0.07257, over 21778.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3224, pruned_loss=0.08539, over 4276597.29 frames. ], batch size: 332, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:16:14,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.97 vs. limit=15.0 2023-06-22 19:16:29,619 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.108e+02 3.882e+02 4.322e+02 5.833e+02 1.816e+03, threshold=8.643e+02, percent-clipped=1.0 2023-06-22 19:17:07,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1300092.0, ans=0.0 2023-06-22 19:17:10,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1300152.0, ans=0.0 2023-06-22 19:17:15,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1300152.0, ans=0.2 2023-06-22 19:17:22,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1300152.0, ans=0.125 2023-06-22 19:17:46,243 INFO [train.py:996] (2/4) Epoch 8, batch 3250, loss[loss=0.2466, simple_loss=0.3047, pruned_loss=0.09426, over 21345.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3229, pruned_loss=0.08693, over 4265191.67 frames. ], batch size: 194, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:17:46,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1300272.0, ans=0.125 2023-06-22 19:17:48,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=12.0 2023-06-22 19:18:01,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1300332.0, ans=0.125 2023-06-22 19:18:09,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.21 vs. limit=15.0 2023-06-22 19:18:26,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1300392.0, ans=22.5 2023-06-22 19:19:01,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. 
limit=15.0 2023-06-22 19:19:25,807 INFO [train.py:996] (2/4) Epoch 8, batch 3300, loss[loss=0.2445, simple_loss=0.331, pruned_loss=0.07902, over 21315.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3185, pruned_loss=0.0858, over 4267600.89 frames. ], batch size: 549, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:19:48,433 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 4.529e+02 6.033e+02 9.657e+02 1.783e+03, threshold=1.207e+03, percent-clipped=28.0 2023-06-22 19:20:32,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1300752.0, ans=0.2 2023-06-22 19:20:46,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1300752.0, ans=0.1 2023-06-22 19:21:04,774 INFO [train.py:996] (2/4) Epoch 8, batch 3350, loss[loss=0.2868, simple_loss=0.3429, pruned_loss=0.1154, over 21594.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3197, pruned_loss=0.08569, over 4263823.48 frames. ], batch size: 471, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:21:32,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1300932.0, ans=0.125 2023-06-22 19:21:53,233 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.04 vs. limit=15.0 2023-06-22 19:21:57,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1300992.0, ans=0.05 2023-06-22 19:22:11,796 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.25 vs. limit=10.0 2023-06-22 19:22:21,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-06-22 19:22:36,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.90 vs. limit=10.0 2023-06-22 19:22:43,277 INFO [train.py:996] (2/4) Epoch 8, batch 3400, loss[loss=0.2423, simple_loss=0.3537, pruned_loss=0.06542, over 21225.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3196, pruned_loss=0.08533, over 4275827.50 frames. ], batch size: 549, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:22:56,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1301172.0, ans=0.1 2023-06-22 19:23:06,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1301232.0, ans=0.0 2023-06-22 19:23:07,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-22 19:23:16,072 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.036e+02 4.257e+02 5.465e+02 6.871e+02 1.586e+03, threshold=1.093e+03, percent-clipped=5.0 2023-06-22 19:23:43,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. 
limit=6.0 2023-06-22 19:23:50,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1301352.0, ans=0.125 2023-06-22 19:24:06,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1301412.0, ans=0.2 2023-06-22 19:24:24,318 INFO [train.py:996] (2/4) Epoch 8, batch 3450, loss[loss=0.2273, simple_loss=0.2867, pruned_loss=0.08396, over 21517.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3169, pruned_loss=0.0865, over 4274854.26 frames. ], batch size: 132, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:24:37,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-22 19:24:39,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.13 vs. limit=15.0 2023-06-22 19:24:58,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1301532.0, ans=0.125 2023-06-22 19:25:05,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1301532.0, ans=0.0 2023-06-22 19:25:29,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1301592.0, ans=0.0 2023-06-22 19:25:31,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-22 19:26:09,200 INFO [train.py:996] (2/4) Epoch 8, batch 3500, loss[loss=0.2439, simple_loss=0.3208, pruned_loss=0.08349, over 21657.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3246, pruned_loss=0.08962, over 4276740.73 frames. ], batch size: 263, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:26:09,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1301772.0, ans=0.125 2023-06-22 19:26:29,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1301832.0, ans=0.125 2023-06-22 19:26:36,795 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.220e+02 4.832e+02 6.635e+02 8.517e+02 1.814e+03, threshold=1.327e+03, percent-clipped=16.0 2023-06-22 19:26:48,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1301892.0, ans=0.0 2023-06-22 19:27:40,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1302012.0, ans=0.2 2023-06-22 19:27:42,633 INFO [train.py:996] (2/4) Epoch 8, batch 3550, loss[loss=0.2141, simple_loss=0.2812, pruned_loss=0.07353, over 21625.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3263, pruned_loss=0.09002, over 4274333.08 frames. 
], batch size: 298, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:28:13,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1302132.0, ans=0.0 2023-06-22 19:28:48,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1302252.0, ans=0.125 2023-06-22 19:28:53,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1302252.0, ans=0.0 2023-06-22 19:29:21,127 INFO [train.py:996] (2/4) Epoch 8, batch 3600, loss[loss=0.2475, simple_loss=0.3179, pruned_loss=0.08854, over 21721.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3216, pruned_loss=0.08976, over 4267001.28 frames. ], batch size: 351, lr: 3.83e-03, grad_scale: 32.0 2023-06-22 19:29:28,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1302372.0, ans=0.125 2023-06-22 19:29:30,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1302372.0, ans=0.125 2023-06-22 19:29:48,389 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.074e+02 4.388e+02 6.270e+02 8.797e+02 1.377e+03, threshold=1.254e+03, percent-clipped=1.0 2023-06-22 19:30:23,817 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:30:43,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1302612.0, ans=0.125 2023-06-22 19:30:44,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1302612.0, ans=0.125 2023-06-22 19:30:59,800 INFO [train.py:996] (2/4) Epoch 8, batch 3650, loss[loss=0.2376, simple_loss=0.3275, pruned_loss=0.0739, over 21668.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3231, pruned_loss=0.09053, over 4273769.57 frames. ], batch size: 389, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:31:06,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1302672.0, ans=0.5 2023-06-22 19:31:24,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1302732.0, ans=0.025 2023-06-22 19:32:00,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1302852.0, ans=0.125 2023-06-22 19:32:02,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1302852.0, ans=0.125 2023-06-22 19:32:22,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-22 19:32:37,354 INFO [train.py:996] (2/4) Epoch 8, batch 3700, loss[loss=0.2234, simple_loss=0.2793, pruned_loss=0.08376, over 21435.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.32, pruned_loss=0.08966, over 4273891.93 frames. ], batch size: 212, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:32:56,062 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=19.62 vs. 
limit=22.5 2023-06-22 19:32:58,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1303032.0, ans=0.125 2023-06-22 19:33:05,940 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.810e+02 4.107e+02 5.215e+02 7.517e+02 1.439e+03, threshold=1.043e+03, percent-clipped=3.0 2023-06-22 19:33:25,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-22 19:33:27,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1303092.0, ans=0.125 2023-06-22 19:33:53,979 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:33:58,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1303212.0, ans=0.125 2023-06-22 19:34:03,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=15.0 2023-06-22 19:34:16,534 INFO [train.py:996] (2/4) Epoch 8, batch 3750, loss[loss=0.1953, simple_loss=0.2775, pruned_loss=0.05661, over 21741.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.319, pruned_loss=0.08895, over 4275532.39 frames. ], batch size: 282, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:35:24,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1303452.0, ans=0.125 2023-06-22 19:36:00,327 INFO [train.py:996] (2/4) Epoch 8, batch 3800, loss[loss=0.3258, simple_loss=0.3751, pruned_loss=0.1383, over 21408.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3176, pruned_loss=0.08802, over 4277893.82 frames. ], batch size: 509, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:36:02,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1303572.0, ans=0.1 2023-06-22 19:36:04,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1303572.0, ans=0.0 2023-06-22 19:36:28,072 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.955e+02 4.734e+02 6.125e+02 7.875e+02 1.546e+03, threshold=1.225e+03, percent-clipped=6.0 2023-06-22 19:37:25,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1303812.0, ans=0.0 2023-06-22 19:37:37,320 INFO [train.py:996] (2/4) Epoch 8, batch 3850, loss[loss=0.2434, simple_loss=0.2995, pruned_loss=0.09363, over 22003.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3152, pruned_loss=0.08817, over 4272644.60 frames. 
], batch size: 103, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:37:46,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1303872.0, ans=0.035 2023-06-22 19:37:54,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1303872.0, ans=0.2 2023-06-22 19:38:06,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1303932.0, ans=0.125 2023-06-22 19:38:21,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.37 vs. limit=15.0 2023-06-22 19:38:49,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1304112.0, ans=0.04949747468305833 2023-06-22 19:38:51,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-22 19:39:03,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1304112.0, ans=0.125 2023-06-22 19:39:16,043 INFO [train.py:996] (2/4) Epoch 8, batch 3900, loss[loss=0.2309, simple_loss=0.2932, pruned_loss=0.08429, over 21758.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3128, pruned_loss=0.08758, over 4270161.49 frames. ], batch size: 112, lr: 3.83e-03, grad_scale: 16.0 2023-06-22 19:39:45,202 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.973e+02 4.645e+02 5.915e+02 7.788e+02 1.896e+03, threshold=1.183e+03, percent-clipped=6.0 2023-06-22 19:39:46,334 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.21 vs. limit=15.0 2023-06-22 19:40:56,631 INFO [train.py:996] (2/4) Epoch 8, batch 3950, loss[loss=0.1842, simple_loss=0.2776, pruned_loss=0.04543, over 21747.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3143, pruned_loss=0.08715, over 4276742.38 frames. ], batch size: 332, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:41:55,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0 2023-06-22 19:42:36,352 INFO [train.py:996] (2/4) Epoch 8, batch 4000, loss[loss=0.2264, simple_loss=0.2857, pruned_loss=0.08351, over 20198.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3071, pruned_loss=0.08283, over 4267050.84 frames. ], batch size: 703, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:42:56,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1304832.0, ans=0.0 2023-06-22 19:42:58,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-22 19:43:05,265 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.962e+02 4.102e+02 5.710e+02 7.605e+02 1.219e+03, threshold=1.142e+03, percent-clipped=1.0 2023-06-22 19:44:21,099 INFO [train.py:996] (2/4) Epoch 8, batch 4050, loss[loss=0.2124, simple_loss=0.3058, pruned_loss=0.05949, over 21752.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3058, pruned_loss=0.08047, over 4264727.83 frames. 
], batch size: 351, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:44:23,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1305072.0, ans=0.125 2023-06-22 19:44:27,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1305072.0, ans=0.125 2023-06-22 19:44:51,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1305192.0, ans=0.0 2023-06-22 19:45:28,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1305252.0, ans=0.1 2023-06-22 19:45:46,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1305312.0, ans=0.1 2023-06-22 19:45:53,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-22 19:46:00,750 INFO [train.py:996] (2/4) Epoch 8, batch 4100, loss[loss=0.2399, simple_loss=0.3225, pruned_loss=0.07866, over 21774.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3068, pruned_loss=0.08097, over 4267725.16 frames. ], batch size: 298, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 19:46:07,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1305372.0, ans=0.125 2023-06-22 19:46:12,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1305372.0, ans=0.025 2023-06-22 19:46:23,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1305432.0, ans=0.0 2023-06-22 19:46:26,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.689e+02 3.586e+02 4.759e+02 6.058e+02 1.628e+03, threshold=9.517e+02, percent-clipped=6.0 2023-06-22 19:46:41,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1305492.0, ans=0.125 2023-06-22 19:47:02,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1305552.0, ans=0.0 2023-06-22 19:47:40,287 INFO [train.py:996] (2/4) Epoch 8, batch 4150, loss[loss=0.1911, simple_loss=0.2756, pruned_loss=0.05329, over 21697.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3088, pruned_loss=0.07854, over 4271174.90 frames. ], batch size: 282, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:47:58,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1305732.0, ans=0.125 2023-06-22 19:48:39,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1305852.0, ans=0.2 2023-06-22 19:49:09,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1305912.0, ans=0.1 2023-06-22 19:49:10,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1305912.0, ans=0.0 2023-06-22 19:49:23,580 INFO [train.py:996] (2/4) Epoch 8, batch 4200, loss[loss=0.2203, simple_loss=0.2867, pruned_loss=0.07695, over 21104.00 frames. 
], tot_loss[loss=0.234, simple_loss=0.3101, pruned_loss=0.07896, over 4270786.97 frames. ], batch size: 143, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:49:32,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.90 vs. limit=15.0 2023-06-22 19:49:57,475 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.597e+02 4.423e+02 6.282e+02 9.309e+02 2.210e+03, threshold=1.256e+03, percent-clipped=22.0 2023-06-22 19:50:15,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1306092.0, ans=0.1 2023-06-22 19:50:57,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1306212.0, ans=0.125 2023-06-22 19:51:05,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. limit=6.0 2023-06-22 19:51:06,561 INFO [train.py:996] (2/4) Epoch 8, batch 4250, loss[loss=0.2714, simple_loss=0.3781, pruned_loss=0.08232, over 21274.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3175, pruned_loss=0.08038, over 4268998.35 frames. ], batch size: 549, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:51:25,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1306272.0, ans=0.125 2023-06-22 19:51:34,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1306332.0, ans=0.0 2023-06-22 19:52:27,527 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.53 vs. limit=10.0 2023-06-22 19:52:55,166 INFO [train.py:996] (2/4) Epoch 8, batch 4300, loss[loss=0.2222, simple_loss=0.3206, pruned_loss=0.06193, over 21851.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.325, pruned_loss=0.08302, over 4275515.78 frames. ], batch size: 316, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:53:38,414 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.109e+02 4.490e+02 6.473e+02 1.024e+03 2.368e+03, threshold=1.295e+03, percent-clipped=12.0 2023-06-22 19:53:38,894 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:53:45,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1306692.0, ans=0.0 2023-06-22 19:54:00,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-22 19:54:03,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-22 19:54:35,569 INFO [train.py:996] (2/4) Epoch 8, batch 4350, loss[loss=0.2253, simple_loss=0.3058, pruned_loss=0.0724, over 21184.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3221, pruned_loss=0.08194, over 4265026.65 frames. 
], batch size: 548, lr: 3.82e-03, grad_scale: 8.0 2023-06-22 19:54:53,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1306872.0, ans=0.0 2023-06-22 19:55:42,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1307052.0, ans=0.125 2023-06-22 19:55:54,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-22 19:56:11,630 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:56:15,889 INFO [train.py:996] (2/4) Epoch 8, batch 4400, loss[loss=0.2187, simple_loss=0.2975, pruned_loss=0.07001, over 21201.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3174, pruned_loss=0.0821, over 4266413.55 frames. ], batch size: 159, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:56:46,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1307232.0, ans=0.05 2023-06-22 19:56:53,710 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.962e+02 4.391e+02 5.978e+02 7.745e+02 1.639e+03, threshold=1.196e+03, percent-clipped=7.0 2023-06-22 19:57:04,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1307292.0, ans=0.125 2023-06-22 19:57:12,124 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 19:57:36,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-22 19:58:01,993 INFO [train.py:996] (2/4) Epoch 8, batch 4450, loss[loss=0.3399, simple_loss=0.4485, pruned_loss=0.1156, over 21222.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3274, pruned_loss=0.08431, over 4273493.15 frames. ], batch size: 548, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 19:58:22,528 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.98 vs. limit=6.0 2023-06-22 19:58:23,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1307532.0, ans=0.0 2023-06-22 19:58:28,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-22 19:58:59,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1307652.0, ans=0.0 2023-06-22 19:59:04,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1307652.0, ans=0.125 2023-06-22 19:59:14,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. 
limit=6.0 2023-06-22 19:59:30,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1307712.0, ans=0.05 2023-06-22 19:59:39,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.79 vs. limit=15.0 2023-06-22 19:59:48,263 INFO [train.py:996] (2/4) Epoch 8, batch 4500, loss[loss=0.294, simple_loss=0.36, pruned_loss=0.114, over 21629.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.328, pruned_loss=0.08684, over 4280142.10 frames. ], batch size: 471, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:00:00,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1307772.0, ans=0.1 2023-06-22 20:00:11,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.28 vs. limit=15.0 2023-06-22 20:00:14,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.895e+02 4.151e+02 5.436e+02 7.450e+02 1.876e+03, threshold=1.087e+03, percent-clipped=7.0 2023-06-22 20:01:12,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1308012.0, ans=0.1 2023-06-22 20:01:20,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1308012.0, ans=0.2 2023-06-22 20:01:22,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1308012.0, ans=0.0 2023-06-22 20:01:28,208 INFO [train.py:996] (2/4) Epoch 8, batch 4550, loss[loss=0.2679, simple_loss=0.3437, pruned_loss=0.09605, over 21903.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3296, pruned_loss=0.08733, over 4277125.62 frames. ], batch size: 372, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:02:07,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1308192.0, ans=0.0 2023-06-22 20:02:10,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1308192.0, ans=0.0 2023-06-22 20:02:14,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1308192.0, ans=0.0 2023-06-22 20:02:26,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1308252.0, ans=0.0 2023-06-22 20:03:08,576 INFO [train.py:996] (2/4) Epoch 8, batch 4600, loss[loss=0.2346, simple_loss=0.3139, pruned_loss=0.0776, over 21839.00 frames. ], tot_loss[loss=0.2548, simple_loss=0.3321, pruned_loss=0.08877, over 4278387.48 frames. 
], batch size: 332, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:03:15,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1308372.0, ans=0.0 2023-06-22 20:03:40,808 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.710e+02 4.165e+02 5.279e+02 6.740e+02 1.716e+03, threshold=1.056e+03, percent-clipped=6.0 2023-06-22 20:03:47,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1308492.0, ans=0.125 2023-06-22 20:03:58,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1308492.0, ans=0.125 2023-06-22 20:04:09,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1308552.0, ans=0.125 2023-06-22 20:04:47,311 INFO [train.py:996] (2/4) Epoch 8, batch 4650, loss[loss=0.1999, simple_loss=0.264, pruned_loss=0.0679, over 21314.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3266, pruned_loss=0.0872, over 4286866.99 frames. ], batch size: 176, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:05:03,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.04 vs. limit=22.5 2023-06-22 20:05:03,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1308732.0, ans=0.125 2023-06-22 20:05:28,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1308792.0, ans=22.5 2023-06-22 20:05:58,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1308852.0, ans=0.05 2023-06-22 20:06:06,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1308912.0, ans=0.0 2023-06-22 20:06:22,465 INFO [train.py:996] (2/4) Epoch 8, batch 4700, loss[loss=0.2382, simple_loss=0.2973, pruned_loss=0.08955, over 21522.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3156, pruned_loss=0.0843, over 4291908.47 frames. ], batch size: 391, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:06:25,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1308972.0, ans=0.125 2023-06-22 20:06:54,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.434e+02 3.523e+02 4.183e+02 5.885e+02 1.412e+03, threshold=8.365e+02, percent-clipped=3.0 2023-06-22 20:07:03,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-22 20:07:06,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1309092.0, ans=0.0 2023-06-22 20:07:08,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.54 vs. 
limit=10.0 2023-06-22 20:07:33,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1309152.0, ans=0.125 2023-06-22 20:07:39,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1309152.0, ans=0.2 2023-06-22 20:08:02,201 INFO [train.py:996] (2/4) Epoch 8, batch 4750, loss[loss=0.2456, simple_loss=0.3094, pruned_loss=0.09093, over 21865.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3092, pruned_loss=0.08333, over 4293032.03 frames. ], batch size: 415, lr: 3.82e-03, grad_scale: 16.0 2023-06-22 20:08:14,254 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:08:31,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-22 20:09:34,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1309512.0, ans=0.0 2023-06-22 20:09:34,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1309512.0, ans=0.125 2023-06-22 20:09:42,387 INFO [train.py:996] (2/4) Epoch 8, batch 4800, loss[loss=0.2436, simple_loss=0.3091, pruned_loss=0.08908, over 21297.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3103, pruned_loss=0.08429, over 4291210.39 frames. ], batch size: 176, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:10:03,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1309632.0, ans=0.07 2023-06-22 20:10:14,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.048e+02 4.179e+02 5.198e+02 6.996e+02 1.429e+03, threshold=1.040e+03, percent-clipped=10.0 2023-06-22 20:10:16,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1309632.0, ans=0.1 2023-06-22 20:10:22,871 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:10:44,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1309752.0, ans=0.125 2023-06-22 20:11:21,048 INFO [train.py:996] (2/4) Epoch 8, batch 4850, loss[loss=0.2116, simple_loss=0.2819, pruned_loss=0.07066, over 15586.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3101, pruned_loss=0.08315, over 4279422.86 frames. ], batch size: 60, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:11:34,604 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:12:04,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1309992.0, ans=0.0 2023-06-22 20:12:43,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1310112.0, ans=0.1 2023-06-22 20:12:47,311 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-22 20:13:00,620 INFO [train.py:996] (2/4) Epoch 8, batch 4900, loss[loss=0.2385, simple_loss=0.3113, pruned_loss=0.08286, over 16541.00 frames. 
], tot_loss[loss=0.2396, simple_loss=0.3108, pruned_loss=0.08418, over 4282399.49 frames. ], batch size: 63, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:13:26,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1310232.0, ans=0.2 2023-06-22 20:13:32,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.49 vs. limit=22.5 2023-06-22 20:13:32,427 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.010e+02 3.969e+02 5.003e+02 6.919e+02 1.603e+03, threshold=1.001e+03, percent-clipped=6.0 2023-06-22 20:13:52,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1310292.0, ans=0.09899494936611666 2023-06-22 20:14:27,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1310412.0, ans=0.0 2023-06-22 20:14:40,865 INFO [train.py:996] (2/4) Epoch 8, batch 4950, loss[loss=0.2189, simple_loss=0.3261, pruned_loss=0.0559, over 21158.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3139, pruned_loss=0.08233, over 4275781.13 frames. ], batch size: 548, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:14:59,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1310472.0, ans=0.04949747468305833 2023-06-22 20:15:07,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1310532.0, ans=0.0 2023-06-22 20:16:25,020 INFO [train.py:996] (2/4) Epoch 8, batch 5000, loss[loss=0.2632, simple_loss=0.3839, pruned_loss=0.07119, over 20750.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3137, pruned_loss=0.07858, over 4276048.22 frames. ], batch size: 607, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:16:34,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1310772.0, ans=0.0 2023-06-22 20:16:43,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-22 20:16:51,889 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.498e+02 3.608e+02 4.622e+02 7.271e+02 1.664e+03, threshold=9.243e+02, percent-clipped=6.0 2023-06-22 20:17:32,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=15.0 2023-06-22 20:17:33,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1310952.0, ans=0.0 2023-06-22 20:17:53,495 INFO [train.py:996] (2/4) Epoch 8, batch 5050, loss[loss=0.2245, simple_loss=0.2961, pruned_loss=0.07642, over 21945.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3143, pruned_loss=0.08005, over 4281863.52 frames. ], batch size: 316, lr: 3.82e-03, grad_scale: 32.0 2023-06-22 20:19:28,950 INFO [train.py:996] (2/4) Epoch 8, batch 5100, loss[loss=0.2426, simple_loss=0.3143, pruned_loss=0.0854, over 21402.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3137, pruned_loss=0.0816, over 4291246.89 frames. 
], batch size: 176, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:20:02,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.036e+02 3.868e+02 4.783e+02 6.520e+02 1.021e+03, threshold=9.567e+02, percent-clipped=2.0 2023-06-22 20:20:43,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=15.0 2023-06-22 20:20:51,741 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-22 20:20:58,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1311612.0, ans=0.0 2023-06-22 20:21:08,406 INFO [train.py:996] (2/4) Epoch 8, batch 5150, loss[loss=0.2467, simple_loss=0.3146, pruned_loss=0.08935, over 21348.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3122, pruned_loss=0.08233, over 4292333.07 frames. ], batch size: 159, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:22:03,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1311792.0, ans=0.125 2023-06-22 20:22:51,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1311972.0, ans=0.1 2023-06-22 20:22:51,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-22 20:22:52,482 INFO [train.py:996] (2/4) Epoch 8, batch 5200, loss[loss=0.2669, simple_loss=0.3627, pruned_loss=0.08554, over 21863.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3151, pruned_loss=0.08335, over 4291481.49 frames. ], batch size: 371, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:23:01,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-22 20:23:26,969 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.134e+02 4.423e+02 5.674e+02 8.806e+02 1.736e+03, threshold=1.135e+03, percent-clipped=18.0 2023-06-22 20:23:43,174 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:23:58,326 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=22.5 2023-06-22 20:24:32,942 INFO [train.py:996] (2/4) Epoch 8, batch 5250, loss[loss=0.2046, simple_loss=0.2868, pruned_loss=0.06118, over 21321.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3187, pruned_loss=0.08226, over 4282978.91 frames. 
], batch size: 131, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:25:28,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1312392.0, ans=0.125 2023-06-22 20:25:30,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1312392.0, ans=0.125 2023-06-22 20:25:50,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1312512.0, ans=0.0 2023-06-22 20:26:11,365 INFO [train.py:996] (2/4) Epoch 8, batch 5300, loss[loss=0.277, simple_loss=0.3391, pruned_loss=0.1074, over 21812.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3187, pruned_loss=0.08402, over 4285964.04 frames. ], batch size: 441, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:26:44,648 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.970e+02 3.721e+02 4.525e+02 6.404e+02 1.262e+03, threshold=9.050e+02, percent-clipped=2.0 2023-06-22 20:27:26,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-22 20:27:49,127 INFO [train.py:996] (2/4) Epoch 8, batch 5350, loss[loss=0.2486, simple_loss=0.3034, pruned_loss=0.09692, over 21587.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3162, pruned_loss=0.08439, over 4292663.59 frames. ], batch size: 548, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:27:58,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.68 vs. limit=15.0 2023-06-22 20:28:02,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1312872.0, ans=0.0 2023-06-22 20:28:11,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-22 20:29:15,577 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:29:17,799 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.92 vs. limit=6.0 2023-06-22 20:29:22,747 INFO [train.py:996] (2/4) Epoch 8, batch 5400, loss[loss=0.2161, simple_loss=0.2904, pruned_loss=0.07092, over 21715.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3149, pruned_loss=0.08556, over 4294683.36 frames. ], batch size: 247, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:29:36,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1313172.0, ans=0.1 2023-06-22 20:30:01,565 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.001e+02 4.337e+02 6.656e+02 9.891e+02 1.935e+03, threshold=1.331e+03, percent-clipped=29.0 2023-06-22 20:30:06,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1313292.0, ans=0.1 2023-06-22 20:30:32,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.81 vs. 
limit=12.0 2023-06-22 20:30:49,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1313412.0, ans=0.125 2023-06-22 20:31:03,440 INFO [train.py:996] (2/4) Epoch 8, batch 5450, loss[loss=0.2334, simple_loss=0.316, pruned_loss=0.07539, over 21555.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3164, pruned_loss=0.08361, over 4291679.91 frames. ], batch size: 194, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:31:19,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1313472.0, ans=0.2 2023-06-22 20:31:32,603 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-06-22 20:31:56,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1313592.0, ans=0.0 2023-06-22 20:32:49,356 INFO [train.py:996] (2/4) Epoch 8, batch 5500, loss[loss=0.2094, simple_loss=0.3098, pruned_loss=0.05446, over 21712.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3205, pruned_loss=0.08008, over 4287462.88 frames. ], batch size: 351, lr: 3.81e-03, grad_scale: 8.0 2023-06-22 20:32:51,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1313772.0, ans=0.2 2023-06-22 20:32:54,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1313772.0, ans=0.0 2023-06-22 20:33:31,686 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.855e+02 4.320e+02 6.103e+02 1.036e+03 2.497e+03, threshold=1.221e+03, percent-clipped=15.0 2023-06-22 20:33:51,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1313952.0, ans=0.2 2023-06-22 20:34:30,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1314012.0, ans=0.125 2023-06-22 20:34:40,443 INFO [train.py:996] (2/4) Epoch 8, batch 5550, loss[loss=0.2063, simple_loss=0.2916, pruned_loss=0.06055, over 20845.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3186, pruned_loss=0.07649, over 4281590.86 frames. ], batch size: 607, lr: 3.81e-03, grad_scale: 8.0 2023-06-22 20:34:45,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1314072.0, ans=0.0 2023-06-22 20:34:59,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-22 20:35:07,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1314132.0, ans=0.04949747468305833 2023-06-22 20:35:29,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1314192.0, ans=0.025 2023-06-22 20:36:01,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1314312.0, ans=0.125 2023-06-22 20:36:08,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.51 vs. 
limit=22.5 2023-06-22 20:36:18,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1314312.0, ans=0.125 2023-06-22 20:36:20,963 INFO [train.py:996] (2/4) Epoch 8, batch 5600, loss[loss=0.2948, simple_loss=0.3927, pruned_loss=0.09845, over 21603.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3177, pruned_loss=0.07407, over 4277330.66 frames. ], batch size: 441, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:36:39,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1314432.0, ans=0.0 2023-06-22 20:36:51,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1314432.0, ans=0.0 2023-06-22 20:36:55,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1314432.0, ans=0.0 2023-06-22 20:36:55,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1314432.0, ans=0.0 2023-06-22 20:36:58,094 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.586e+02 4.144e+02 6.431e+02 9.394e+02 1.823e+03, threshold=1.286e+03, percent-clipped=11.0 2023-06-22 20:37:00,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1314492.0, ans=0.125 2023-06-22 20:37:55,136 INFO [train.py:996] (2/4) Epoch 8, batch 5650, loss[loss=0.2066, simple_loss=0.3197, pruned_loss=0.04674, over 20720.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.321, pruned_loss=0.07584, over 4268376.57 frames. ], batch size: 608, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:38:22,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1314732.0, ans=0.0 2023-06-22 20:38:40,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1314792.0, ans=0.125 2023-06-22 20:39:21,142 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=22.5 2023-06-22 20:39:26,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1314912.0, ans=0.125 2023-06-22 20:39:26,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1314912.0, ans=0.125 2023-06-22 20:39:34,667 INFO [train.py:996] (2/4) Epoch 8, batch 5700, loss[loss=0.2757, simple_loss=0.3643, pruned_loss=0.09353, over 21673.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3207, pruned_loss=0.07849, over 4279301.96 frames. 
], batch size: 414, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:39:52,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1314972.0, ans=0.0 2023-06-22 20:40:12,303 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.709e+02 4.552e+02 6.267e+02 8.726e+02 1.736e+03, threshold=1.253e+03, percent-clipped=4.0 2023-06-22 20:40:30,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1315092.0, ans=0.2 2023-06-22 20:40:38,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1315092.0, ans=0.0 2023-06-22 20:41:19,534 INFO [train.py:996] (2/4) Epoch 8, batch 5750, loss[loss=0.1912, simple_loss=0.2722, pruned_loss=0.05512, over 21289.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3173, pruned_loss=0.07536, over 4280758.73 frames. ], batch size: 176, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:41:20,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1315272.0, ans=0.05 2023-06-22 20:41:28,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1315272.0, ans=0.125 2023-06-22 20:41:44,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1315332.0, ans=0.125 2023-06-22 20:42:08,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1315392.0, ans=0.0 2023-06-22 20:42:10,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1315392.0, ans=0.1 2023-06-22 20:42:29,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1315452.0, ans=0.05 2023-06-22 20:42:59,378 INFO [train.py:996] (2/4) Epoch 8, batch 5800, loss[loss=0.2211, simple_loss=0.3123, pruned_loss=0.06496, over 21676.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3173, pruned_loss=0.07407, over 4275464.76 frames. ], batch size: 247, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:43:06,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1315572.0, ans=0.125 2023-06-22 20:43:42,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.530e+02 3.870e+02 5.356e+02 7.893e+02 2.349e+03, threshold=1.071e+03, percent-clipped=9.0 2023-06-22 20:43:49,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1315692.0, ans=0.1 2023-06-22 20:43:49,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1315692.0, ans=0.0 2023-06-22 20:44:12,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1315752.0, ans=0.0 2023-06-22 20:44:24,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1315812.0, ans=0.1 2023-06-22 20:44:35,118 INFO [train.py:996] (2/4) Epoch 8, batch 5850, loss[loss=0.174, simple_loss=0.2618, pruned_loss=0.0431, over 21210.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.315, pruned_loss=0.07003, over 4275309.71 frames. 
], batch size: 159, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:44:37,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1315872.0, ans=0.0 2023-06-22 20:45:24,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1315992.0, ans=0.125 2023-06-22 20:46:01,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1316112.0, ans=0.125 2023-06-22 20:46:07,377 INFO [train.py:996] (2/4) Epoch 8, batch 5900, loss[loss=0.2307, simple_loss=0.2983, pruned_loss=0.08158, over 21278.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.3083, pruned_loss=0.06556, over 4265792.02 frames. ], batch size: 143, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:46:48,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.135e+02 3.361e+02 4.544e+02 6.338e+02 1.644e+03, threshold=9.088e+02, percent-clipped=4.0 2023-06-22 20:47:03,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1316352.0, ans=0.125 2023-06-22 20:47:19,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1316352.0, ans=0.2 2023-06-22 20:47:31,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-22 20:47:35,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1316412.0, ans=0.0 2023-06-22 20:47:42,261 INFO [train.py:996] (2/4) Epoch 8, batch 5950, loss[loss=0.2017, simple_loss=0.2678, pruned_loss=0.06781, over 21550.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.3065, pruned_loss=0.06933, over 4271603.55 frames. ], batch size: 230, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:47:51,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1316472.0, ans=0.0 2023-06-22 20:48:16,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-22 20:48:17,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1316532.0, ans=0.09899494936611666 2023-06-22 20:48:23,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1316592.0, ans=0.0 2023-06-22 20:48:52,162 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:49:19,223 INFO [train.py:996] (2/4) Epoch 8, batch 6000, loss[loss=0.2401, simple_loss=0.3061, pruned_loss=0.08706, over 15323.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3016, pruned_loss=0.07253, over 4264215.63 frames. ], batch size: 60, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:49:19,224 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 20:49:40,904 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2636, simple_loss=0.3606, pruned_loss=0.08334, over 1796401.00 frames. 
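The per-batch entries above report loss[...] for the current batch and tot_loss[...] as a running average weighted by the number of acoustic frames seen so far, which is why each tot_loss is followed by "over N frames"; the validation entry is the same kind of average taken over the full 1796401-frame dev set. A minimal sketch of that frame-weighted averaging, using illustrative names rather than the actual train.py code, is:

# Minimal sketch (illustrative only, not the actual icefall train.py code):
# keep a frame-weighted running average of the loss, the way the
# "tot_loss[... over N frames]" entries in this log are reported.

class RunningLoss:
    """Accumulates loss sums and frame counts so the average is frame-weighted."""

    def __init__(self) -> None:
        self.loss_sum = 0.0
        self.frames = 0.0

    def update(self, batch_loss_avg: float, batch_frames: float) -> None:
        # batch_loss_avg is the per-frame loss of one batch,
        # batch_frames the number of acoustic frames it covers.
        self.loss_sum += batch_loss_avg * batch_frames
        self.frames += batch_frames

    @property
    def avg(self) -> float:
        return self.loss_sum / max(self.frames, 1.0)


tot = RunningLoss()
# hypothetical per-batch values, in the spirit of the entries above
for batch_loss_avg, batch_frames in [(0.2401, 21741.0), (0.2468, 21408.0)]:
    tot.update(batch_loss_avg, batch_frames)
    print(f"loss={batch_loss_avg:.4g}, tot_loss={tot.avg:.4g}, "
          f"over {tot.frames:.2f} frames")

The same weighting explains why a single large batch (by frame count) moves tot_loss more than a small one, even when their per-frame losses are similar.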
2023-06-22 20:49:40,905 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 20:50:16,501 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.164e+02 5.189e+02 7.580e+02 1.356e+03, threshold=1.038e+03, percent-clipped=17.0 2023-06-22 20:50:19,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.85 vs. limit=8.0 2023-06-22 20:51:14,214 INFO [train.py:996] (2/4) Epoch 8, batch 6050, loss[loss=0.2121, simple_loss=0.2822, pruned_loss=0.07098, over 21424.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2962, pruned_loss=0.07362, over 4261352.17 frames. ], batch size: 509, lr: 3.81e-03, grad_scale: 32.0 2023-06-22 20:51:19,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1317072.0, ans=0.0 2023-06-22 20:51:47,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1317132.0, ans=0.2 2023-06-22 20:52:51,285 INFO [train.py:996] (2/4) Epoch 8, batch 6100, loss[loss=0.2038, simple_loss=0.3109, pruned_loss=0.0484, over 19849.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2968, pruned_loss=0.07264, over 4270141.37 frames. ], batch size: 702, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:52:51,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1317372.0, ans=0.1 2023-06-22 20:53:29,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.740e+02 3.450e+02 4.266e+02 5.544e+02 1.374e+03, threshold=8.532e+02, percent-clipped=4.0 2023-06-22 20:53:38,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1317492.0, ans=0.0 2023-06-22 20:53:47,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1317552.0, ans=0.125 2023-06-22 20:54:08,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1317612.0, ans=0.125 2023-06-22 20:54:28,759 INFO [train.py:996] (2/4) Epoch 8, batch 6150, loss[loss=0.2456, simple_loss=0.3086, pruned_loss=0.09133, over 21839.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3001, pruned_loss=0.07558, over 4283448.90 frames. ], batch size: 351, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:54:37,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=15.0 2023-06-22 20:54:38,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1317672.0, ans=0.125 2023-06-22 20:55:16,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1317792.0, ans=0.04949747468305833 2023-06-22 20:55:39,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1317912.0, ans=0.0 2023-06-22 20:56:04,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1317912.0, ans=0.0 2023-06-22 20:56:06,735 INFO [train.py:996] (2/4) Epoch 8, batch 6200, loss[loss=0.2211, simple_loss=0.2971, pruned_loss=0.07251, over 21525.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3032, pruned_loss=0.07607, over 4286989.12 frames. 
], batch size: 131, lr: 3.81e-03, grad_scale: 16.0 2023-06-22 20:56:32,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1318032.0, ans=0.2 2023-06-22 20:56:37,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-22 20:56:44,776 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.020e+02 4.357e+02 5.420e+02 8.092e+02 2.121e+03, threshold=1.084e+03, percent-clipped=22.0 2023-06-22 20:57:17,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.03 vs. limit=6.0 2023-06-22 20:57:39,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1318212.0, ans=0.0 2023-06-22 20:57:47,406 INFO [train.py:996] (2/4) Epoch 8, batch 6250, loss[loss=0.2327, simple_loss=0.334, pruned_loss=0.06566, over 21694.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3093, pruned_loss=0.07585, over 4274189.67 frames. ], batch size: 298, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 20:57:49,400 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 20:58:11,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1318332.0, ans=0.2 2023-06-22 20:58:12,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1318332.0, ans=0.125 2023-06-22 20:58:12,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1318332.0, ans=0.125 2023-06-22 20:58:21,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1318332.0, ans=0.1 2023-06-22 20:59:16,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1318512.0, ans=0.125 2023-06-22 20:59:22,289 INFO [train.py:996] (2/4) Epoch 8, batch 6300, loss[loss=0.23, simple_loss=0.3005, pruned_loss=0.07971, over 21468.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3147, pruned_loss=0.07579, over 4276496.89 frames. 
], batch size: 131, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 20:59:33,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1318572.0, ans=0.1 2023-06-22 20:59:42,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1318632.0, ans=0.0 2023-06-22 20:59:58,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1318692.0, ans=0.125 2023-06-22 20:59:59,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.614e+02 4.288e+02 6.355e+02 8.462e+02 1.476e+03, threshold=1.271e+03, percent-clipped=15.0 2023-06-22 21:00:00,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1318692.0, ans=0.0 2023-06-22 21:00:16,418 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:00:21,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1318692.0, ans=0.0 2023-06-22 21:00:33,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1318752.0, ans=0.125 2023-06-22 21:00:46,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1318812.0, ans=0.1 2023-06-22 21:01:04,420 INFO [train.py:996] (2/4) Epoch 8, batch 6350, loss[loss=0.2773, simple_loss=0.3511, pruned_loss=0.1017, over 21353.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3158, pruned_loss=0.07998, over 4278927.12 frames. ], batch size: 159, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:01:40,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-22 21:02:05,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1319052.0, ans=0.2 2023-06-22 21:02:20,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1319052.0, ans=0.1 2023-06-22 21:02:26,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1319112.0, ans=0.125 2023-06-22 21:02:43,825 INFO [train.py:996] (2/4) Epoch 8, batch 6400, loss[loss=0.2465, simple_loss=0.3126, pruned_loss=0.09018, over 21832.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.321, pruned_loss=0.08443, over 4277693.50 frames. 
], batch size: 247, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:02:44,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1319172.0, ans=0.0 2023-06-22 21:02:53,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1319172.0, ans=0.125 2023-06-22 21:03:14,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1319232.0, ans=0.1 2023-06-22 21:03:31,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.536e+02 4.404e+02 5.449e+02 7.418e+02 1.410e+03, threshold=1.090e+03, percent-clipped=1.0 2023-06-22 21:04:03,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1319352.0, ans=0.04949747468305833 2023-06-22 21:04:21,985 INFO [train.py:996] (2/4) Epoch 8, batch 6450, loss[loss=0.2121, simple_loss=0.3077, pruned_loss=0.05827, over 21679.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3226, pruned_loss=0.08276, over 4276835.09 frames. ], batch size: 247, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:04:27,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1319472.0, ans=0.1 2023-06-22 21:05:05,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1319532.0, ans=0.1 2023-06-22 21:05:22,734 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:05:48,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1319712.0, ans=0.125 2023-06-22 21:05:59,767 INFO [train.py:996] (2/4) Epoch 8, batch 6500, loss[loss=0.2766, simple_loss=0.3485, pruned_loss=0.1023, over 21378.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3173, pruned_loss=0.08133, over 4274280.21 frames. ], batch size: 507, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:06:02,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.44 vs. limit=22.5 2023-06-22 21:06:14,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1319772.0, ans=0.125 2023-06-22 21:06:17,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1319772.0, ans=0.2 2023-06-22 21:06:45,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1319892.0, ans=0.125 2023-06-22 21:06:48,026 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.971e+02 4.929e+02 6.597e+02 9.364e+02 1.745e+03, threshold=1.319e+03, percent-clipped=16.0 2023-06-22 21:06:54,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1319892.0, ans=0.0 2023-06-22 21:07:18,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1319952.0, ans=0.125 2023-06-22 21:07:31,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.04 vs. 
limit=12.0 2023-06-22 21:07:44,863 INFO [train.py:996] (2/4) Epoch 8, batch 6550, loss[loss=0.2271, simple_loss=0.3019, pruned_loss=0.0761, over 21868.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3167, pruned_loss=0.08021, over 4272553.75 frames. ], batch size: 351, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:08:48,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-22 21:09:06,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1320312.0, ans=0.125 2023-06-22 21:09:17,827 INFO [train.py:996] (2/4) Epoch 8, batch 6600, loss[loss=0.1886, simple_loss=0.2533, pruned_loss=0.06195, over 21794.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3105, pruned_loss=0.08005, over 4269704.76 frames. ], batch size: 283, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:09:48,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1320432.0, ans=0.125 2023-06-22 21:09:57,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.721e+02 4.019e+02 5.750e+02 8.785e+02 1.668e+03, threshold=1.150e+03, percent-clipped=5.0 2023-06-22 21:10:10,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1320492.0, ans=0.2 2023-06-22 21:10:17,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1320552.0, ans=0.0 2023-06-22 21:10:55,799 INFO [train.py:996] (2/4) Epoch 8, batch 6650, loss[loss=0.1829, simple_loss=0.2573, pruned_loss=0.05421, over 21535.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3024, pruned_loss=0.07786, over 4267204.65 frames. ], batch size: 230, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:11:02,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1320672.0, ans=0.0 2023-06-22 21:11:19,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1320732.0, ans=0.0 2023-06-22 21:12:29,155 INFO [train.py:996] (2/4) Epoch 8, batch 6700, loss[loss=0.2022, simple_loss=0.2484, pruned_loss=0.07797, over 20784.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2958, pruned_loss=0.0772, over 4267733.08 frames. ], batch size: 608, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:12:53,603 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-22 21:13:08,237 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.668e+02 3.822e+02 4.568e+02 6.367e+02 1.164e+03, threshold=9.137e+02, percent-clipped=1.0 2023-06-22 21:13:27,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1321152.0, ans=0.0 2023-06-22 21:13:46,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1321212.0, ans=0.125 2023-06-22 21:14:02,857 INFO [train.py:996] (2/4) Epoch 8, batch 6750, loss[loss=0.2146, simple_loss=0.29, pruned_loss=0.06963, over 16250.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2945, pruned_loss=0.07748, over 4259691.79 frames. 
], batch size: 61, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:14:04,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1321272.0, ans=0.0 2023-06-22 21:14:29,958 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:15:37,244 INFO [train.py:996] (2/4) Epoch 8, batch 6800, loss[loss=0.2013, simple_loss=0.262, pruned_loss=0.07033, over 21375.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2959, pruned_loss=0.07948, over 4271042.22 frames. ], batch size: 194, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:15:40,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1321572.0, ans=0.1 2023-06-22 21:16:09,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1321632.0, ans=0.125 2023-06-22 21:16:13,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1321692.0, ans=0.2 2023-06-22 21:16:16,529 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.952e+02 4.656e+02 6.234e+02 8.844e+02 1.935e+03, threshold=1.247e+03, percent-clipped=22.0 2023-06-22 21:16:29,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1321692.0, ans=0.0 2023-06-22 21:16:30,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.27 vs. limit=12.0 2023-06-22 21:16:35,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1321752.0, ans=0.0 2023-06-22 21:16:49,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1321812.0, ans=0.1 2023-06-22 21:16:57,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=22.5 2023-06-22 21:17:04,213 INFO [train.py:996] (2/4) Epoch 8, batch 6850, loss[loss=0.2284, simple_loss=0.2889, pruned_loss=0.08396, over 21667.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2951, pruned_loss=0.08087, over 4279045.29 frames. ], batch size: 441, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:17:27,705 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.77 vs. limit=15.0 2023-06-22 21:17:30,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1321932.0, ans=0.1 2023-06-22 21:17:44,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. 
limit=6.0 2023-06-22 21:18:11,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1322052.0, ans=0.0 2023-06-22 21:18:21,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1322112.0, ans=0.125 2023-06-22 21:18:21,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1322112.0, ans=0.125 2023-06-22 21:18:48,719 INFO [train.py:996] (2/4) Epoch 8, batch 6900, loss[loss=0.2638, simple_loss=0.323, pruned_loss=0.1022, over 21790.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2972, pruned_loss=0.08112, over 4286553.35 frames. ], batch size: 391, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:19:34,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.637e+02 4.516e+02 6.241e+02 9.290e+02 1.863e+03, threshold=1.248e+03, percent-clipped=14.0 2023-06-22 21:20:02,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1322412.0, ans=0.125 2023-06-22 21:20:29,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.58 vs. limit=15.0 2023-06-22 21:20:33,616 INFO [train.py:996] (2/4) Epoch 8, batch 6950, loss[loss=0.2688, simple_loss=0.3384, pruned_loss=0.09958, over 21296.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2994, pruned_loss=0.07794, over 4275791.94 frames. ], batch size: 159, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:20:40,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1322472.0, ans=0.1 2023-06-22 21:20:48,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1322532.0, ans=0.125 2023-06-22 21:21:05,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.67 vs. limit=10.0 2023-06-22 21:21:27,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-22 21:21:28,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1322652.0, ans=0.2 2023-06-22 21:22:12,754 INFO [train.py:996] (2/4) Epoch 8, batch 7000, loss[loss=0.2028, simple_loss=0.2645, pruned_loss=0.07052, over 21604.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3025, pruned_loss=0.08054, over 4281019.96 frames. 
], batch size: 247, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:22:50,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1322892.0, ans=0.125 2023-06-22 21:22:51,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1322892.0, ans=0.125 2023-06-22 21:22:54,259 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.125e+02 4.702e+02 7.118e+02 1.401e+03, threshold=9.403e+02, percent-clipped=1.0 2023-06-22 21:23:06,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1322952.0, ans=0.2 2023-06-22 21:23:42,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1323012.0, ans=0.125 2023-06-22 21:23:51,475 INFO [train.py:996] (2/4) Epoch 8, batch 7050, loss[loss=0.2026, simple_loss=0.2946, pruned_loss=0.05532, over 21744.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.2993, pruned_loss=0.07885, over 4281996.71 frames. ], batch size: 351, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:24:17,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1323132.0, ans=0.1 2023-06-22 21:24:49,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1323252.0, ans=0.125 2023-06-22 21:24:51,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1323252.0, ans=0.125 2023-06-22 21:25:31,388 INFO [train.py:996] (2/4) Epoch 8, batch 7100, loss[loss=0.2413, simple_loss=0.3232, pruned_loss=0.07964, over 21584.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.304, pruned_loss=0.07941, over 4275309.63 frames. ], batch size: 389, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:26:12,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.549e+02 4.059e+02 5.159e+02 6.413e+02 1.166e+03, threshold=1.032e+03, percent-clipped=5.0 2023-06-22 21:27:15,334 INFO [train.py:996] (2/4) Epoch 8, batch 7150, loss[loss=0.2704, simple_loss=0.33, pruned_loss=0.1054, over 21408.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3019, pruned_loss=0.07715, over 4266209.95 frames. ], batch size: 194, lr: 3.80e-03, grad_scale: 16.0 2023-06-22 21:27:19,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1323672.0, ans=0.0 2023-06-22 21:27:34,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1323732.0, ans=0.125 2023-06-22 21:27:41,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1323732.0, ans=0.125 2023-06-22 21:28:08,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-22 21:28:24,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1323852.0, ans=0.0 2023-06-22 21:28:54,619 INFO [train.py:996] (2/4) Epoch 8, batch 7200, loss[loss=0.2406, simple_loss=0.3018, pruned_loss=0.08967, over 21816.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3045, pruned_loss=0.07949, over 4268858.01 frames. 
], batch size: 352, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:28:56,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1323972.0, ans=0.1 2023-06-22 21:29:00,006 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=22.5 2023-06-22 21:29:04,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1323972.0, ans=0.5 2023-06-22 21:29:07,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1323972.0, ans=0.125 2023-06-22 21:29:13,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.23 vs. limit=22.5 2023-06-22 21:29:17,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=12.0 2023-06-22 21:29:35,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.797e+02 4.643e+02 6.367e+02 8.701e+02 1.653e+03, threshold=1.273e+03, percent-clipped=12.0 2023-06-22 21:30:01,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1324152.0, ans=0.125 2023-06-22 21:30:08,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1324212.0, ans=0.125 2023-06-22 21:30:27,750 INFO [train.py:996] (2/4) Epoch 8, batch 7250, loss[loss=0.2161, simple_loss=0.2794, pruned_loss=0.0764, over 21829.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3001, pruned_loss=0.0792, over 4270679.63 frames. ], batch size: 352, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:31:11,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1324392.0, ans=0.0 2023-06-22 21:31:39,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1324452.0, ans=0.2 2023-06-22 21:31:39,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1324452.0, ans=0.125 2023-06-22 21:31:45,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1324512.0, ans=0.125 2023-06-22 21:32:03,418 INFO [train.py:996] (2/4) Epoch 8, batch 7300, loss[loss=0.2136, simple_loss=0.2723, pruned_loss=0.07747, over 21635.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2959, pruned_loss=0.07863, over 4263426.21 frames. 
], batch size: 264, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:32:22,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1324632.0, ans=0.125 2023-06-22 21:32:49,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.041e+02 5.063e+02 7.441e+02 1.428e+03, threshold=1.013e+03, percent-clipped=2.0 2023-06-22 21:32:51,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1324692.0, ans=0.125 2023-06-22 21:32:51,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1324692.0, ans=0.05 2023-06-22 21:33:42,996 INFO [train.py:996] (2/4) Epoch 8, batch 7350, loss[loss=0.2807, simple_loss=0.3387, pruned_loss=0.1114, over 21577.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.295, pruned_loss=0.07965, over 4264666.38 frames. ], batch size: 415, lr: 3.80e-03, grad_scale: 32.0 2023-06-22 21:34:06,581 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=15.0 2023-06-22 21:34:35,871 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-22 21:34:44,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-22 21:34:50,394 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:35:08,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1325112.0, ans=0.125 2023-06-22 21:35:08,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1325112.0, ans=0.1 2023-06-22 21:35:21,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1325112.0, ans=0.1 2023-06-22 21:35:24,111 INFO [train.py:996] (2/4) Epoch 8, batch 7400, loss[loss=0.2476, simple_loss=0.3185, pruned_loss=0.08833, over 21750.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3013, pruned_loss=0.08232, over 4265852.80 frames. ], batch size: 441, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:36:01,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1325232.0, ans=0.0 2023-06-22 21:36:10,334 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.225e+02 4.086e+02 4.943e+02 6.567e+02 1.302e+03, threshold=9.886e+02, percent-clipped=5.0 2023-06-22 21:36:14,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.53 vs. limit=22.5 2023-06-22 21:36:29,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1325352.0, ans=0.2 2023-06-22 21:36:29,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1325352.0, ans=0.2 2023-06-22 21:36:58,151 INFO [train.py:996] (2/4) Epoch 8, batch 7450, loss[loss=0.2868, simple_loss=0.3307, pruned_loss=0.1215, over 21427.00 frames. 
], tot_loss[loss=0.2308, simple_loss=0.2994, pruned_loss=0.08105, over 4270127.89 frames. ], batch size: 509, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:37:16,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1325472.0, ans=0.1 2023-06-22 21:38:11,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1325652.0, ans=0.0 2023-06-22 21:38:19,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1325652.0, ans=0.125 2023-06-22 21:38:37,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1325772.0, ans=0.1 2023-06-22 21:38:38,903 INFO [train.py:996] (2/4) Epoch 8, batch 7500, loss[loss=0.3266, simple_loss=0.4131, pruned_loss=0.1201, over 21482.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3043, pruned_loss=0.08339, over 4265922.12 frames. ], batch size: 471, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:39:35,925 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.043e+02 4.495e+02 6.774e+02 8.927e+02 1.705e+03, threshold=1.355e+03, percent-clipped=18.0 2023-06-22 21:39:42,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1325892.0, ans=0.0 2023-06-22 21:40:02,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1326012.0, ans=0.125 2023-06-22 21:40:23,991 INFO [train.py:996] (2/4) Epoch 8, batch 7550, loss[loss=0.2536, simple_loss=0.3486, pruned_loss=0.07928, over 21624.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3137, pruned_loss=0.08347, over 4273179.07 frames. ], batch size: 414, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:41:07,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1326132.0, ans=0.125 2023-06-22 21:41:08,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1326192.0, ans=0.125 2023-06-22 21:41:24,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1326252.0, ans=0.125 2023-06-22 21:41:32,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1326252.0, ans=0.04949747468305833 2023-06-22 21:41:51,526 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.08 vs. limit=15.0 2023-06-22 21:41:52,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1326312.0, ans=0.125 2023-06-22 21:42:00,776 INFO [train.py:996] (2/4) Epoch 8, batch 7600, loss[loss=0.2748, simple_loss=0.3313, pruned_loss=0.1092, over 21647.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3151, pruned_loss=0.08256, over 4273950.58 frames. 
], batch size: 471, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:42:12,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1326372.0, ans=0.125 2023-06-22 21:42:22,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1326372.0, ans=0.125 2023-06-22 21:42:44,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.63 vs. limit=15.0 2023-06-22 21:42:49,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1326492.0, ans=0.04949747468305833 2023-06-22 21:42:50,583 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.183e+02 4.607e+02 6.581e+02 9.726e+02 1.530e+03, threshold=1.316e+03, percent-clipped=7.0 2023-06-22 21:43:03,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1326552.0, ans=0.0 2023-06-22 21:43:24,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1326612.0, ans=0.1 2023-06-22 21:43:38,606 INFO [train.py:996] (2/4) Epoch 8, batch 7650, loss[loss=0.2408, simple_loss=0.3082, pruned_loss=0.08668, over 21295.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3143, pruned_loss=0.08375, over 4281419.13 frames. ], batch size: 143, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:44:44,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1326852.0, ans=0.0 2023-06-22 21:44:49,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1326852.0, ans=0.0 2023-06-22 21:44:59,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1326912.0, ans=0.125 2023-06-22 21:45:14,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-22 21:45:18,099 INFO [train.py:996] (2/4) Epoch 8, batch 7700, loss[loss=0.2588, simple_loss=0.3328, pruned_loss=0.09243, over 21618.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3169, pruned_loss=0.08701, over 4288271.64 frames. ], batch size: 230, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:46:04,650 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.953e+02 3.847e+02 4.781e+02 6.112e+02 1.345e+03, threshold=9.563e+02, percent-clipped=1.0 2023-06-22 21:46:10,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=12.0 2023-06-22 21:46:37,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1327212.0, ans=0.1 2023-06-22 21:46:38,743 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-06-22 21:46:46,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.46 vs. 
limit=15.0 2023-06-22 21:46:54,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1327212.0, ans=0.1 2023-06-22 21:46:58,538 INFO [train.py:996] (2/4) Epoch 8, batch 7750, loss[loss=0.2832, simple_loss=0.3826, pruned_loss=0.09193, over 21876.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3224, pruned_loss=0.08757, over 4281277.77 frames. ], batch size: 372, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:47:02,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1327272.0, ans=0.2 2023-06-22 21:47:13,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1327272.0, ans=0.0 2023-06-22 21:47:46,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1327392.0, ans=10.0 2023-06-22 21:48:02,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1327452.0, ans=0.0 2023-06-22 21:48:10,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1327452.0, ans=0.0 2023-06-22 21:48:42,259 INFO [train.py:996] (2/4) Epoch 8, batch 7800, loss[loss=0.2327, simple_loss=0.3077, pruned_loss=0.07882, over 21753.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3233, pruned_loss=0.08762, over 4272527.03 frames. ], batch size: 332, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:48:58,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.94 vs. limit=15.0 2023-06-22 21:49:19,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.166e+02 4.676e+02 6.316e+02 9.000e+02 2.015e+03, threshold=1.263e+03, percent-clipped=20.0 2023-06-22 21:49:37,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1327752.0, ans=0.2 2023-06-22 21:49:54,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1327812.0, ans=0.125 2023-06-22 21:50:14,986 INFO [train.py:996] (2/4) Epoch 8, batch 7850, loss[loss=0.2458, simple_loss=0.3068, pruned_loss=0.09237, over 21448.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.317, pruned_loss=0.0863, over 4271338.89 frames. ], batch size: 389, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:50:23,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1327872.0, ans=0.1 2023-06-22 21:51:01,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1327992.0, ans=0.0 2023-06-22 21:51:10,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1328052.0, ans=0.0 2023-06-22 21:51:28,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=15.0 2023-06-22 21:52:00,782 INFO [train.py:996] (2/4) Epoch 8, batch 7900, loss[loss=0.2434, simple_loss=0.334, pruned_loss=0.07634, over 21752.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3105, pruned_loss=0.08466, over 4259175.29 frames. 
], batch size: 332, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:52:02,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1328172.0, ans=0.1 2023-06-22 21:52:06,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.29 vs. limit=22.5 2023-06-22 21:52:37,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-22 21:52:44,137 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.39 vs. limit=10.0 2023-06-22 21:52:44,745 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 3.830e+02 4.891e+02 7.312e+02 1.897e+03, threshold=9.781e+02, percent-clipped=5.0 2023-06-22 21:53:41,617 INFO [train.py:996] (2/4) Epoch 8, batch 7950, loss[loss=0.1816, simple_loss=0.2358, pruned_loss=0.06373, over 20710.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3153, pruned_loss=0.08417, over 4262902.71 frames. ], batch size: 609, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:54:09,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1328532.0, ans=0.125 2023-06-22 21:54:32,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1328592.0, ans=0.125 2023-06-22 21:55:13,960 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.39 vs. limit=12.0 2023-06-22 21:55:19,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1328712.0, ans=0.125 2023-06-22 21:55:24,221 INFO [train.py:996] (2/4) Epoch 8, batch 8000, loss[loss=0.2931, simple_loss=0.3772, pruned_loss=0.1045, over 21467.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3181, pruned_loss=0.08631, over 4261630.96 frames. ], batch size: 471, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 21:55:28,416 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 21:56:25,230 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.345e+02 4.453e+02 5.766e+02 9.258e+02 3.143e+03, threshold=1.153e+03, percent-clipped=22.0 2023-06-22 21:56:25,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1328892.0, ans=0.125 2023-06-22 21:57:10,673 INFO [train.py:996] (2/4) Epoch 8, batch 8050, loss[loss=0.2376, simple_loss=0.292, pruned_loss=0.09155, over 21082.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3206, pruned_loss=0.08655, over 4264861.19 frames. 
], batch size: 159, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:57:37,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1329132.0, ans=0.09899494936611666 2023-06-22 21:58:09,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1329192.0, ans=0.125 2023-06-22 21:58:32,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1329312.0, ans=0.125 2023-06-22 21:58:40,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1329312.0, ans=0.1 2023-06-22 21:58:48,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-22 21:58:56,232 INFO [train.py:996] (2/4) Epoch 8, batch 8100, loss[loss=0.2472, simple_loss=0.3125, pruned_loss=0.09089, over 21906.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3206, pruned_loss=0.08613, over 4267316.04 frames. ], batch size: 316, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 21:59:47,641 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.065e+02 4.714e+02 7.153e+02 1.179e+03 2.402e+03, threshold=1.431e+03, percent-clipped=27.0 2023-06-22 22:00:24,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1329612.0, ans=0.0 2023-06-22 22:00:29,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=14.68 vs. limit=22.5 2023-06-22 22:00:43,217 INFO [train.py:996] (2/4) Epoch 8, batch 8150, loss[loss=0.2908, simple_loss=0.3971, pruned_loss=0.09222, over 21567.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3293, pruned_loss=0.08749, over 4268260.23 frames. ], batch size: 441, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:01:07,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1329732.0, ans=0.125 2023-06-22 22:01:14,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1329792.0, ans=0.1 2023-06-22 22:01:27,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=22.5 2023-06-22 22:01:52,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1329852.0, ans=0.1 2023-06-22 22:02:22,142 INFO [train.py:996] (2/4) Epoch 8, batch 8200, loss[loss=0.1803, simple_loss=0.2385, pruned_loss=0.06105, over 16476.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3202, pruned_loss=0.08466, over 4262379.18 frames. ], batch size: 63, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:03:02,694 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 4.983e+02 6.789e+02 1.065e+03 2.564e+03, threshold=1.358e+03, percent-clipped=14.0 2023-06-22 22:03:54,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1330212.0, ans=0.025 2023-06-22 22:04:02,061 INFO [train.py:996] (2/4) Epoch 8, batch 8250, loss[loss=0.2405, simple_loss=0.327, pruned_loss=0.07696, over 21579.00 frames. 
], tot_loss[loss=0.2445, simple_loss=0.3197, pruned_loss=0.08465, over 4269438.61 frames. ], batch size: 230, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:04:05,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1330272.0, ans=0.0 2023-06-22 22:04:08,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1330272.0, ans=0.0 2023-06-22 22:04:17,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.66 vs. limit=15.0 2023-06-22 22:04:23,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1330332.0, ans=0.0 2023-06-22 22:04:51,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1330392.0, ans=0.125 2023-06-22 22:05:02,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.53 vs. limit=15.0 2023-06-22 22:05:27,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1330512.0, ans=0.125 2023-06-22 22:05:30,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1330512.0, ans=0.1 2023-06-22 22:05:41,299 INFO [train.py:996] (2/4) Epoch 8, batch 8300, loss[loss=0.2056, simple_loss=0.286, pruned_loss=0.06256, over 21634.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3176, pruned_loss=0.08237, over 4270052.24 frames. ], batch size: 230, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:05:56,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1330632.0, ans=0.125 2023-06-22 22:06:20,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1330692.0, ans=0.125 2023-06-22 22:06:27,261 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.942e+02 4.371e+02 5.179e+02 7.811e+02 1.980e+03, threshold=1.036e+03, percent-clipped=4.0 2023-06-22 22:06:36,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1330752.0, ans=0.07 2023-06-22 22:06:53,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1330752.0, ans=0.015 2023-06-22 22:07:10,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1330812.0, ans=0.125 2023-06-22 22:07:17,069 INFO [train.py:996] (2/4) Epoch 8, batch 8350, loss[loss=0.313, simple_loss=0.4335, pruned_loss=0.09627, over 20766.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3187, pruned_loss=0.08161, over 4267823.45 frames. 
], batch size: 607, lr: 3.79e-03, grad_scale: 16.0 2023-06-22 22:07:19,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1330872.0, ans=0.035 2023-06-22 22:07:27,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1330872.0, ans=0.125 2023-06-22 22:07:44,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1330932.0, ans=0.125 2023-06-22 22:07:58,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1330992.0, ans=0.2 2023-06-22 22:08:29,522 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.61 vs. limit=15.0 2023-06-22 22:08:41,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1331112.0, ans=0.0 2023-06-22 22:08:56,591 INFO [train.py:996] (2/4) Epoch 8, batch 8400, loss[loss=0.1878, simple_loss=0.2861, pruned_loss=0.04473, over 21708.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3158, pruned_loss=0.0789, over 4266186.09 frames. ], batch size: 332, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:09:35,091 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.661e+02 3.911e+02 4.503e+02 6.126e+02 1.860e+03, threshold=9.006e+02, percent-clipped=8.0 2023-06-22 22:09:56,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1331352.0, ans=0.0 2023-06-22 22:10:01,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1331352.0, ans=0.125 2023-06-22 22:10:34,265 INFO [train.py:996] (2/4) Epoch 8, batch 8450, loss[loss=0.2474, simple_loss=0.3133, pruned_loss=0.0907, over 21833.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.311, pruned_loss=0.07779, over 4266642.75 frames. ], batch size: 124, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:10:55,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1331532.0, ans=0.125 2023-06-22 22:11:18,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=15.0 2023-06-22 22:12:12,531 INFO [train.py:996] (2/4) Epoch 8, batch 8500, loss[loss=0.2156, simple_loss=0.2774, pruned_loss=0.07689, over 21656.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3073, pruned_loss=0.0793, over 4266091.98 frames. ], batch size: 247, lr: 3.79e-03, grad_scale: 32.0 2023-06-22 22:12:16,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1331772.0, ans=0.125 2023-06-22 22:12:20,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. 
limit=22.5 2023-06-22 22:12:55,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1331892.0, ans=0.1 2023-06-22 22:12:58,440 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.125e+02 4.200e+02 5.679e+02 8.112e+02 1.673e+03, threshold=1.136e+03, percent-clipped=13.0 2023-06-22 22:13:54,925 INFO [train.py:996] (2/4) Epoch 8, batch 8550, loss[loss=0.2939, simple_loss=0.3789, pruned_loss=0.1044, over 21849.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.312, pruned_loss=0.08132, over 4256590.00 frames. ], batch size: 371, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:14:00,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1332072.0, ans=0.5 2023-06-22 22:14:03,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1332072.0, ans=0.1 2023-06-22 22:14:26,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1332132.0, ans=0.125 2023-06-22 22:14:48,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1332192.0, ans=0.125 2023-06-22 22:15:35,664 INFO [train.py:996] (2/4) Epoch 8, batch 8600, loss[loss=0.2595, simple_loss=0.3351, pruned_loss=0.09196, over 21568.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3181, pruned_loss=0.08316, over 4258295.52 frames. ], batch size: 389, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:15:37,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1332372.0, ans=0.1 2023-06-22 22:15:52,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1332432.0, ans=0.0 2023-06-22 22:16:32,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.201e+02 4.143e+02 4.841e+02 5.659e+02 1.807e+03, threshold=9.683e+02, percent-clipped=7.0 2023-06-22 22:16:59,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.24 vs. limit=10.0 2023-06-22 22:16:59,475 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.83 vs. limit=15.0 2023-06-22 22:17:03,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1332612.0, ans=0.125 2023-06-22 22:17:15,218 INFO [train.py:996] (2/4) Epoch 8, batch 8650, loss[loss=0.2022, simple_loss=0.2806, pruned_loss=0.06188, over 21845.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3231, pruned_loss=0.08291, over 4268676.97 frames. ], batch size: 107, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:17:18,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1332672.0, ans=0.0 2023-06-22 22:18:07,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. 
limit=15.0 2023-06-22 22:18:41,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1332912.0, ans=0.0 2023-06-22 22:18:51,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1332912.0, ans=0.125 2023-06-22 22:18:53,769 INFO [train.py:996] (2/4) Epoch 8, batch 8700, loss[loss=0.2549, simple_loss=0.3784, pruned_loss=0.06569, over 19921.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3161, pruned_loss=0.07972, over 4257919.46 frames. ], batch size: 702, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:19:05,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1332972.0, ans=0.2 2023-06-22 22:19:07,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1333032.0, ans=0.125 2023-06-22 22:19:47,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1333092.0, ans=0.125 2023-06-22 22:19:48,712 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.793e+02 4.063e+02 5.741e+02 9.934e+02 1.995e+03, threshold=1.148e+03, percent-clipped=26.0 2023-06-22 22:19:50,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1333092.0, ans=0.1 2023-06-22 22:19:53,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-22 22:20:19,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1333212.0, ans=0.1 2023-06-22 22:20:21,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-22 22:20:32,229 INFO [train.py:996] (2/4) Epoch 8, batch 8750, loss[loss=0.236, simple_loss=0.3002, pruned_loss=0.08589, over 21989.00 frames. ], tot_loss[loss=0.238, simple_loss=0.314, pruned_loss=0.08097, over 4264900.60 frames. ], batch size: 103, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:21:27,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1333392.0, ans=0.07 2023-06-22 22:21:48,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.26 vs. limit=22.5 2023-06-22 22:21:52,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1333452.0, ans=0.125 2023-06-22 22:21:58,384 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=22.5 2023-06-22 22:22:16,516 INFO [train.py:996] (2/4) Epoch 8, batch 8800, loss[loss=0.2938, simple_loss=0.3631, pruned_loss=0.1122, over 21282.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3246, pruned_loss=0.08504, over 4268384.80 frames. 
], batch size: 143, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:22:33,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1333632.0, ans=0.1 2023-06-22 22:22:35,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1333632.0, ans=0.125 2023-06-22 22:22:52,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1333632.0, ans=0.125 2023-06-22 22:22:57,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1333692.0, ans=0.2 2023-06-22 22:23:07,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.331e+02 4.564e+02 6.230e+02 9.935e+02 2.348e+03, threshold=1.246e+03, percent-clipped=15.0 2023-06-22 22:23:13,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1333752.0, ans=0.125 2023-06-22 22:23:31,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1333812.0, ans=0.125 2023-06-22 22:23:49,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1333872.0, ans=0.02 2023-06-22 22:23:49,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1333872.0, ans=0.125 2023-06-22 22:23:50,302 INFO [train.py:996] (2/4) Epoch 8, batch 8850, loss[loss=0.2558, simple_loss=0.3245, pruned_loss=0.09358, over 21556.00 frames. ], tot_loss[loss=0.251, simple_loss=0.329, pruned_loss=0.08649, over 4266540.46 frames. ], batch size: 263, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:24:39,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1333992.0, ans=0.125 2023-06-22 22:24:44,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1333992.0, ans=0.125 2023-06-22 22:24:44,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1333992.0, ans=0.125 2023-06-22 22:25:26,317 INFO [train.py:996] (2/4) Epoch 8, batch 8900, loss[loss=0.2312, simple_loss=0.3371, pruned_loss=0.06267, over 21194.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3223, pruned_loss=0.08533, over 4264859.51 frames. ], batch size: 549, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:25:57,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.28 vs. limit=6.0 2023-06-22 22:26:20,774 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.244e+02 4.551e+02 5.330e+02 7.944e+02 2.391e+03, threshold=1.066e+03, percent-clipped=3.0 2023-06-22 22:26:21,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.90 vs. limit=6.0 2023-06-22 22:27:00,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1334412.0, ans=0.2 2023-06-22 22:27:11,651 INFO [train.py:996] (2/4) Epoch 8, batch 8950, loss[loss=0.275, simple_loss=0.3468, pruned_loss=0.1016, over 21595.00 frames. 
], tot_loss[loss=0.2476, simple_loss=0.3257, pruned_loss=0.08482, over 4262454.83 frames. ], batch size: 414, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:27:25,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1334472.0, ans=0.5 2023-06-22 22:27:39,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.96 vs. limit=15.0 2023-06-22 22:27:49,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1334592.0, ans=0.2 2023-06-22 22:28:10,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.85 vs. limit=10.0 2023-06-22 22:28:16,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1334652.0, ans=0.1 2023-06-22 22:28:19,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1334652.0, ans=0.1 2023-06-22 22:28:41,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1334712.0, ans=0.125 2023-06-22 22:28:50,967 INFO [train.py:996] (2/4) Epoch 8, batch 9000, loss[loss=0.2212, simple_loss=0.3005, pruned_loss=0.07092, over 21782.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3192, pruned_loss=0.08392, over 4262931.53 frames. ], batch size: 317, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:28:50,968 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-22 22:29:12,150 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2658, simple_loss=0.3603, pruned_loss=0.0856, over 1796401.00 frames. 2023-06-22 22:29:12,151 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-22 22:29:42,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-22 22:29:55,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1334892.0, ans=0.0 2023-06-22 22:29:57,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1334892.0, ans=0.125 2023-06-22 22:29:57,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1334892.0, ans=0.125 2023-06-22 22:29:59,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1334892.0, ans=0.1 2023-06-22 22:30:00,387 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.733e+02 4.159e+02 6.404e+02 9.275e+02 1.956e+03, threshold=1.281e+03, percent-clipped=15.0 2023-06-22 22:30:07,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1334952.0, ans=0.125 2023-06-22 22:30:37,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1335012.0, ans=0.1 2023-06-22 22:30:51,404 INFO [train.py:996] (2/4) Epoch 8, batch 9050, loss[loss=0.2916, simple_loss=0.3683, pruned_loss=0.1074, over 21821.00 frames. 
], tot_loss[loss=0.2389, simple_loss=0.314, pruned_loss=0.08187, over 4269723.92 frames. ], batch size: 118, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:30:55,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=15.0 2023-06-22 22:30:56,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1335072.0, ans=0.1 2023-06-22 22:31:27,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1335192.0, ans=0.0 2023-06-22 22:32:28,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1335312.0, ans=0.1 2023-06-22 22:32:28,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1335312.0, ans=0.09899494936611666 2023-06-22 22:32:34,000 INFO [train.py:996] (2/4) Epoch 8, batch 9100, loss[loss=0.3217, simple_loss=0.3734, pruned_loss=0.135, over 21363.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3198, pruned_loss=0.08415, over 4268117.90 frames. ], batch size: 507, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:32:37,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1335372.0, ans=0.125 2023-06-22 22:32:47,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1335372.0, ans=0.2 2023-06-22 22:33:13,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1335492.0, ans=0.125 2023-06-22 22:33:17,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1335492.0, ans=0.0 2023-06-22 22:33:24,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1335492.0, ans=0.125 2023-06-22 22:33:32,279 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.829e+02 4.374e+02 5.511e+02 8.272e+02 1.713e+03, threshold=1.102e+03, percent-clipped=4.0 2023-06-22 22:33:55,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-22 22:34:15,746 INFO [train.py:996] (2/4) Epoch 8, batch 9150, loss[loss=0.2265, simple_loss=0.308, pruned_loss=0.07244, over 21238.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3234, pruned_loss=0.08191, over 4271883.38 frames. ], batch size: 176, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:34:36,543 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. 
limit=12.0 2023-06-22 22:34:39,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1335732.0, ans=0.2 2023-06-22 22:34:40,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1335732.0, ans=0.0 2023-06-22 22:34:45,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1335732.0, ans=0.1 2023-06-22 22:34:58,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1335792.0, ans=0.07 2023-06-22 22:35:18,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1335792.0, ans=0.125 2023-06-22 22:35:20,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1335792.0, ans=0.0 2023-06-22 22:35:24,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1335852.0, ans=0.125 2023-06-22 22:35:26,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1335852.0, ans=0.0 2023-06-22 22:35:50,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1335912.0, ans=0.04949747468305833 2023-06-22 22:36:00,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.83 vs. limit=15.0 2023-06-22 22:36:01,117 INFO [train.py:996] (2/4) Epoch 8, batch 9200, loss[loss=0.2602, simple_loss=0.3303, pruned_loss=0.09501, over 21302.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3252, pruned_loss=0.08095, over 4268129.33 frames. ], batch size: 159, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:36:01,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1335972.0, ans=0.125 2023-06-22 22:36:11,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1335972.0, ans=0.05 2023-06-22 22:36:16,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1336032.0, ans=0.95 2023-06-22 22:36:59,568 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.939e+02 4.377e+02 5.436e+02 8.538e+02 1.737e+03, threshold=1.087e+03, percent-clipped=12.0 2023-06-22 22:37:25,996 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-22 22:37:40,978 INFO [train.py:996] (2/4) Epoch 8, batch 9250, loss[loss=0.2268, simple_loss=0.3443, pruned_loss=0.05469, over 19845.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3262, pruned_loss=0.08352, over 4274654.42 frames. 
], batch size: 702, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:37:49,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1336272.0, ans=0.125 2023-06-22 22:37:54,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1336272.0, ans=0.1 2023-06-22 22:39:16,204 INFO [train.py:996] (2/4) Epoch 8, batch 9300, loss[loss=0.2472, simple_loss=0.3421, pruned_loss=0.07617, over 21719.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.321, pruned_loss=0.08384, over 4264636.88 frames. ], batch size: 332, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:40:00,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1336692.0, ans=0.125 2023-06-22 22:40:11,042 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.867e+02 5.198e+02 7.448e+02 1.175e+03 2.635e+03, threshold=1.490e+03, percent-clipped=31.0 2023-06-22 22:40:19,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1336752.0, ans=0.125 2023-06-22 22:40:32,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1336752.0, ans=0.035 2023-06-22 22:40:41,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-22 22:40:51,247 INFO [train.py:996] (2/4) Epoch 8, batch 9350, loss[loss=0.3127, simple_loss=0.3771, pruned_loss=0.1241, over 21786.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3276, pruned_loss=0.08556, over 4274797.02 frames. ], batch size: 441, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:41:35,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-22 22:42:14,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1337112.0, ans=0.125 2023-06-22 22:42:36,753 INFO [train.py:996] (2/4) Epoch 8, batch 9400, loss[loss=0.1939, simple_loss=0.2603, pruned_loss=0.06372, over 21337.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3288, pruned_loss=0.0855, over 4272952.73 frames. ], batch size: 194, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:42:38,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1337172.0, ans=0.1 2023-06-22 22:43:06,569 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.74 vs. limit=15.0 2023-06-22 22:43:20,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1337292.0, ans=0.05 2023-06-22 22:43:32,551 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.205e+02 4.546e+02 6.111e+02 8.751e+02 2.078e+03, threshold=1.222e+03, percent-clipped=3.0 2023-06-22 22:43:37,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1337352.0, ans=0.0 2023-06-22 22:43:41,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.88 vs. 
limit=15.0 2023-06-22 22:43:58,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1337412.0, ans=0.0 2023-06-22 22:44:16,690 INFO [train.py:996] (2/4) Epoch 8, batch 9450, loss[loss=0.2284, simple_loss=0.2841, pruned_loss=0.08641, over 21208.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3183, pruned_loss=0.08444, over 4265726.85 frames. ], batch size: 471, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:44:18,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1337472.0, ans=10.0 2023-06-22 22:44:49,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1337532.0, ans=0.1 2023-06-22 22:45:54,746 INFO [train.py:996] (2/4) Epoch 8, batch 9500, loss[loss=0.2458, simple_loss=0.3132, pruned_loss=0.08922, over 21760.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3114, pruned_loss=0.08263, over 4269409.70 frames. ], batch size: 124, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:46:09,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1337772.0, ans=0.125 2023-06-22 22:46:14,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1337832.0, ans=0.125 2023-06-22 22:46:25,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1337832.0, ans=0.0 2023-06-22 22:46:36,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1337892.0, ans=0.125 2023-06-22 22:46:36,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1337892.0, ans=0.0 2023-06-22 22:46:50,866 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.196e+02 5.640e+02 7.713e+02 1.096e+03 2.487e+03, threshold=1.543e+03, percent-clipped=16.0 2023-06-22 22:46:58,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1337952.0, ans=0.125 2023-06-22 22:47:26,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-22 22:47:34,319 INFO [train.py:996] (2/4) Epoch 8, batch 9550, loss[loss=0.2798, simple_loss=0.3474, pruned_loss=0.1061, over 21708.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3159, pruned_loss=0.08414, over 4270146.67 frames. ], batch size: 351, lr: 3.78e-03, grad_scale: 16.0 2023-06-22 22:48:00,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1338132.0, ans=0.125 2023-06-22 22:48:00,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1338132.0, ans=0.0 2023-06-22 22:48:39,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-22 22:48:58,447 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.86 vs. 
limit=10.0 2023-06-22 22:49:14,064 INFO [train.py:996] (2/4) Epoch 8, batch 9600, loss[loss=0.2841, simple_loss=0.3336, pruned_loss=0.1173, over 21744.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3184, pruned_loss=0.08628, over 4280715.88 frames. ], batch size: 507, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:50:03,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.129e+02 4.133e+02 5.747e+02 7.464e+02 1.666e+03, threshold=1.149e+03, percent-clipped=1.0 2023-06-22 22:50:12,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.13 vs. limit=22.5 2023-06-22 22:50:15,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1338552.0, ans=0.125 2023-06-22 22:50:37,090 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.64 vs. limit=15.0 2023-06-22 22:50:49,764 INFO [train.py:996] (2/4) Epoch 8, batch 9650, loss[loss=0.245, simple_loss=0.3224, pruned_loss=0.08377, over 21461.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3188, pruned_loss=0.08624, over 4279907.64 frames. ], batch size: 211, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:51:05,809 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 22:51:58,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1338852.0, ans=0.125 2023-06-22 22:52:07,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1338912.0, ans=0.1 2023-06-22 22:52:28,553 INFO [train.py:996] (2/4) Epoch 8, batch 9700, loss[loss=0.2427, simple_loss=0.3604, pruned_loss=0.06256, over 20779.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3211, pruned_loss=0.08554, over 4273525.69 frames. ], batch size: 608, lr: 3.78e-03, grad_scale: 32.0 2023-06-22 22:53:07,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1339092.0, ans=0.125 2023-06-22 22:53:09,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. limit=10.0 2023-06-22 22:53:13,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1339092.0, ans=0.1 2023-06-22 22:53:17,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-22 22:53:18,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.568e+02 6.321e+02 8.796e+02 1.656e+03, threshold=1.264e+03, percent-clipped=3.0 2023-06-22 22:53:20,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1339152.0, ans=0.1 2023-06-22 22:54:00,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1339212.0, ans=0.125 2023-06-22 22:54:05,574 INFO [train.py:996] (2/4) Epoch 8, batch 9750, loss[loss=0.2401, simple_loss=0.2845, pruned_loss=0.09784, over 21222.00 frames. 
], tot_loss[loss=0.2427, simple_loss=0.3158, pruned_loss=0.08484, over 4279710.46 frames. ], batch size: 471, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:54:07,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1339272.0, ans=0.0 2023-06-22 22:54:12,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1339272.0, ans=0.125 2023-06-22 22:55:03,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1339452.0, ans=0.125 2023-06-22 22:55:06,276 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.63 vs. limit=15.0 2023-06-22 22:55:19,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-22 22:55:42,074 INFO [train.py:996] (2/4) Epoch 8, batch 9800, loss[loss=0.2195, simple_loss=0.2926, pruned_loss=0.07317, over 21844.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3158, pruned_loss=0.08488, over 4262809.56 frames. ], batch size: 414, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:55:43,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.29 vs. limit=15.0 2023-06-22 22:56:09,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1339632.0, ans=0.125 2023-06-22 22:56:09,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1339632.0, ans=0.125 2023-06-22 22:56:22,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1339692.0, ans=0.1 2023-06-22 22:56:28,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-22 22:56:31,477 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.189e+02 3.646e+02 4.309e+02 6.187e+02 1.699e+03, threshold=8.618e+02, percent-clipped=3.0 2023-06-22 22:56:40,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1339752.0, ans=0.1 2023-06-22 22:57:00,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1339812.0, ans=0.2 2023-06-22 22:57:19,959 INFO [train.py:996] (2/4) Epoch 8, batch 9850, loss[loss=0.2285, simple_loss=0.2975, pruned_loss=0.07973, over 21867.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3113, pruned_loss=0.08478, over 4257924.17 frames. 
], batch size: 107, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:57:24,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1339872.0, ans=0.1 2023-06-22 22:57:40,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1339932.0, ans=0.04949747468305833 2023-06-22 22:58:11,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1340052.0, ans=0.0 2023-06-22 22:58:31,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1340112.0, ans=0.0 2023-06-22 22:58:39,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1340112.0, ans=0.5 2023-06-22 22:58:54,069 INFO [train.py:996] (2/4) Epoch 8, batch 9900, loss[loss=0.2375, simple_loss=0.2967, pruned_loss=0.08914, over 21883.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3075, pruned_loss=0.08401, over 4261244.18 frames. ], batch size: 373, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 22:59:45,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.205e+02 4.510e+02 5.793e+02 9.115e+02 1.830e+03, threshold=1.159e+03, percent-clipped=29.0 2023-06-22 23:00:16,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1340412.0, ans=0.125 2023-06-22 23:00:17,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1340412.0, ans=0.0 2023-06-22 23:00:29,931 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0 2023-06-22 23:00:33,416 INFO [train.py:996] (2/4) Epoch 8, batch 9950, loss[loss=0.2677, simple_loss=0.3229, pruned_loss=0.1062, over 21570.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3117, pruned_loss=0.08691, over 4262147.34 frames. ], batch size: 415, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:00:40,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=15.0 2023-06-22 23:01:18,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1340592.0, ans=0.0 2023-06-22 23:01:41,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1340652.0, ans=0.0 2023-06-22 23:01:59,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1340712.0, ans=0.2 2023-06-22 23:02:10,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-22 23:02:13,485 INFO [train.py:996] (2/4) Epoch 8, batch 10000, loss[loss=0.2455, simple_loss=0.3063, pruned_loss=0.0924, over 21255.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3064, pruned_loss=0.08483, over 4253919.69 frames. 
], batch size: 159, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:02:18,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1340772.0, ans=0.0 2023-06-22 23:02:27,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1340772.0, ans=0.1 2023-06-22 23:02:52,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1340892.0, ans=0.125 2023-06-22 23:03:05,267 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.683e+02 4.495e+02 6.092e+02 8.521e+02 2.124e+03, threshold=1.218e+03, percent-clipped=12.0 2023-06-22 23:03:37,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.24 vs. limit=22.5 2023-06-22 23:03:54,440 INFO [train.py:996] (2/4) Epoch 8, batch 10050, loss[loss=0.2369, simple_loss=0.3065, pruned_loss=0.08365, over 21644.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3103, pruned_loss=0.08605, over 4260559.73 frames. ], batch size: 415, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:03:56,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1341072.0, ans=0.02 2023-06-22 23:04:26,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1341132.0, ans=0.0 2023-06-22 23:04:47,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1341192.0, ans=0.0 2023-06-22 23:05:22,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1341312.0, ans=0.125 2023-06-22 23:05:29,780 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-22 23:05:32,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1341372.0, ans=10.0 2023-06-22 23:05:33,491 INFO [train.py:996] (2/4) Epoch 8, batch 10100, loss[loss=0.2773, simple_loss=0.3608, pruned_loss=0.09688, over 21666.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3071, pruned_loss=0.0828, over 4251802.53 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:05:44,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1341372.0, ans=0.0 2023-06-22 23:05:51,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.23 vs. 
limit=15.0 2023-06-22 23:05:55,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1341432.0, ans=0.1 2023-06-22 23:06:40,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.964e+02 4.513e+02 5.773e+02 8.039e+02 1.456e+03, threshold=1.155e+03, percent-clipped=7.0 2023-06-22 23:06:51,804 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:07:06,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1341612.0, ans=0.0 2023-06-22 23:07:08,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1341612.0, ans=0.1 2023-06-22 23:07:16,817 INFO [train.py:996] (2/4) Epoch 8, batch 10150, loss[loss=0.2089, simple_loss=0.2781, pruned_loss=0.06987, over 21210.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3139, pruned_loss=0.08529, over 4264562.15 frames. ], batch size: 608, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:07:25,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1341672.0, ans=0.07 2023-06-22 23:07:28,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1341672.0, ans=0.2 2023-06-22 23:08:40,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1341912.0, ans=0.0 2023-06-22 23:08:51,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.11 vs. limit=15.0 2023-06-22 23:08:55,262 INFO [train.py:996] (2/4) Epoch 8, batch 10200, loss[loss=0.2109, simple_loss=0.285, pruned_loss=0.06842, over 21437.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3131, pruned_loss=0.08315, over 4268296.69 frames. ], batch size: 131, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:08:55,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1341972.0, ans=0.0 2023-06-22 23:09:59,680 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.693e+02 4.053e+02 5.323e+02 7.160e+02 1.292e+03, threshold=1.065e+03, percent-clipped=4.0 2023-06-22 23:10:01,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1342152.0, ans=0.125 2023-06-22 23:10:35,042 INFO [train.py:996] (2/4) Epoch 8, batch 10250, loss[loss=0.2625, simple_loss=0.3347, pruned_loss=0.09514, over 21283.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3075, pruned_loss=0.07794, over 4264833.24 frames. ], batch size: 159, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:10:37,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. 
limit=12.0 2023-06-22 23:11:12,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1342332.0, ans=0.125 2023-06-22 23:11:14,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1342332.0, ans=0.1 2023-06-22 23:11:30,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1342392.0, ans=0.125 2023-06-22 23:12:14,710 INFO [train.py:996] (2/4) Epoch 8, batch 10300, loss[loss=0.2466, simple_loss=0.3295, pruned_loss=0.0818, over 21736.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3092, pruned_loss=0.07764, over 4266947.20 frames. ], batch size: 247, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:12:56,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.39 vs. limit=15.0 2023-06-22 23:13:10,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1342692.0, ans=0.125 2023-06-22 23:13:12,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1342692.0, ans=0.1 2023-06-22 23:13:14,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1342692.0, ans=0.0 2023-06-22 23:13:20,047 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.911e+02 4.156e+02 6.277e+02 8.296e+02 2.131e+03, threshold=1.255e+03, percent-clipped=15.0 2023-06-22 23:14:00,720 INFO [train.py:996] (2/4) Epoch 8, batch 10350, loss[loss=0.1793, simple_loss=0.2318, pruned_loss=0.06336, over 21163.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3108, pruned_loss=0.07754, over 4273642.13 frames. ], batch size: 143, lr: 3.77e-03, grad_scale: 8.0 2023-06-22 23:14:39,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1342932.0, ans=0.0 2023-06-22 23:14:40,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1342932.0, ans=0.2 2023-06-22 23:14:47,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1342992.0, ans=0.125 2023-06-22 23:14:53,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1342992.0, ans=0.0 2023-06-22 23:15:44,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.21 vs. limit=22.5 2023-06-22 23:15:51,349 INFO [train.py:996] (2/4) Epoch 8, batch 10400, loss[loss=0.2397, simple_loss=0.3177, pruned_loss=0.08087, over 21768.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3006, pruned_loss=0.07582, over 4258558.20 frames. ], batch size: 352, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:15:57,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.30 vs. 
limit=5.0 2023-06-22 23:16:15,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1343232.0, ans=0.1 2023-06-22 23:16:45,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.450e+02 4.781e+02 6.358e+02 9.315e+02 2.129e+03, threshold=1.272e+03, percent-clipped=10.0 2023-06-22 23:16:56,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1343352.0, ans=0.125 2023-06-22 23:17:18,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1343412.0, ans=0.025 2023-06-22 23:17:31,396 INFO [train.py:996] (2/4) Epoch 8, batch 10450, loss[loss=0.2098, simple_loss=0.2921, pruned_loss=0.06371, over 21800.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.305, pruned_loss=0.07883, over 4259847.71 frames. ], batch size: 282, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:17:31,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1343472.0, ans=0.125 2023-06-22 23:18:05,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-22 23:18:52,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1343712.0, ans=0.2 2023-06-22 23:19:09,663 INFO [train.py:996] (2/4) Epoch 8, batch 10500, loss[loss=0.2272, simple_loss=0.2886, pruned_loss=0.08287, over 21509.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3051, pruned_loss=0.07761, over 4259645.49 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:19:24,604 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.36 vs. limit=15.0 2023-06-22 23:19:29,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1343832.0, ans=0.0 2023-06-22 23:19:52,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1343892.0, ans=0.0 2023-06-22 23:20:03,199 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.164e+02 4.877e+02 7.502e+02 1.116e+03 2.000e+03, threshold=1.500e+03, percent-clipped=17.0 2023-06-22 23:20:51,447 INFO [train.py:996] (2/4) Epoch 8, batch 10550, loss[loss=0.2208, simple_loss=0.2859, pruned_loss=0.07787, over 21812.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3006, pruned_loss=0.07808, over 4253185.50 frames. ], batch size: 352, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:21:34,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1344192.0, ans=0.0 2023-06-22 23:22:01,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-22 23:22:01,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.72 vs. 
limit=15.0 2023-06-22 23:22:29,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1344312.0, ans=0.125 2023-06-22 23:22:30,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1344312.0, ans=0.0 2023-06-22 23:22:43,191 INFO [train.py:996] (2/4) Epoch 8, batch 10600, loss[loss=0.2035, simple_loss=0.2814, pruned_loss=0.06279, over 21188.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2972, pruned_loss=0.07654, over 4256994.26 frames. ], batch size: 548, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:23:49,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.068e+02 4.057e+02 5.623e+02 8.035e+02 1.796e+03, threshold=1.125e+03, percent-clipped=5.0 2023-06-22 23:24:12,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1344552.0, ans=0.0 2023-06-22 23:24:35,504 INFO [train.py:996] (2/4) Epoch 8, batch 10650, loss[loss=0.1586, simple_loss=0.2202, pruned_loss=0.04852, over 21781.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2982, pruned_loss=0.07532, over 4270226.65 frames. ], batch size: 124, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:24:46,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1344672.0, ans=0.1 2023-06-22 23:25:48,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1344852.0, ans=0.0 2023-06-22 23:26:30,718 INFO [train.py:996] (2/4) Epoch 8, batch 10700, loss[loss=0.2856, simple_loss=0.3568, pruned_loss=0.1072, over 21396.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2976, pruned_loss=0.07613, over 4273892.16 frames. ], batch size: 131, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:26:34,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1344972.0, ans=0.1 2023-06-22 23:26:51,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.95 vs. limit=22.5 2023-06-22 23:27:18,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1345092.0, ans=0.0 2023-06-22 23:27:20,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1345092.0, ans=0.125 2023-06-22 23:27:36,623 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.264e+02 5.189e+02 6.885e+02 9.178e+02 1.741e+03, threshold=1.377e+03, percent-clipped=11.0 2023-06-22 23:27:50,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1345212.0, ans=0.125 2023-06-22 23:28:13,087 INFO [train.py:996] (2/4) Epoch 8, batch 10750, loss[loss=0.2449, simple_loss=0.3354, pruned_loss=0.07724, over 21751.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3109, pruned_loss=0.08148, over 4277036.07 frames. 
], batch size: 247, lr: 3.77e-03, grad_scale: 16.0 2023-06-22 23:28:46,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1345332.0, ans=0.125 2023-06-22 23:28:48,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1345332.0, ans=0.05 2023-06-22 23:29:55,283 INFO [train.py:996] (2/4) Epoch 8, batch 10800, loss[loss=0.2573, simple_loss=0.3287, pruned_loss=0.09292, over 21533.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3149, pruned_loss=0.08124, over 4276477.47 frames. ], batch size: 230, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:31:04,396 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.247e+02 4.824e+02 6.509e+02 9.810e+02 2.428e+03, threshold=1.302e+03, percent-clipped=4.0 2023-06-22 23:31:06,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1345752.0, ans=0.125 2023-06-22 23:31:17,874 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.40 vs. limit=22.5 2023-06-22 23:31:20,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1345812.0, ans=0.125 2023-06-22 23:31:39,553 INFO [train.py:996] (2/4) Epoch 8, batch 10850, loss[loss=0.2334, simple_loss=0.3337, pruned_loss=0.06653, over 20776.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3162, pruned_loss=0.08188, over 4278163.18 frames. ], batch size: 609, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:32:13,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1345932.0, ans=0.125 2023-06-22 23:33:19,382 INFO [train.py:996] (2/4) Epoch 8, batch 10900, loss[loss=0.2368, simple_loss=0.3425, pruned_loss=0.06558, over 21599.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3099, pruned_loss=0.08075, over 4267201.82 frames. ], batch size: 441, lr: 3.77e-03, grad_scale: 32.0 2023-06-22 23:33:50,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1346232.0, ans=0.125 2023-06-22 23:33:56,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1346232.0, ans=0.1 2023-06-22 23:34:21,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1346352.0, ans=0.2 2023-06-22 23:34:23,719 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.825e+02 3.987e+02 5.547e+02 7.924e+02 1.642e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-22 23:35:00,083 INFO [train.py:996] (2/4) Epoch 8, batch 10950, loss[loss=0.2088, simple_loss=0.279, pruned_loss=0.06924, over 21370.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3051, pruned_loss=0.07861, over 4252329.87 frames. 
], batch size: 131, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:35:35,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1346532.0, ans=0.0 2023-06-22 23:35:45,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1346592.0, ans=0.125 2023-06-22 23:35:54,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1346592.0, ans=0.125 2023-06-22 23:36:01,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1346652.0, ans=0.2 2023-06-22 23:36:37,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1346772.0, ans=0.0 2023-06-22 23:36:38,544 INFO [train.py:996] (2/4) Epoch 8, batch 11000, loss[loss=0.2355, simple_loss=0.2938, pruned_loss=0.08858, over 20155.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3035, pruned_loss=0.07976, over 4251846.26 frames. ], batch size: 702, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:37:42,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.878e+02 3.827e+02 4.499e+02 6.468e+02 1.217e+03, threshold=8.999e+02, percent-clipped=2.0 2023-06-22 23:38:15,769 INFO [train.py:996] (2/4) Epoch 8, batch 11050, loss[loss=0.2288, simple_loss=0.2895, pruned_loss=0.08408, over 21802.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3014, pruned_loss=0.08111, over 4253902.33 frames. ], batch size: 112, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:38:36,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1347132.0, ans=0.1 2023-06-22 23:39:32,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1347312.0, ans=10.0 2023-06-22 23:39:54,417 INFO [train.py:996] (2/4) Epoch 8, batch 11100, loss[loss=0.2518, simple_loss=0.3148, pruned_loss=0.09441, over 21663.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3015, pruned_loss=0.08146, over 4239970.01 frames. ], batch size: 282, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:39:56,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1347372.0, ans=0.125 2023-06-22 23:40:38,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1347492.0, ans=0.125 2023-06-22 23:41:00,900 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.278e+02 4.372e+02 5.317e+02 7.818e+02 1.562e+03, threshold=1.063e+03, percent-clipped=13.0 2023-06-22 23:41:07,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1347552.0, ans=0.125 2023-06-22 23:41:34,682 INFO [train.py:996] (2/4) Epoch 8, batch 11150, loss[loss=0.242, simple_loss=0.3331, pruned_loss=0.07547, over 21603.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3002, pruned_loss=0.08191, over 4242557.42 frames. 
], batch size: 441, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:42:37,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1347852.0, ans=0.2 2023-06-22 23:43:07,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1347912.0, ans=0.2 2023-06-22 23:43:13,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1347972.0, ans=0.125 2023-06-22 23:43:15,306 INFO [train.py:996] (2/4) Epoch 8, batch 11200, loss[loss=0.1991, simple_loss=0.2616, pruned_loss=0.06829, over 21377.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.2982, pruned_loss=0.08199, over 4240054.52 frames. ], batch size: 131, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:43:20,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1347972.0, ans=0.125 2023-06-22 23:43:20,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-22 23:44:06,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-22 23:44:10,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1348092.0, ans=0.04949747468305833 2023-06-22 23:44:13,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1348092.0, ans=0.125 2023-06-22 23:44:19,992 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 4.266e+02 5.477e+02 7.611e+02 1.407e+03, threshold=1.095e+03, percent-clipped=4.0 2023-06-22 23:44:53,121 INFO [train.py:996] (2/4) Epoch 8, batch 11250, loss[loss=0.2263, simple_loss=0.3165, pruned_loss=0.06799, over 21588.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.2975, pruned_loss=0.08133, over 4251871.12 frames. ], batch size: 414, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:45:09,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1348332.0, ans=0.1 2023-06-22 23:45:26,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1348332.0, ans=0.0 2023-06-22 23:46:31,389 INFO [train.py:996] (2/4) Epoch 8, batch 11300, loss[loss=0.1956, simple_loss=0.2787, pruned_loss=0.05622, over 21866.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.299, pruned_loss=0.08099, over 4255259.68 frames. ], batch size: 316, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:46:37,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1348572.0, ans=0.0 2023-06-22 23:47:36,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1348752.0, ans=0.0 2023-06-22 23:47:39,336 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.094e+02 3.884e+02 4.784e+02 6.961e+02 1.768e+03, threshold=9.568e+02, percent-clipped=7.0 2023-06-22 23:48:11,944 INFO [train.py:996] (2/4) Epoch 8, batch 11350, loss[loss=0.2284, simple_loss=0.3157, pruned_loss=0.07051, over 21711.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3004, pruned_loss=0.08064, over 4254090.48 frames. 
], batch size: 263, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:48:14,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1348872.0, ans=0.1 2023-06-22 23:48:41,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1348932.0, ans=0.125 2023-06-22 23:49:14,485 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:49:17,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1349052.0, ans=0.125 2023-06-22 23:49:28,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1349052.0, ans=0.0 2023-06-22 23:49:51,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1349112.0, ans=0.0 2023-06-22 23:49:54,084 INFO [train.py:996] (2/4) Epoch 8, batch 11400, loss[loss=0.2327, simple_loss=0.3188, pruned_loss=0.07333, over 21714.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3066, pruned_loss=0.08238, over 4252055.87 frames. ], batch size: 298, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:50:11,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1349172.0, ans=0.125 2023-06-22 23:50:14,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1349232.0, ans=0.125 2023-06-22 23:50:51,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1349292.0, ans=0.125 2023-06-22 23:50:58,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.22 vs. limit=15.0 2023-06-22 23:51:07,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.908e+02 4.448e+02 6.055e+02 8.360e+02 1.667e+03, threshold=1.211e+03, percent-clipped=10.0 2023-06-22 23:51:37,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1349412.0, ans=0.1 2023-06-22 23:51:39,976 INFO [train.py:996] (2/4) Epoch 8, batch 11450, loss[loss=0.3162, simple_loss=0.3746, pruned_loss=0.1289, over 21384.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3085, pruned_loss=0.08189, over 4251665.08 frames. ], batch size: 508, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:51:42,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1349472.0, ans=0.2 2023-06-22 23:52:07,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.57 vs. limit=15.0 2023-06-22 23:52:44,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.43 vs. 
limit=10.0 2023-06-22 23:53:01,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1349712.0, ans=0.125 2023-06-22 23:53:11,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1349712.0, ans=0.2 2023-06-22 23:53:17,207 INFO [train.py:996] (2/4) Epoch 8, batch 11500, loss[loss=0.225, simple_loss=0.3034, pruned_loss=0.07337, over 21192.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3117, pruned_loss=0.08295, over 4254807.20 frames. ], batch size: 143, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:53:26,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-22 23:53:31,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-22 23:53:50,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1349832.0, ans=0.125 2023-06-22 23:54:03,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1349892.0, ans=0.125 2023-06-22 23:54:09,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.63 vs. limit=15.0 2023-06-22 23:54:22,046 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.283e+02 4.405e+02 5.880e+02 8.915e+02 1.909e+03, threshold=1.176e+03, percent-clipped=7.0 2023-06-22 23:54:24,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1349952.0, ans=0.0 2023-06-22 23:54:55,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-22 23:55:04,812 INFO [train.py:996] (2/4) Epoch 8, batch 11550, loss[loss=0.1961, simple_loss=0.2612, pruned_loss=0.0655, over 20788.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3174, pruned_loss=0.08293, over 4252985.97 frames. ], batch size: 608, lr: 3.76e-03, grad_scale: 16.0 2023-06-22 23:55:42,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1350192.0, ans=0.0 2023-06-22 23:55:47,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1350192.0, ans=10.0 2023-06-22 23:55:57,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1350192.0, ans=0.0 2023-06-22 23:56:32,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1350312.0, ans=0.125 2023-06-22 23:56:38,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1350312.0, ans=0.125 2023-06-22 23:56:46,824 INFO [train.py:996] (2/4) Epoch 8, batch 11600, loss[loss=0.2652, simple_loss=0.3496, pruned_loss=0.09035, over 21343.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3334, pruned_loss=0.08507, over 4258680.49 frames. 
], batch size: 159, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:57:12,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.68 vs. limit=15.0 2023-06-22 23:57:15,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-22 23:57:20,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-22 23:57:29,305 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-22 23:57:48,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1350552.0, ans=0.125 2023-06-22 23:57:49,974 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.035e+02 5.071e+02 7.210e+02 9.611e+02 2.245e+03, threshold=1.442e+03, percent-clipped=13.0 2023-06-22 23:58:00,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.86 vs. limit=22.5 2023-06-22 23:58:15,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1350612.0, ans=0.125 2023-06-22 23:58:27,169 INFO [train.py:996] (2/4) Epoch 8, batch 11650, loss[loss=0.2404, simple_loss=0.3201, pruned_loss=0.08033, over 21734.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3403, pruned_loss=0.08521, over 4261784.74 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 32.0 2023-06-22 23:58:45,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.20 vs. limit=22.5 2023-06-22 23:58:55,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1350732.0, ans=0.0 2023-06-22 23:59:54,311 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-23 00:00:00,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-23 00:00:05,890 INFO [train.py:996] (2/4) Epoch 8, batch 11700, loss[loss=0.2333, simple_loss=0.2896, pruned_loss=0.08852, over 21572.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3302, pruned_loss=0.08455, over 4263374.56 frames. 
], batch size: 263, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:00:54,294 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 00:01:02,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1351152.0, ans=0.125 2023-06-23 00:01:08,222 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.318e+02 4.484e+02 5.547e+02 7.973e+02 1.731e+03, threshold=1.109e+03, percent-clipped=2.0 2023-06-23 00:01:11,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1351152.0, ans=0.125 2023-06-23 00:01:19,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1351152.0, ans=0.0 2023-06-23 00:01:45,120 INFO [train.py:996] (2/4) Epoch 8, batch 11750, loss[loss=0.2939, simple_loss=0.3214, pruned_loss=0.1332, over 21524.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3207, pruned_loss=0.08469, over 4263792.23 frames. ], batch size: 512, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:01:57,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1351272.0, ans=0.125 2023-06-23 00:02:03,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1351272.0, ans=0.0 2023-06-23 00:02:33,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1351392.0, ans=0.1 2023-06-23 00:02:41,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.15 vs. limit=15.0 2023-06-23 00:02:52,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1351452.0, ans=0.1 2023-06-23 00:02:56,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1351452.0, ans=0.125 2023-06-23 00:02:56,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=12.0 2023-06-23 00:03:31,076 INFO [train.py:996] (2/4) Epoch 8, batch 11800, loss[loss=0.259, simple_loss=0.32, pruned_loss=0.09907, over 21823.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3231, pruned_loss=0.08691, over 4255473.17 frames. ], batch size: 247, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:03:52,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1351632.0, ans=0.2 2023-06-23 00:03:53,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.08 vs. 
limit=22.5 2023-06-23 00:04:22,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1351752.0, ans=0.125 2023-06-23 00:04:33,193 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.146e+02 4.952e+02 6.755e+02 1.112e+03 2.056e+03, threshold=1.351e+03, percent-clipped=25.0 2023-06-23 00:04:51,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1351812.0, ans=0.0 2023-06-23 00:04:55,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1351812.0, ans=0.125 2023-06-23 00:05:11,119 INFO [train.py:996] (2/4) Epoch 8, batch 11850, loss[loss=0.2303, simple_loss=0.3316, pruned_loss=0.06455, over 21776.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3242, pruned_loss=0.0861, over 4265422.57 frames. ], batch size: 351, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:05:11,698 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 00:05:11,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1351872.0, ans=0.04949747468305833 2023-06-23 00:05:40,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1351932.0, ans=0.125 2023-06-23 00:05:42,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1351932.0, ans=0.125 2023-06-23 00:06:01,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1351992.0, ans=0.125 2023-06-23 00:06:21,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1352052.0, ans=0.2 2023-06-23 00:06:44,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1352112.0, ans=0.125 2023-06-23 00:06:52,030 INFO [train.py:996] (2/4) Epoch 8, batch 11900, loss[loss=0.2807, simple_loss=0.3535, pruned_loss=0.104, over 21396.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3243, pruned_loss=0.08377, over 4267914.89 frames. ], batch size: 507, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:06:54,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1352172.0, ans=0.04949747468305833 2023-06-23 00:07:04,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.99 vs. limit=15.0 2023-06-23 00:07:05,680 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 00:07:22,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1352232.0, ans=0.0 2023-06-23 00:08:07,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.952e+02 4.101e+02 5.216e+02 6.925e+02 1.642e+03, threshold=1.043e+03, percent-clipped=1.0 2023-06-23 00:08:35,007 INFO [train.py:996] (2/4) Epoch 8, batch 11950, loss[loss=0.2519, simple_loss=0.3519, pruned_loss=0.07597, over 21690.00 frames. ], tot_loss[loss=0.243, simple_loss=0.325, pruned_loss=0.08052, over 4265693.25 frames. 
], batch size: 247, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:08:37,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1352472.0, ans=0.1 2023-06-23 00:09:19,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.17 vs. limit=10.0 2023-06-23 00:09:26,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1352592.0, ans=0.125 2023-06-23 00:10:13,529 INFO [train.py:996] (2/4) Epoch 8, batch 12000, loss[loss=0.2038, simple_loss=0.2712, pruned_loss=0.06819, over 21681.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3176, pruned_loss=0.07832, over 4266317.74 frames. ], batch size: 282, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:10:13,529 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 00:10:32,704 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2606, simple_loss=0.356, pruned_loss=0.08257, over 1796401.00 frames. 2023-06-23 00:10:32,705 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 00:11:07,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1352832.0, ans=0.125 2023-06-23 00:11:18,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1352892.0, ans=0.1 2023-06-23 00:11:39,452 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.500e+02 4.103e+02 5.711e+02 8.012e+02 1.968e+03, threshold=1.142e+03, percent-clipped=13.0 2023-06-23 00:12:11,459 INFO [train.py:996] (2/4) Epoch 8, batch 12050, loss[loss=0.223, simple_loss=0.2932, pruned_loss=0.07635, over 21675.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3137, pruned_loss=0.08059, over 4267284.58 frames. ], batch size: 230, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:12:21,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1353072.0, ans=0.05 2023-06-23 00:12:42,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=22.5 2023-06-23 00:12:51,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1353132.0, ans=0.125 2023-06-23 00:13:09,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1353192.0, ans=0.125 2023-06-23 00:13:17,322 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.52 vs. limit=15.0 2023-06-23 00:13:22,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1353252.0, ans=0.0 2023-06-23 00:13:53,212 INFO [train.py:996] (2/4) Epoch 8, batch 12100, loss[loss=0.2976, simple_loss=0.3559, pruned_loss=0.1197, over 21310.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3215, pruned_loss=0.08505, over 4273581.63 frames. 
], batch size: 176, lr: 3.76e-03, grad_scale: 32.0 2023-06-23 00:14:40,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1353492.0, ans=0.5 2023-06-23 00:14:47,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1353492.0, ans=0.1 2023-06-23 00:15:04,087 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.947e+02 5.144e+02 7.244e+02 1.095e+03 2.232e+03, threshold=1.449e+03, percent-clipped=22.0 2023-06-23 00:15:06,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-23 00:15:19,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1353612.0, ans=0.125 2023-06-23 00:15:45,909 INFO [train.py:996] (2/4) Epoch 8, batch 12150, loss[loss=0.2334, simple_loss=0.329, pruned_loss=0.06894, over 21694.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3238, pruned_loss=0.08446, over 4265732.83 frames. ], batch size: 298, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:17:00,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.02 vs. limit=8.0 2023-06-23 00:17:25,412 INFO [train.py:996] (2/4) Epoch 8, batch 12200, loss[loss=0.213, simple_loss=0.2695, pruned_loss=0.07821, over 21589.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3182, pruned_loss=0.08333, over 4266231.43 frames. ], batch size: 231, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:18:18,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.56 vs. limit=15.0 2023-06-23 00:18:19,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1354152.0, ans=0.125 2023-06-23 00:18:27,542 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.162e+02 4.597e+02 6.328e+02 9.392e+02 1.574e+03, threshold=1.266e+03, percent-clipped=2.0 2023-06-23 00:18:41,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1354212.0, ans=0.1 2023-06-23 00:18:52,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1354212.0, ans=0.125 2023-06-23 00:18:52,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1354212.0, ans=0.09899494936611666 2023-06-23 00:18:56,164 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-23 00:19:03,022 INFO [train.py:996] (2/4) Epoch 8, batch 12250, loss[loss=0.1919, simple_loss=0.2845, pruned_loss=0.04964, over 21705.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3099, pruned_loss=0.07946, over 4264383.26 frames. 
], batch size: 415, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:19:03,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1354272.0, ans=0.125 2023-06-23 00:19:09,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1354272.0, ans=0.125 2023-06-23 00:20:15,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1354452.0, ans=0.125 2023-06-23 00:20:41,578 INFO [train.py:996] (2/4) Epoch 8, batch 12300, loss[loss=0.279, simple_loss=0.3661, pruned_loss=0.09595, over 21660.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3038, pruned_loss=0.07476, over 4264802.00 frames. ], batch size: 441, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:20:43,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1354572.0, ans=0.125 2023-06-23 00:21:41,015 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.472e+02 4.122e+02 6.339e+02 8.293e+02 1.636e+03, threshold=1.268e+03, percent-clipped=3.0 2023-06-23 00:22:00,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1354812.0, ans=0.0 2023-06-23 00:22:01,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1354812.0, ans=0.0 2023-06-23 00:22:22,498 INFO [train.py:996] (2/4) Epoch 8, batch 12350, loss[loss=0.2208, simple_loss=0.3478, pruned_loss=0.04688, over 20782.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3075, pruned_loss=0.07421, over 4266350.07 frames. ], batch size: 607, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:22:34,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1354872.0, ans=0.125 2023-06-23 00:22:39,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1354872.0, ans=0.09899494936611666 2023-06-23 00:23:37,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1355052.0, ans=0.1 2023-06-23 00:23:48,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1355112.0, ans=0.2 2023-06-23 00:23:53,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1355112.0, ans=0.2 2023-06-23 00:24:00,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1355172.0, ans=0.2 2023-06-23 00:24:01,244 INFO [train.py:996] (2/4) Epoch 8, batch 12400, loss[loss=0.229, simple_loss=0.2929, pruned_loss=0.08252, over 21850.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3103, pruned_loss=0.07868, over 4280747.36 frames. ], batch size: 298, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:24:37,144 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.10 vs. 
limit=15.0 2023-06-23 00:24:38,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1355292.0, ans=0.0 2023-06-23 00:24:48,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1355292.0, ans=0.035 2023-06-23 00:25:07,642 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.244e+02 4.565e+02 7.015e+02 1.038e+03 2.241e+03, threshold=1.403e+03, percent-clipped=10.0 2023-06-23 00:25:16,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.55 vs. limit=15.0 2023-06-23 00:25:42,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1355412.0, ans=0.125 2023-06-23 00:25:45,036 INFO [train.py:996] (2/4) Epoch 8, batch 12450, loss[loss=0.2563, simple_loss=0.3366, pruned_loss=0.08797, over 21919.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3149, pruned_loss=0.08204, over 4284643.58 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:26:58,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-23 00:27:12,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-23 00:27:15,038 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-23 00:27:27,300 INFO [train.py:996] (2/4) Epoch 8, batch 12500, loss[loss=0.2501, simple_loss=0.3442, pruned_loss=0.07797, over 21316.00 frames. ], tot_loss[loss=0.249, simple_loss=0.326, pruned_loss=0.08602, over 4285219.41 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:28:04,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1355832.0, ans=0.0 2023-06-23 00:28:22,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1355892.0, ans=0.2 2023-06-23 00:28:44,174 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 4.947e+02 7.092e+02 9.787e+02 2.648e+03, threshold=1.418e+03, percent-clipped=11.0 2023-06-23 00:28:51,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1355952.0, ans=0.2 2023-06-23 00:29:12,275 INFO [train.py:996] (2/4) Epoch 8, batch 12550, loss[loss=0.2283, simple_loss=0.3114, pruned_loss=0.07256, over 21407.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3296, pruned_loss=0.08761, over 4285033.98 frames. ], batch size: 211, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:29:46,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1356132.0, ans=0.2 2023-06-23 00:30:32,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1356252.0, ans=0.2 2023-06-23 00:30:58,474 INFO [train.py:996] (2/4) Epoch 8, batch 12600, loss[loss=0.2235, simple_loss=0.3205, pruned_loss=0.0632, over 21578.00 frames. ], tot_loss[loss=0.249, simple_loss=0.3284, pruned_loss=0.08479, over 4280261.06 frames. 
], batch size: 389, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:30:58,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1356372.0, ans=0.1 2023-06-23 00:31:31,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1356432.0, ans=0.2 2023-06-23 00:31:31,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1356432.0, ans=0.1 2023-06-23 00:31:41,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1356492.0, ans=0.125 2023-06-23 00:31:52,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=1356492.0, ans=12.0 2023-06-23 00:31:59,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1356552.0, ans=0.125 2023-06-23 00:32:06,798 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.884e+02 4.323e+02 5.935e+02 8.611e+02 2.067e+03, threshold=1.187e+03, percent-clipped=5.0 2023-06-23 00:32:24,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1356612.0, ans=0.125 2023-06-23 00:32:36,783 INFO [train.py:996] (2/4) Epoch 8, batch 12650, loss[loss=0.2348, simple_loss=0.3069, pruned_loss=0.08129, over 21711.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3203, pruned_loss=0.08057, over 4272190.96 frames. ], batch size: 389, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:33:14,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1356732.0, ans=0.125 2023-06-23 00:33:17,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1356792.0, ans=0.125 2023-06-23 00:33:32,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1356792.0, ans=0.2 2023-06-23 00:34:21,570 INFO [train.py:996] (2/4) Epoch 8, batch 12700, loss[loss=0.2214, simple_loss=0.3016, pruned_loss=0.07057, over 21498.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.32, pruned_loss=0.08347, over 4275278.15 frames. ], batch size: 211, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:34:57,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1357032.0, ans=0.1 2023-06-23 00:35:13,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1357092.0, ans=0.0 2023-06-23 00:35:25,459 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.987e+02 4.487e+02 5.893e+02 8.124e+02 1.594e+03, threshold=1.179e+03, percent-clipped=3.0 2023-06-23 00:35:29,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1357152.0, ans=0.0 2023-06-23 00:35:59,892 INFO [train.py:996] (2/4) Epoch 8, batch 12750, loss[loss=0.2345, simple_loss=0.3058, pruned_loss=0.08162, over 21774.00 frames. ], tot_loss[loss=0.245, simple_loss=0.322, pruned_loss=0.08398, over 4272581.48 frames. 
], batch size: 112, lr: 3.75e-03, grad_scale: 16.0 2023-06-23 00:36:22,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1357332.0, ans=0.2 2023-06-23 00:36:42,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.34 vs. limit=15.0 2023-06-23 00:36:49,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.04 vs. limit=12.0 2023-06-23 00:36:50,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1357392.0, ans=0.125 2023-06-23 00:37:24,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1357512.0, ans=0.2 2023-06-23 00:37:25,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1357512.0, ans=0.125 2023-06-23 00:37:35,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1357512.0, ans=0.125 2023-06-23 00:37:42,784 INFO [train.py:996] (2/4) Epoch 8, batch 12800, loss[loss=0.2275, simple_loss=0.3024, pruned_loss=0.07626, over 21932.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3215, pruned_loss=0.08419, over 4269801.73 frames. ], batch size: 316, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:37:43,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-23 00:38:20,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1357692.0, ans=0.1 2023-06-23 00:38:42,075 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.559e+02 4.784e+02 6.128e+02 7.998e+02 1.838e+03, threshold=1.226e+03, percent-clipped=10.0 2023-06-23 00:38:56,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1357812.0, ans=0.125 2023-06-23 00:39:18,523 INFO [train.py:996] (2/4) Epoch 8, batch 12850, loss[loss=0.2151, simple_loss=0.3161, pruned_loss=0.05708, over 21770.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3233, pruned_loss=0.08597, over 4275733.85 frames. ], batch size: 332, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:40:03,924 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=22.5 2023-06-23 00:40:48,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1358112.0, ans=0.125 2023-06-23 00:41:03,880 INFO [train.py:996] (2/4) Epoch 8, batch 12900, loss[loss=0.1968, simple_loss=0.2729, pruned_loss=0.06037, over 21144.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3195, pruned_loss=0.08194, over 4275181.43 frames. ], batch size: 176, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:41:14,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.52 vs. 
limit=15.0 2023-06-23 00:41:22,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1358232.0, ans=0.0 2023-06-23 00:41:26,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.69 vs. limit=22.5 2023-06-23 00:41:53,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.68 vs. limit=15.0 2023-06-23 00:42:12,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.798e+02 4.070e+02 5.539e+02 8.932e+02 2.008e+03, threshold=1.108e+03, percent-clipped=7.0 2023-06-23 00:42:16,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1358352.0, ans=0.125 2023-06-23 00:42:26,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1358412.0, ans=0.125 2023-06-23 00:42:36,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1358412.0, ans=0.125 2023-06-23 00:42:43,896 INFO [train.py:996] (2/4) Epoch 8, batch 12950, loss[loss=0.1947, simple_loss=0.2755, pruned_loss=0.05691, over 21649.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3199, pruned_loss=0.0802, over 4268603.29 frames. ], batch size: 263, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:43:10,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1358532.0, ans=0.125 2023-06-23 00:43:12,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1358532.0, ans=0.2 2023-06-23 00:43:48,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0 2023-06-23 00:44:18,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-23 00:44:22,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1358772.0, ans=0.2 2023-06-23 00:44:24,125 INFO [train.py:996] (2/4) Epoch 8, batch 13000, loss[loss=0.2673, simple_loss=0.3755, pruned_loss=0.07953, over 19794.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3228, pruned_loss=0.08174, over 4266084.42 frames. 
], batch size: 702, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:44:28,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1358772.0, ans=0.2 2023-06-23 00:45:01,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1358832.0, ans=0.125 2023-06-23 00:45:06,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1358892.0, ans=0.0 2023-06-23 00:45:30,874 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.055e+02 4.812e+02 8.002e+02 1.036e+03 2.306e+03, threshold=1.600e+03, percent-clipped=23.0 2023-06-23 00:45:54,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1359012.0, ans=0.025 2023-06-23 00:46:00,832 INFO [train.py:996] (2/4) Epoch 8, batch 13050, loss[loss=0.1866, simple_loss=0.2334, pruned_loss=0.06989, over 19960.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3185, pruned_loss=0.08018, over 4261702.19 frames. ], batch size: 704, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:46:21,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1359132.0, ans=0.125 2023-06-23 00:46:30,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1359132.0, ans=0.1 2023-06-23 00:47:02,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1359252.0, ans=0.1 2023-06-23 00:47:39,003 INFO [train.py:996] (2/4) Epoch 8, batch 13100, loss[loss=0.227, simple_loss=0.3182, pruned_loss=0.06788, over 21703.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3185, pruned_loss=0.08026, over 4270884.32 frames. ], batch size: 298, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:48:18,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1359432.0, ans=0.0 2023-06-23 00:48:18,064 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.991e-03 2023-06-23 00:48:38,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.70 vs. limit=15.0 2023-06-23 00:48:52,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1359552.0, ans=0.125 2023-06-23 00:48:53,084 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.010e+02 4.152e+02 4.803e+02 6.425e+02 1.389e+03, threshold=9.605e+02, percent-clipped=0.0 2023-06-23 00:48:58,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1359552.0, ans=0.0 2023-06-23 00:49:23,639 INFO [train.py:996] (2/4) Epoch 8, batch 13150, loss[loss=0.2682, simple_loss=0.3383, pruned_loss=0.09907, over 21437.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.32, pruned_loss=0.08217, over 4275668.19 frames. 
], batch size: 211, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:50:03,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1359732.0, ans=0.0 2023-06-23 00:50:10,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1359792.0, ans=0.125 2023-06-23 00:50:12,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1359792.0, ans=0.125 2023-06-23 00:50:38,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1359852.0, ans=0.125 2023-06-23 00:50:40,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=15.0 2023-06-23 00:51:08,205 INFO [train.py:996] (2/4) Epoch 8, batch 13200, loss[loss=0.2303, simple_loss=0.295, pruned_loss=0.08281, over 21224.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3187, pruned_loss=0.08191, over 4270802.76 frames. ], batch size: 608, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:51:09,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-23 00:52:01,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1360092.0, ans=0.0 2023-06-23 00:52:13,781 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.899e+02 4.724e+02 6.289e+02 8.620e+02 1.453e+03, threshold=1.258e+03, percent-clipped=16.0 2023-06-23 00:52:40,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1360212.0, ans=0.0 2023-06-23 00:52:45,198 INFO [train.py:996] (2/4) Epoch 8, batch 13250, loss[loss=0.2467, simple_loss=0.3125, pruned_loss=0.09048, over 21657.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3188, pruned_loss=0.0845, over 4269129.07 frames. ], batch size: 230, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:53:03,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1360272.0, ans=0.125 2023-06-23 00:53:06,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1360332.0, ans=0.125 2023-06-23 00:54:31,040 INFO [train.py:996] (2/4) Epoch 8, batch 13300, loss[loss=0.2218, simple_loss=0.3415, pruned_loss=0.05102, over 19794.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3223, pruned_loss=0.08405, over 4268363.74 frames. ], batch size: 702, lr: 3.75e-03, grad_scale: 32.0 2023-06-23 00:55:01,596 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.54 vs. limit=15.0 2023-06-23 00:55:32,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff2.min_abs, batch_count=1360692.0, ans=0.1 2023-06-23 00:55:41,039 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.129e+02 4.712e+02 5.675e+02 7.796e+02 1.493e+03, threshold=1.135e+03, percent-clipped=5.0 2023-06-23 00:55:58,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=9.98 vs. 
limit=12.0 2023-06-23 00:56:11,971 INFO [train.py:996] (2/4) Epoch 8, batch 13350, loss[loss=0.2376, simple_loss=0.3225, pruned_loss=0.07633, over 21766.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3249, pruned_loss=0.08594, over 4271152.18 frames. ], batch size: 298, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 00:56:31,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1360932.0, ans=0.04949747468305833 2023-06-23 00:56:37,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-23 00:56:38,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-23 00:57:18,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1361052.0, ans=0.0 2023-06-23 00:57:42,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1361112.0, ans=0.0 2023-06-23 00:57:57,188 INFO [train.py:996] (2/4) Epoch 8, batch 13400, loss[loss=0.2459, simple_loss=0.3169, pruned_loss=0.08743, over 21460.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3255, pruned_loss=0.08801, over 4274565.23 frames. ], batch size: 194, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 00:59:01,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1361352.0, ans=0.0 2023-06-23 00:59:05,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.334e+02 4.523e+02 5.899e+02 7.481e+02 1.405e+03, threshold=1.180e+03, percent-clipped=3.0 2023-06-23 00:59:15,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1361412.0, ans=0.125 2023-06-23 00:59:36,199 INFO [train.py:996] (2/4) Epoch 8, batch 13450, loss[loss=0.2056, simple_loss=0.2655, pruned_loss=0.07284, over 21525.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3261, pruned_loss=0.0902, over 4277580.70 frames. ], batch size: 195, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 00:59:49,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1361472.0, ans=0.1 2023-06-23 00:59:49,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1361472.0, ans=0.125 2023-06-23 00:59:51,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.80 vs. limit=15.0 2023-06-23 00:59:55,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1361532.0, ans=0.1 2023-06-23 00:59:58,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1361532.0, ans=0.125 2023-06-23 01:00:59,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. 
limit=15.0 2023-06-23 01:01:07,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1361712.0, ans=0.1 2023-06-23 01:01:15,841 INFO [train.py:996] (2/4) Epoch 8, batch 13500, loss[loss=0.2626, simple_loss=0.3427, pruned_loss=0.09129, over 21284.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3166, pruned_loss=0.08667, over 4268237.50 frames. ], batch size: 549, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:01:54,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1361832.0, ans=0.2 2023-06-23 01:02:35,453 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.831e+02 4.391e+02 6.972e+02 1.115e+03 2.286e+03, threshold=1.394e+03, percent-clipped=24.0 2023-06-23 01:02:43,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1362012.0, ans=0.125 2023-06-23 01:02:57,271 INFO [train.py:996] (2/4) Epoch 8, batch 13550, loss[loss=0.2013, simple_loss=0.2712, pruned_loss=0.06573, over 21748.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3208, pruned_loss=0.085, over 4266066.20 frames. ], batch size: 112, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:03:05,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1362072.0, ans=0.125 2023-06-23 01:03:19,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1362132.0, ans=0.125 2023-06-23 01:03:24,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1362132.0, ans=0.1 2023-06-23 01:03:50,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1362192.0, ans=0.125 2023-06-23 01:04:02,562 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:04:11,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.54 vs. limit=10.0 2023-06-23 01:04:26,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1362312.0, ans=0.1 2023-06-23 01:04:26,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1362312.0, ans=0.07 2023-06-23 01:04:31,397 INFO [train.py:996] (2/4) Epoch 8, batch 13600, loss[loss=0.2479, simple_loss=0.3275, pruned_loss=0.08413, over 21826.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3209, pruned_loss=0.08506, over 4280943.75 frames. ], batch size: 124, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:05:43,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. 
limit=15.0 2023-06-23 01:05:47,093 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.174e+02 4.391e+02 6.199e+02 8.558e+02 2.268e+03, threshold=1.240e+03, percent-clipped=7.0 2023-06-23 01:05:47,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1362552.0, ans=0.035 2023-06-23 01:06:09,015 INFO [train.py:996] (2/4) Epoch 8, batch 13650, loss[loss=0.1956, simple_loss=0.2581, pruned_loss=0.06659, over 21564.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3149, pruned_loss=0.08204, over 4276320.68 frames. ], batch size: 263, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:06:45,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1362732.0, ans=0.1 2023-06-23 01:07:25,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.80 vs. limit=15.0 2023-06-23 01:07:43,691 INFO [train.py:996] (2/4) Epoch 8, batch 13700, loss[loss=0.2106, simple_loss=0.2796, pruned_loss=0.07079, over 21668.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3102, pruned_loss=0.08128, over 4262124.02 frames. ], batch size: 263, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:08:21,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1363032.0, ans=0.0 2023-06-23 01:09:00,901 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.420e+02 5.296e+02 7.505e+02 1.141e+03 2.334e+03, threshold=1.501e+03, percent-clipped=22.0 2023-06-23 01:09:32,541 INFO [train.py:996] (2/4) Epoch 8, batch 13750, loss[loss=0.225, simple_loss=0.3088, pruned_loss=0.07062, over 21681.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3109, pruned_loss=0.08172, over 4266047.39 frames. ], batch size: 351, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:09:48,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1363272.0, ans=0.125 2023-06-23 01:10:11,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1363392.0, ans=0.2 2023-06-23 01:10:31,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1363452.0, ans=0.125 2023-06-23 01:11:14,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1363512.0, ans=0.125 2023-06-23 01:11:20,754 INFO [train.py:996] (2/4) Epoch 8, batch 13800, loss[loss=0.2313, simple_loss=0.3174, pruned_loss=0.07262, over 21447.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.316, pruned_loss=0.08163, over 4262333.16 frames. ], batch size: 194, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:11:59,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1363692.0, ans=0.0 2023-06-23 01:12:13,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1363692.0, ans=0.1 2023-06-23 01:12:17,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1363752.0, ans=0.0 2023-06-23 01:12:21,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.00 vs. 
limit=15.0 2023-06-23 01:12:37,254 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.245e+02 4.845e+02 7.419e+02 1.036e+03 2.562e+03, threshold=1.484e+03, percent-clipped=7.0 2023-06-23 01:12:49,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1363812.0, ans=0.0 2023-06-23 01:13:00,659 INFO [train.py:996] (2/4) Epoch 8, batch 13850, loss[loss=0.2299, simple_loss=0.3238, pruned_loss=0.06805, over 21675.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3223, pruned_loss=0.08295, over 4264136.07 frames. ], batch size: 263, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:13:48,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1363992.0, ans=0.125 2023-06-23 01:14:14,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1364052.0, ans=0.0 2023-06-23 01:14:29,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1364112.0, ans=0.05 2023-06-23 01:14:39,105 INFO [train.py:996] (2/4) Epoch 8, batch 13900, loss[loss=0.2724, simple_loss=0.3444, pruned_loss=0.1002, over 21974.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3267, pruned_loss=0.08642, over 4271816.85 frames. ], batch size: 113, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:14:45,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1364172.0, ans=0.1 2023-06-23 01:15:55,318 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.327e+02 4.266e+02 5.471e+02 7.768e+02 2.129e+03, threshold=1.094e+03, percent-clipped=1.0 2023-06-23 01:16:03,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1364412.0, ans=0.2 2023-06-23 01:16:17,071 INFO [train.py:996] (2/4) Epoch 8, batch 13950, loss[loss=0.2519, simple_loss=0.3801, pruned_loss=0.06183, over 19752.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3261, pruned_loss=0.08791, over 4271448.08 frames. ], batch size: 702, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:17:17,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1364592.0, ans=0.125 2023-06-23 01:17:23,928 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-23 01:17:27,986 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:17:53,963 INFO [train.py:996] (2/4) Epoch 8, batch 14000, loss[loss=0.243, simple_loss=0.3374, pruned_loss=0.07433, over 21559.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3202, pruned_loss=0.08459, over 4258541.67 frames. ], batch size: 471, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 01:18:08,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1364772.0, ans=0.1 2023-06-23 01:18:33,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.19 vs. 
limit=22.5 2023-06-23 01:18:34,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1364892.0, ans=0.1 2023-06-23 01:19:08,957 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.634e+02 4.329e+02 5.834e+02 8.040e+02 1.947e+03, threshold=1.167e+03, percent-clipped=14.0 2023-06-23 01:19:30,088 INFO [train.py:996] (2/4) Epoch 8, batch 14050, loss[loss=0.2478, simple_loss=0.3076, pruned_loss=0.094, over 20176.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3149, pruned_loss=0.08091, over 4250320.84 frames. ], batch size: 703, lr: 3.74e-03, grad_scale: 32.0 2023-06-23 01:19:45,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1365072.0, ans=0.5 2023-06-23 01:19:52,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=22.47 vs. limit=22.5 2023-06-23 01:19:52,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1365132.0, ans=0.125 2023-06-23 01:20:05,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1365132.0, ans=0.125 2023-06-23 01:20:25,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1365192.0, ans=0.2 2023-06-23 01:20:33,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1365192.0, ans=0.125 2023-06-23 01:20:51,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1365312.0, ans=0.035 2023-06-23 01:21:12,102 INFO [train.py:996] (2/4) Epoch 8, batch 14100, loss[loss=0.2538, simple_loss=0.3141, pruned_loss=0.09674, over 21576.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3098, pruned_loss=0.0812, over 4245296.53 frames. ], batch size: 263, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:21:26,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1365432.0, ans=0.125 2023-06-23 01:21:49,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1365432.0, ans=0.05 2023-06-23 01:22:18,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1365552.0, ans=0.125 2023-06-23 01:22:24,443 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 4.771e+02 6.447e+02 8.696e+02 1.773e+03, threshold=1.289e+03, percent-clipped=8.0 2023-06-23 01:22:27,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1365612.0, ans=0.1 2023-06-23 01:22:30,816 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:22:43,439 INFO [train.py:996] (2/4) Epoch 8, batch 14150, loss[loss=0.2314, simple_loss=0.3276, pruned_loss=0.06759, over 21714.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3137, pruned_loss=0.08188, over 4251627.42 frames. 
], batch size: 298, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:22:56,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1365672.0, ans=0.07 2023-06-23 01:23:02,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1365732.0, ans=0.125 2023-06-23 01:24:00,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1365852.0, ans=0.125 2023-06-23 01:24:19,879 INFO [train.py:996] (2/4) Epoch 8, batch 14200, loss[loss=0.233, simple_loss=0.3079, pruned_loss=0.07908, over 21677.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3123, pruned_loss=0.08028, over 4243467.83 frames. ], batch size: 230, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:24:49,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1366032.0, ans=10.0 2023-06-23 01:25:08,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1366092.0, ans=0.125 2023-06-23 01:25:15,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1366092.0, ans=0.0 2023-06-23 01:25:32,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.024e+02 4.323e+02 5.337e+02 8.028e+02 2.442e+03, threshold=1.067e+03, percent-clipped=5.0 2023-06-23 01:25:57,607 INFO [train.py:996] (2/4) Epoch 8, batch 14250, loss[loss=0.2341, simple_loss=0.2966, pruned_loss=0.08579, over 21870.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.308, pruned_loss=0.08023, over 4251911.30 frames. ], batch size: 98, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:26:02,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1366272.0, ans=0.5 2023-06-23 01:26:15,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1366272.0, ans=0.2 2023-06-23 01:26:23,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1366332.0, ans=0.04949747468305833 2023-06-23 01:26:24,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-23 01:26:25,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1366332.0, ans=0.125 2023-06-23 01:27:05,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.47 vs. limit=10.0 2023-06-23 01:27:35,837 INFO [train.py:996] (2/4) Epoch 8, batch 14300, loss[loss=0.3574, simple_loss=0.452, pruned_loss=0.1314, over 21228.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3112, pruned_loss=0.0811, over 4246613.48 frames. 
], batch size: 549, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:28:54,523 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.931e+02 4.415e+02 6.422e+02 1.030e+03 2.040e+03, threshold=1.284e+03, percent-clipped=23.0 2023-06-23 01:29:05,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1366812.0, ans=0.0 2023-06-23 01:29:07,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1366812.0, ans=0.1 2023-06-23 01:29:08,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1366812.0, ans=0.125 2023-06-23 01:29:13,115 INFO [train.py:996] (2/4) Epoch 8, batch 14350, loss[loss=0.2351, simple_loss=0.3114, pruned_loss=0.07945, over 21874.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3175, pruned_loss=0.08179, over 4240654.77 frames. ], batch size: 371, lr: 3.74e-03, grad_scale: 8.0 2023-06-23 01:29:23,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.82 vs. limit=6.0 2023-06-23 01:29:48,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1366992.0, ans=0.125 2023-06-23 01:30:02,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1366992.0, ans=0.125 2023-06-23 01:30:37,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1367112.0, ans=0.1 2023-06-23 01:30:47,761 INFO [train.py:996] (2/4) Epoch 8, batch 14400, loss[loss=0.2356, simple_loss=0.2984, pruned_loss=0.08644, over 21773.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3148, pruned_loss=0.08236, over 4236740.50 frames. ], batch size: 316, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:31:24,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1367292.0, ans=0.125 2023-06-23 01:31:56,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.112e+02 4.154e+02 4.970e+02 6.969e+02 1.897e+03, threshold=9.939e+02, percent-clipped=6.0 2023-06-23 01:32:09,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1367412.0, ans=0.0 2023-06-23 01:32:10,181 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-23 01:32:19,356 INFO [train.py:996] (2/4) Epoch 8, batch 14450, loss[loss=0.1999, simple_loss=0.2677, pruned_loss=0.0661, over 21764.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3087, pruned_loss=0.0823, over 4238616.34 frames. 
], batch size: 316, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:32:24,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1367472.0, ans=0.125 2023-06-23 01:32:37,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1367472.0, ans=0.125 2023-06-23 01:32:40,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1367532.0, ans=0.0 2023-06-23 01:32:56,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-23 01:33:20,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-23 01:34:04,030 INFO [train.py:996] (2/4) Epoch 8, batch 14500, loss[loss=0.2152, simple_loss=0.2917, pruned_loss=0.06933, over 21170.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3037, pruned_loss=0.08157, over 4249001.14 frames. ], batch size: 143, lr: 3.74e-03, grad_scale: 16.0 2023-06-23 01:34:10,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-23 01:34:43,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1367832.0, ans=0.05 2023-06-23 01:35:10,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1367952.0, ans=0.1 2023-06-23 01:35:14,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-23 01:35:16,220 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=15.0 2023-06-23 01:35:21,290 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.784e+02 6.137e+02 8.722e+02 1.642e+03, threshold=1.227e+03, percent-clipped=18.0 2023-06-23 01:35:45,095 INFO [train.py:996] (2/4) Epoch 8, batch 14550, loss[loss=0.2692, simple_loss=0.3424, pruned_loss=0.09799, over 21721.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3073, pruned_loss=0.08293, over 4255632.59 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:35:47,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1368072.0, ans=0.2 2023-06-23 01:35:48,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1368072.0, ans=0.2 2023-06-23 01:35:54,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=12.0 2023-06-23 01:36:12,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1368132.0, ans=0.0 2023-06-23 01:36:47,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. 
limit=15.0 2023-06-23 01:37:22,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1368372.0, ans=0.125 2023-06-23 01:37:23,777 INFO [train.py:996] (2/4) Epoch 8, batch 14600, loss[loss=0.229, simple_loss=0.3244, pruned_loss=0.06681, over 21734.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3153, pruned_loss=0.08668, over 4260257.58 frames. ], batch size: 247, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:38:38,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 4.393e+02 5.466e+02 7.760e+02 1.223e+03, threshold=1.093e+03, percent-clipped=0.0 2023-06-23 01:39:02,814 INFO [train.py:996] (2/4) Epoch 8, batch 14650, loss[loss=0.2275, simple_loss=0.3034, pruned_loss=0.07582, over 21757.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3217, pruned_loss=0.0875, over 4260263.52 frames. ], batch size: 112, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:40:36,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=15.0 2023-06-23 01:40:41,934 INFO [train.py:996] (2/4) Epoch 8, batch 14700, loss[loss=0.2034, simple_loss=0.2956, pruned_loss=0.05555, over 21502.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.317, pruned_loss=0.08158, over 4265249.05 frames. ], batch size: 471, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:41:16,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.93 vs. limit=15.0 2023-06-23 01:41:40,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1369152.0, ans=0.125 2023-06-23 01:41:55,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1369152.0, ans=0.125 2023-06-23 01:42:00,408 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.444e+02 5.470e+02 7.461e+02 1.083e+03 1.858e+03, threshold=1.492e+03, percent-clipped=24.0 2023-06-23 01:42:04,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1369212.0, ans=0.1 2023-06-23 01:42:18,573 INFO [train.py:996] (2/4) Epoch 8, batch 14750, loss[loss=0.2003, simple_loss=0.2852, pruned_loss=0.05775, over 21384.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3204, pruned_loss=0.08292, over 4267300.30 frames. 
], batch size: 194, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:42:32,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1369272.0, ans=0.0 2023-06-23 01:42:38,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1369332.0, ans=0.1 2023-06-23 01:42:40,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1369332.0, ans=0.0 2023-06-23 01:42:46,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1369332.0, ans=0.125 2023-06-23 01:43:03,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1369392.0, ans=0.2 2023-06-23 01:44:00,031 INFO [train.py:996] (2/4) Epoch 8, batch 14800, loss[loss=0.3728, simple_loss=0.4225, pruned_loss=0.1615, over 21401.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3313, pruned_loss=0.08883, over 4262556.92 frames. ], batch size: 507, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:44:06,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1369572.0, ans=0.0 2023-06-23 01:45:18,469 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.582e+02 5.734e+02 8.027e+02 1.112e+03 2.200e+03, threshold=1.605e+03, percent-clipped=5.0 2023-06-23 01:45:30,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1369812.0, ans=0.125 2023-06-23 01:45:31,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1369812.0, ans=0.125 2023-06-23 01:45:41,498 INFO [train.py:996] (2/4) Epoch 8, batch 14850, loss[loss=0.2199, simple_loss=0.2809, pruned_loss=0.07942, over 21851.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3252, pruned_loss=0.08832, over 4266002.96 frames. ], batch size: 107, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:46:04,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1369932.0, ans=0.125 2023-06-23 01:46:09,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1369932.0, ans=10.0 2023-06-23 01:46:14,602 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.570e-02 2023-06-23 01:46:28,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1369992.0, ans=0.0 2023-06-23 01:47:10,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1370112.0, ans=10.0 2023-06-23 01:47:23,698 INFO [train.py:996] (2/4) Epoch 8, batch 14900, loss[loss=0.2556, simple_loss=0.3264, pruned_loss=0.09247, over 21832.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3267, pruned_loss=0.08954, over 4270137.61 frames. 
], batch size: 282, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:47:25,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1370172.0, ans=0.1 2023-06-23 01:47:34,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-23 01:48:21,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1370292.0, ans=0.0 2023-06-23 01:48:41,945 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.419e+02 4.642e+02 5.851e+02 8.319e+02 1.860e+03, threshold=1.170e+03, percent-clipped=1.0 2023-06-23 01:48:42,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1370412.0, ans=0.125 2023-06-23 01:49:04,625 INFO [train.py:996] (2/4) Epoch 8, batch 14950, loss[loss=0.2582, simple_loss=0.3448, pruned_loss=0.08577, over 21925.00 frames. ], tot_loss[loss=0.2523, simple_loss=0.3274, pruned_loss=0.08859, over 4277221.93 frames. ], batch size: 373, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:49:06,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_ff3.min_abs, batch_count=1370472.0, ans=0.2 2023-06-23 01:49:11,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1370472.0, ans=0.0 2023-06-23 01:49:30,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-23 01:50:03,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1370652.0, ans=0.125 2023-06-23 01:50:40,420 INFO [train.py:996] (2/4) Epoch 8, batch 15000, loss[loss=0.2746, simple_loss=0.3435, pruned_loss=0.1028, over 21689.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3291, pruned_loss=0.08974, over 4279424.87 frames. ], batch size: 389, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 01:50:40,420 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 01:51:00,731 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2539, simple_loss=0.3505, pruned_loss=0.07863, over 1796401.00 frames. 2023-06-23 01:51:00,732 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 01:51:02,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1370772.0, ans=0.0 2023-06-23 01:51:30,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1370832.0, ans=0.125 2023-06-23 01:51:57,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1370952.0, ans=0.0 2023-06-23 01:52:18,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1370952.0, ans=0.1 2023-06-23 01:52:21,541 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.233e+02 4.412e+02 5.546e+02 7.158e+02 1.443e+03, threshold=1.109e+03, percent-clipped=4.0 2023-06-23 01:52:43,829 INFO [train.py:996] (2/4) Epoch 8, batch 15050, loss[loss=0.2478, simple_loss=0.3261, pruned_loss=0.08475, over 21435.00 frames. 
], tot_loss[loss=0.2565, simple_loss=0.331, pruned_loss=0.09095, over 4281384.23 frames. ], batch size: 211, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:52:53,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1371072.0, ans=0.0 2023-06-23 01:53:13,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.13 vs. limit=12.0 2023-06-23 01:53:58,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1371252.0, ans=0.125 2023-06-23 01:54:29,748 INFO [train.py:996] (2/4) Epoch 8, batch 15100, loss[loss=0.2072, simple_loss=0.2824, pruned_loss=0.06605, over 21619.00 frames. ], tot_loss[loss=0.2566, simple_loss=0.3331, pruned_loss=0.09004, over 4283453.19 frames. ], batch size: 112, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:54:41,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1371372.0, ans=0.1 2023-06-23 01:55:03,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1371432.0, ans=0.125 2023-06-23 01:55:28,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-23 01:55:46,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-23 01:55:50,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.664e+02 5.163e+02 6.983e+02 1.038e+03 2.377e+03, threshold=1.397e+03, percent-clipped=16.0 2023-06-23 01:56:13,544 INFO [train.py:996] (2/4) Epoch 8, batch 15150, loss[loss=0.234, simple_loss=0.2888, pruned_loss=0.08964, over 21623.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3274, pruned_loss=0.08953, over 4283970.41 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 4.0 2023-06-23 01:56:24,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1371672.0, ans=0.0 2023-06-23 01:56:30,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1371732.0, ans=0.0 2023-06-23 01:57:48,558 INFO [train.py:996] (2/4) Epoch 8, batch 15200, loss[loss=0.2705, simple_loss=0.3405, pruned_loss=0.1003, over 21387.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3187, pruned_loss=0.08614, over 4287748.00 frames. ], batch size: 507, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 01:57:50,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1371972.0, ans=0.0 2023-06-23 01:58:07,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.11 vs. 
limit=8.0 2023-06-23 01:58:37,854 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 01:59:10,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.032e+02 4.454e+02 6.348e+02 1.099e+03 2.249e+03, threshold=1.270e+03, percent-clipped=13.0 2023-06-23 01:59:21,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1372212.0, ans=0.125 2023-06-23 01:59:29,113 INFO [train.py:996] (2/4) Epoch 8, batch 15250, loss[loss=0.235, simple_loss=0.3001, pruned_loss=0.08493, over 21612.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3134, pruned_loss=0.08415, over 4274763.56 frames. ], batch size: 441, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:00:03,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1372332.0, ans=0.2 2023-06-23 02:00:11,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1372392.0, ans=0.125 2023-06-23 02:00:13,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1372392.0, ans=0.0 2023-06-23 02:00:45,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1372452.0, ans=0.125 2023-06-23 02:00:45,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1372452.0, ans=0.2 2023-06-23 02:01:09,067 INFO [train.py:996] (2/4) Epoch 8, batch 15300, loss[loss=0.2473, simple_loss=0.3221, pruned_loss=0.08625, over 22019.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3158, pruned_loss=0.08664, over 4275599.57 frames. ], batch size: 317, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:01:19,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1372572.0, ans=0.1 2023-06-23 02:01:54,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1372692.0, ans=0.1 2023-06-23 02:02:09,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1372752.0, ans=0.0 2023-06-23 02:02:34,706 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.584e+02 4.847e+02 6.325e+02 8.051e+02 1.474e+03, threshold=1.265e+03, percent-clipped=2.0 2023-06-23 02:02:48,584 INFO [train.py:996] (2/4) Epoch 8, batch 15350, loss[loss=0.2334, simple_loss=0.3258, pruned_loss=0.0705, over 21870.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.322, pruned_loss=0.08832, over 4279916.21 frames. 
], batch size: 316, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:02:50,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1372872.0, ans=0.2 2023-06-23 02:03:18,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1372932.0, ans=0.125 2023-06-23 02:03:28,072 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:03:34,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1372992.0, ans=0.0 2023-06-23 02:04:27,054 INFO [train.py:996] (2/4) Epoch 8, batch 15400, loss[loss=0.2375, simple_loss=0.3113, pruned_loss=0.08186, over 21893.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3241, pruned_loss=0.08756, over 4279821.71 frames. ], batch size: 371, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:04:48,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.43 vs. limit=12.0 2023-06-23 02:05:29,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1373352.0, ans=0.0 2023-06-23 02:05:41,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1373412.0, ans=0.1 2023-06-23 02:05:42,576 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.174e+02 4.574e+02 6.918e+02 9.277e+02 1.952e+03, threshold=1.384e+03, percent-clipped=9.0 2023-06-23 02:05:43,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1373412.0, ans=0.0 2023-06-23 02:06:06,506 INFO [train.py:996] (2/4) Epoch 8, batch 15450, loss[loss=0.2421, simple_loss=0.3101, pruned_loss=0.08708, over 21641.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3213, pruned_loss=0.08694, over 4267537.07 frames. ], batch size: 263, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:06:18,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1373472.0, ans=0.1 2023-06-23 02:06:19,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1373472.0, ans=0.2 2023-06-23 02:07:20,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1373652.0, ans=0.1 2023-06-23 02:07:43,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1373712.0, ans=0.0 2023-06-23 02:07:47,616 INFO [train.py:996] (2/4) Epoch 8, batch 15500, loss[loss=0.2612, simple_loss=0.3386, pruned_loss=0.09192, over 21893.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3227, pruned_loss=0.08628, over 4261498.33 frames. 
], batch size: 371, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:08:22,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1373832.0, ans=0.2 2023-06-23 02:08:37,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1373892.0, ans=0.2 2023-06-23 02:08:57,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1373952.0, ans=0.2 2023-06-23 02:09:08,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1373952.0, ans=0.1 2023-06-23 02:09:14,709 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.796e+02 4.547e+02 5.557e+02 7.236e+02 1.680e+03, threshold=1.111e+03, percent-clipped=1.0 2023-06-23 02:09:15,823 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-23 02:09:19,267 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.23 vs. limit=15.0 2023-06-23 02:09:28,950 INFO [train.py:996] (2/4) Epoch 8, batch 15550, loss[loss=0.1997, simple_loss=0.263, pruned_loss=0.06823, over 21843.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.32, pruned_loss=0.08405, over 4271284.07 frames. ], batch size: 107, lr: 3.73e-03, grad_scale: 8.0 2023-06-23 02:10:12,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1374192.0, ans=0.125 2023-06-23 02:10:16,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1374192.0, ans=0.125 2023-06-23 02:10:49,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1374252.0, ans=0.125 2023-06-23 02:11:03,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1374312.0, ans=0.125 2023-06-23 02:11:07,681 INFO [train.py:996] (2/4) Epoch 8, batch 15600, loss[loss=0.2531, simple_loss=0.3193, pruned_loss=0.09349, over 21503.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3143, pruned_loss=0.08319, over 4270271.32 frames. ], batch size: 441, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:11:39,567 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-23 02:11:45,939 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:11:52,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1374492.0, ans=0.0 2023-06-23 02:12:32,601 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.200e+02 4.456e+02 5.982e+02 8.275e+02 1.817e+03, threshold=1.196e+03, percent-clipped=9.0 2023-06-23 02:12:46,679 INFO [train.py:996] (2/4) Epoch 8, batch 15650, loss[loss=0.2257, simple_loss=0.2925, pruned_loss=0.07942, over 21866.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3122, pruned_loss=0.08267, over 4270389.18 frames. 
], batch size: 98, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:13:50,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1374792.0, ans=0.0 2023-06-23 02:14:31,069 INFO [train.py:996] (2/4) Epoch 8, batch 15700, loss[loss=0.2264, simple_loss=0.2886, pruned_loss=0.08216, over 21211.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3087, pruned_loss=0.08157, over 4253001.74 frames. ], batch size: 159, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:15:40,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1375152.0, ans=0.2 2023-06-23 02:15:50,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.000e+02 4.439e+02 5.550e+02 6.958e+02 1.356e+03, threshold=1.110e+03, percent-clipped=1.0 2023-06-23 02:16:04,697 INFO [train.py:996] (2/4) Epoch 8, batch 15750, loss[loss=0.2068, simple_loss=0.2818, pruned_loss=0.06591, over 21670.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3038, pruned_loss=0.08097, over 4251668.93 frames. ], batch size: 298, lr: 3.73e-03, grad_scale: 16.0 2023-06-23 02:16:25,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1375332.0, ans=0.125 2023-06-23 02:17:19,223 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:17:49,025 INFO [train.py:996] (2/4) Epoch 8, batch 15800, loss[loss=0.256, simple_loss=0.3181, pruned_loss=0.09697, over 21292.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3009, pruned_loss=0.08152, over 4260342.06 frames. ], batch size: 548, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:18:04,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1375632.0, ans=0.1 2023-06-23 02:18:04,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-23 02:18:47,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1375752.0, ans=0.125 2023-06-23 02:18:53,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1375752.0, ans=0.125 2023-06-23 02:19:04,854 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.739e+02 6.417e+02 1.005e+03 2.218e+03, threshold=1.283e+03, percent-clipped=19.0 2023-06-23 02:19:12,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1375812.0, ans=0.1 2023-06-23 02:19:24,002 INFO [train.py:996] (2/4) Epoch 8, batch 15850, loss[loss=0.2172, simple_loss=0.2892, pruned_loss=0.07259, over 21688.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3038, pruned_loss=0.08375, over 4263382.96 frames. ], batch size: 247, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:20:37,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1376052.0, ans=0.125 2023-06-23 02:20:59,214 INFO [train.py:996] (2/4) Epoch 8, batch 15900, loss[loss=0.2054, simple_loss=0.2979, pruned_loss=0.05643, over 21530.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.303, pruned_loss=0.08378, over 4267348.90 frames. 
], batch size: 230, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:21:17,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1376172.0, ans=0.0 2023-06-23 02:21:22,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.31 vs. limit=15.0 2023-06-23 02:21:38,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1376232.0, ans=0.2 2023-06-23 02:22:07,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.38 vs. limit=6.0 2023-06-23 02:22:24,065 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.514e+02 4.482e+02 6.671e+02 9.137e+02 1.402e+03, threshold=1.334e+03, percent-clipped=2.0 2023-06-23 02:22:38,518 INFO [train.py:996] (2/4) Epoch 8, batch 15950, loss[loss=0.2623, simple_loss=0.3382, pruned_loss=0.09326, over 21662.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3038, pruned_loss=0.0808, over 4258985.37 frames. ], batch size: 441, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:23:27,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1376592.0, ans=0.02 2023-06-23 02:23:49,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1376652.0, ans=10.0 2023-06-23 02:24:09,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1376712.0, ans=0.125 2023-06-23 02:24:13,811 INFO [train.py:996] (2/4) Epoch 8, batch 16000, loss[loss=0.2346, simple_loss=0.333, pruned_loss=0.06809, over 21286.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3047, pruned_loss=0.07876, over 4256539.77 frames. ], batch size: 548, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:24:20,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1376772.0, ans=0.125 2023-06-23 02:24:20,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1376772.0, ans=0.0 2023-06-23 02:24:37,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1376832.0, ans=0.125 2023-06-23 02:24:57,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=12.0 2023-06-23 02:25:32,065 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-23 02:25:36,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1377012.0, ans=0.0 2023-06-23 02:25:39,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.869e+02 4.296e+02 5.678e+02 9.703e+02 1.741e+03, threshold=1.136e+03, percent-clipped=11.0 2023-06-23 02:25:42,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1377012.0, ans=0.125 2023-06-23 02:25:53,860 INFO [train.py:996] (2/4) Epoch 8, batch 16050, loss[loss=0.2093, simple_loss=0.2986, pruned_loss=0.06001, over 21639.00 frames. 
], tot_loss[loss=0.2302, simple_loss=0.3065, pruned_loss=0.0769, over 4267810.21 frames. ], batch size: 230, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:25:54,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1377072.0, ans=0.1 2023-06-23 02:27:13,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1377252.0, ans=0.1 2023-06-23 02:27:32,632 INFO [train.py:996] (2/4) Epoch 8, batch 16100, loss[loss=0.2773, simple_loss=0.3405, pruned_loss=0.107, over 21850.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.312, pruned_loss=0.0792, over 4269654.42 frames. ], batch size: 332, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:27:41,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1377372.0, ans=0.0 2023-06-23 02:28:09,632 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-23 02:28:49,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1377552.0, ans=0.125 2023-06-23 02:28:58,827 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.354e+02 5.010e+02 6.146e+02 8.242e+02 2.299e+03, threshold=1.229e+03, percent-clipped=9.0 2023-06-23 02:29:11,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1377672.0, ans=0.0 2023-06-23 02:29:12,555 INFO [train.py:996] (2/4) Epoch 8, batch 16150, loss[loss=0.2686, simple_loss=0.3416, pruned_loss=0.09775, over 21483.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3121, pruned_loss=0.08202, over 4278892.60 frames. ], batch size: 548, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:29:14,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1377672.0, ans=0.1 2023-06-23 02:29:36,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1377732.0, ans=0.125 2023-06-23 02:29:46,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1377732.0, ans=0.05 2023-06-23 02:29:49,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1377732.0, ans=0.125 2023-06-23 02:29:50,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1377732.0, ans=0.2 2023-06-23 02:30:00,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1377792.0, ans=0.125 2023-06-23 02:30:53,177 INFO [train.py:996] (2/4) Epoch 8, batch 16200, loss[loss=0.2627, simple_loss=0.3381, pruned_loss=0.09367, over 21497.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3158, pruned_loss=0.08298, over 4283226.71 frames. 
], batch size: 194, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:31:04,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1377972.0, ans=0.0 2023-06-23 02:31:06,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1377972.0, ans=0.04949747468305833 2023-06-23 02:32:05,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1378152.0, ans=10.0 2023-06-23 02:32:05,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1378152.0, ans=10.0 2023-06-23 02:32:14,990 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.293e+02 4.981e+02 6.694e+02 1.065e+03 1.723e+03, threshold=1.339e+03, percent-clipped=15.0 2023-06-23 02:32:27,731 INFO [train.py:996] (2/4) Epoch 8, batch 16250, loss[loss=0.214, simple_loss=0.2848, pruned_loss=0.07159, over 21316.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3154, pruned_loss=0.08272, over 4284340.55 frames. ], batch size: 176, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:32:52,928 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:32:55,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.67 vs. limit=15.0 2023-06-23 02:33:01,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-06-23 02:33:18,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-23 02:33:35,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1378452.0, ans=0.125 2023-06-23 02:34:06,535 INFO [train.py:996] (2/4) Epoch 8, batch 16300, loss[loss=0.2479, simple_loss=0.3255, pruned_loss=0.08511, over 21353.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3105, pruned_loss=0.07884, over 4276808.87 frames. ], batch size: 471, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:34:50,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1378632.0, ans=0.125 2023-06-23 02:35:14,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1378752.0, ans=0.0 2023-06-23 02:35:22,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1378752.0, ans=0.125 2023-06-23 02:35:35,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.087e+02 4.179e+02 5.648e+02 7.276e+02 1.488e+03, threshold=1.130e+03, percent-clipped=3.0 2023-06-23 02:35:39,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1378812.0, ans=0.0 2023-06-23 02:35:53,652 INFO [train.py:996] (2/4) Epoch 8, batch 16350, loss[loss=0.3298, simple_loss=0.3913, pruned_loss=0.1341, over 21795.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3108, pruned_loss=0.07921, over 4273392.05 frames. 
], batch size: 124, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:36:02,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1378872.0, ans=0.95 2023-06-23 02:36:17,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-23 02:36:44,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-23 02:37:20,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379112.0, ans=0.1 2023-06-23 02:37:33,151 INFO [train.py:996] (2/4) Epoch 8, batch 16400, loss[loss=0.2259, simple_loss=0.2983, pruned_loss=0.07673, over 21848.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3155, pruned_loss=0.0816, over 4277190.20 frames. ], batch size: 298, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:38:00,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1379232.0, ans=0.0 2023-06-23 02:38:35,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=15.0 2023-06-23 02:38:38,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1379352.0, ans=0.2 2023-06-23 02:38:55,317 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.176e+02 5.676e+02 8.447e+02 1.118e+03 2.154e+03, threshold=1.689e+03, percent-clipped=24.0 2023-06-23 02:38:59,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1379412.0, ans=0.1 2023-06-23 02:39:10,976 INFO [train.py:996] (2/4) Epoch 8, batch 16450, loss[loss=0.2608, simple_loss=0.3303, pruned_loss=0.09565, over 21876.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.316, pruned_loss=0.0833, over 4288482.83 frames. ], batch size: 351, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:39:11,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1379472.0, ans=0.0 2023-06-23 02:39:16,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1379472.0, ans=0.1 2023-06-23 02:39:18,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1379472.0, ans=0.125 2023-06-23 02:39:30,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1379532.0, ans=0.07 2023-06-23 02:40:02,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1379592.0, ans=0.1 2023-06-23 02:40:03,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1379592.0, ans=0.125 2023-06-23 02:40:08,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1379652.0, ans=0.125 2023-06-23 02:40:50,696 INFO [train.py:996] (2/4) Epoch 8, batch 16500, loss[loss=0.285, simple_loss=0.3565, pruned_loss=0.1067, over 21725.00 frames. 
], tot_loss[loss=0.2407, simple_loss=0.3148, pruned_loss=0.08333, over 4286206.17 frames. ], batch size: 441, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:41:15,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1379832.0, ans=10.0 2023-06-23 02:41:23,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1379832.0, ans=0.125 2023-06-23 02:41:28,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1379832.0, ans=0.125 2023-06-23 02:42:06,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-23 02:42:20,404 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.553e+02 5.237e+02 7.941e+02 1.285e+03 2.739e+03, threshold=1.588e+03, percent-clipped=14.0 2023-06-23 02:42:36,154 INFO [train.py:996] (2/4) Epoch 8, batch 16550, loss[loss=0.2381, simple_loss=0.3359, pruned_loss=0.07018, over 21271.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3144, pruned_loss=0.08123, over 4277621.17 frames. ], batch size: 548, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:42:57,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1380132.0, ans=0.125 2023-06-23 02:43:12,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1380192.0, ans=0.125 2023-06-23 02:43:13,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1380192.0, ans=0.0 2023-06-23 02:44:08,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1380312.0, ans=0.2 2023-06-23 02:44:21,919 INFO [train.py:996] (2/4) Epoch 8, batch 16600, loss[loss=0.2746, simple_loss=0.3574, pruned_loss=0.09594, over 21255.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3196, pruned_loss=0.08278, over 4280704.61 frames. ], batch size: 159, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:44:45,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380432.0, ans=0.1 2023-06-23 02:45:49,585 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:45:52,271 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.266e+02 4.965e+02 7.305e+02 1.134e+03 2.257e+03, threshold=1.461e+03, percent-clipped=8.0 2023-06-23 02:46:03,720 INFO [train.py:996] (2/4) Epoch 8, batch 16650, loss[loss=0.2845, simple_loss=0.3525, pruned_loss=0.1082, over 21314.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3289, pruned_loss=0.08584, over 4279983.61 frames. 
], batch size: 176, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:46:24,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1380732.0, ans=0.2 2023-06-23 02:46:43,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1380732.0, ans=6.0 2023-06-23 02:46:52,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1380792.0, ans=0.1 2023-06-23 02:47:24,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1380852.0, ans=0.025 2023-06-23 02:47:45,727 INFO [train.py:996] (2/4) Epoch 8, batch 16700, loss[loss=0.193, simple_loss=0.2561, pruned_loss=0.06498, over 21404.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.331, pruned_loss=0.08759, over 4284853.11 frames. ], batch size: 194, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:47:56,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1380972.0, ans=0.125 2023-06-23 02:48:39,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1381092.0, ans=0.125 2023-06-23 02:48:44,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-23 02:49:09,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1381212.0, ans=0.125 2023-06-23 02:49:21,410 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 4.736e+02 6.147e+02 8.645e+02 1.656e+03, threshold=1.229e+03, percent-clipped=2.0 2023-06-23 02:49:28,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1381212.0, ans=0.1 2023-06-23 02:49:38,533 INFO [train.py:996] (2/4) Epoch 8, batch 16750, loss[loss=0.2125, simple_loss=0.33, pruned_loss=0.04746, over 20812.00 frames. ], tot_loss[loss=0.256, simple_loss=0.3331, pruned_loss=0.0894, over 4276487.56 frames. ], batch size: 608, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:49:57,194 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-23 02:50:04,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1381332.0, ans=0.2 2023-06-23 02:50:31,594 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.18 vs. limit=12.0 2023-06-23 02:51:25,391 INFO [train.py:996] (2/4) Epoch 8, batch 16800, loss[loss=0.2455, simple_loss=0.3096, pruned_loss=0.09068, over 21673.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3353, pruned_loss=0.08869, over 4272092.95 frames. 
], batch size: 263, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:51:25,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1381572.0, ans=0.125 2023-06-23 02:51:32,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1381572.0, ans=0.125 2023-06-23 02:51:40,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.46 vs. limit=15.0 2023-06-23 02:51:44,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1381632.0, ans=0.0 2023-06-23 02:51:46,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1381632.0, ans=0.125 2023-06-23 02:51:56,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1381632.0, ans=0.125 2023-06-23 02:52:01,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.83 vs. limit=15.0 2023-06-23 02:52:12,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1381692.0, ans=0.2 2023-06-23 02:52:24,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1381752.0, ans=0.125 2023-06-23 02:52:29,749 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:52:47,891 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.159e+02 4.585e+02 6.307e+02 8.952e+02 1.873e+03, threshold=1.261e+03, percent-clipped=4.0 2023-06-23 02:53:03,692 INFO [train.py:996] (2/4) Epoch 8, batch 16850, loss[loss=0.2252, simple_loss=0.3143, pruned_loss=0.06804, over 18304.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3308, pruned_loss=0.08908, over 4274812.36 frames. ], batch size: 63, lr: 3.72e-03, grad_scale: 32.0 2023-06-23 02:53:13,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1381872.0, ans=0.125 2023-06-23 02:53:39,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.51 vs. limit=15.0 2023-06-23 02:53:39,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.87 vs. limit=15.0 2023-06-23 02:53:40,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1381992.0, ans=0.5 2023-06-23 02:53:51,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1381992.0, ans=0.2 2023-06-23 02:54:11,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1382052.0, ans=0.0 2023-06-23 02:54:46,752 INFO [train.py:996] (2/4) Epoch 8, batch 16900, loss[loss=0.2275, simple_loss=0.2934, pruned_loss=0.08082, over 21279.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3248, pruned_loss=0.08687, over 4284409.42 frames. 
], batch size: 144, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:55:08,245 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.86 vs. limit=15.0 2023-06-23 02:55:23,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1382292.0, ans=0.04949747468305833 2023-06-23 02:56:07,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.361e+02 4.825e+02 6.497e+02 9.276e+02 2.744e+03, threshold=1.299e+03, percent-clipped=9.0 2023-06-23 02:56:20,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1382472.0, ans=0.125 2023-06-23 02:56:26,516 INFO [train.py:996] (2/4) Epoch 8, batch 16950, loss[loss=0.2486, simple_loss=0.337, pruned_loss=0.08008, over 19958.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.318, pruned_loss=0.08505, over 4278988.56 frames. ], batch size: 703, lr: 3.72e-03, grad_scale: 16.0 2023-06-23 02:56:37,967 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 02:56:48,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1382532.0, ans=0.0 2023-06-23 02:57:27,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1382652.0, ans=0.125 2023-06-23 02:58:05,776 INFO [train.py:996] (2/4) Epoch 8, batch 17000, loss[loss=0.2181, simple_loss=0.3123, pruned_loss=0.06198, over 21065.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3173, pruned_loss=0.08513, over 4278496.00 frames. ], batch size: 607, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 02:58:12,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1382772.0, ans=0.125 2023-06-23 02:59:36,304 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.719e+02 6.116e+02 8.558e+02 1.129e+03 2.527e+03, threshold=1.712e+03, percent-clipped=16.0 2023-06-23 02:59:38,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1383012.0, ans=0.2 2023-06-23 02:59:46,391 INFO [train.py:996] (2/4) Epoch 8, batch 17050, loss[loss=0.2624, simple_loss=0.3451, pruned_loss=0.08985, over 21779.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3229, pruned_loss=0.08707, over 4278786.85 frames. ], batch size: 414, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 02:59:59,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1383072.0, ans=0.0 2023-06-23 03:00:35,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1383192.0, ans=0.125 2023-06-23 03:00:37,154 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-23 03:01:20,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1383312.0, ans=0.0 2023-06-23 03:01:27,110 INFO [train.py:996] (2/4) Epoch 8, batch 17100, loss[loss=0.2111, simple_loss=0.276, pruned_loss=0.0731, over 21497.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3236, pruned_loss=0.08889, over 4283326.50 frames. 
], batch size: 212, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:01:27,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1383372.0, ans=0.0 2023-06-23 03:01:57,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1383492.0, ans=0.2 2023-06-23 03:02:16,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383492.0, ans=0.1 2023-06-23 03:02:52,338 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.295e+02 4.559e+02 5.897e+02 8.703e+02 1.483e+03, threshold=1.179e+03, percent-clipped=0.0 2023-06-23 03:03:01,682 INFO [train.py:996] (2/4) Epoch 8, batch 17150, loss[loss=0.2004, simple_loss=0.2697, pruned_loss=0.06551, over 21348.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3195, pruned_loss=0.08829, over 4284476.04 frames. ], batch size: 159, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:03:41,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1383792.0, ans=0.05 2023-06-23 03:03:55,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1383792.0, ans=0.1 2023-06-23 03:03:59,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1383852.0, ans=0.125 2023-06-23 03:04:42,739 INFO [train.py:996] (2/4) Epoch 8, batch 17200, loss[loss=0.2873, simple_loss=0.3526, pruned_loss=0.111, over 21543.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3199, pruned_loss=0.0882, over 4281097.53 frames. ], batch size: 389, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:04:48,101 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:05:25,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1384092.0, ans=0.125 2023-06-23 03:05:52,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1384152.0, ans=0.125 2023-06-23 03:06:13,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.394e+02 4.493e+02 5.793e+02 8.418e+02 1.650e+03, threshold=1.159e+03, percent-clipped=7.0 2023-06-23 03:06:16,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=1384212.0, ans=0.1 2023-06-23 03:06:23,177 INFO [train.py:996] (2/4) Epoch 8, batch 17250, loss[loss=0.2686, simple_loss=0.353, pruned_loss=0.09211, over 21724.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3225, pruned_loss=0.0895, over 4284367.17 frames. ], batch size: 124, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:07:00,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1384332.0, ans=0.0 2023-06-23 03:07:05,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.00 vs. 
limit=15.0 2023-06-23 03:07:24,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1384392.0, ans=0.2 2023-06-23 03:07:35,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1384452.0, ans=0.0 2023-06-23 03:07:39,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1384452.0, ans=0.125 2023-06-23 03:07:41,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1384452.0, ans=0.0 2023-06-23 03:08:09,826 INFO [train.py:996] (2/4) Epoch 8, batch 17300, loss[loss=0.2152, simple_loss=0.2567, pruned_loss=0.0868, over 20086.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.3288, pruned_loss=0.09232, over 4281800.97 frames. ], batch size: 705, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:09:41,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.372e+02 4.842e+02 6.354e+02 8.974e+02 2.324e+03, threshold=1.271e+03, percent-clipped=7.0 2023-06-23 03:09:56,680 INFO [train.py:996] (2/4) Epoch 8, batch 17350, loss[loss=0.2133, simple_loss=0.2937, pruned_loss=0.06643, over 21809.00 frames. ], tot_loss[loss=0.2556, simple_loss=0.3285, pruned_loss=0.09136, over 4284719.68 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:10:12,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.79 vs. limit=12.0 2023-06-23 03:10:14,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1384872.0, ans=0.0 2023-06-23 03:10:20,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1384932.0, ans=0.125 2023-06-23 03:10:20,772 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-06-23 03:10:33,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1384992.0, ans=0.1 2023-06-23 03:10:41,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1384992.0, ans=0.125 2023-06-23 03:11:11,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-23 03:11:33,388 INFO [train.py:996] (2/4) Epoch 8, batch 17400, loss[loss=0.2389, simple_loss=0.3174, pruned_loss=0.08023, over 20665.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3262, pruned_loss=0.08827, over 4286271.96 frames. ], batch size: 607, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:12:47,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. 
limit=15.0 2023-06-23 03:12:56,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1385412.0, ans=0.125 2023-06-23 03:13:07,296 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.204e+02 4.616e+02 6.487e+02 8.880e+02 2.609e+03, threshold=1.297e+03, percent-clipped=10.0 2023-06-23 03:13:08,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-23 03:13:19,785 INFO [train.py:996] (2/4) Epoch 8, batch 17450, loss[loss=0.199, simple_loss=0.2839, pruned_loss=0.05708, over 21767.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3246, pruned_loss=0.08608, over 4282808.01 frames. ], batch size: 282, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:13:27,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1385472.0, ans=0.0 2023-06-23 03:14:09,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1385592.0, ans=0.0 2023-06-23 03:14:25,837 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.90 vs. limit=15.0 2023-06-23 03:14:57,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1385712.0, ans=0.0 2023-06-23 03:14:59,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1385772.0, ans=0.125 2023-06-23 03:15:00,649 INFO [train.py:996] (2/4) Epoch 8, batch 17500, loss[loss=0.2264, simple_loss=0.2988, pruned_loss=0.07697, over 21874.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3203, pruned_loss=0.08367, over 4280919.67 frames. ], batch size: 351, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:15:23,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1385832.0, ans=0.05 2023-06-23 03:16:31,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.152e+02 4.278e+02 5.524e+02 8.928e+02 1.678e+03, threshold=1.105e+03, percent-clipped=3.0 2023-06-23 03:16:40,126 INFO [train.py:996] (2/4) Epoch 8, batch 17550, loss[loss=0.2157, simple_loss=0.3097, pruned_loss=0.06081, over 21208.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3199, pruned_loss=0.08227, over 4270625.99 frames. ], batch size: 143, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:17:19,598 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:18:18,846 INFO [train.py:996] (2/4) Epoch 8, batch 17600, loss[loss=0.2389, simple_loss=0.3243, pruned_loss=0.07677, over 21826.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3221, pruned_loss=0.08272, over 4270672.84 frames. ], batch size: 118, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:19:42,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.71 vs. limit=15.0 2023-06-23 03:19:48,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.108e+02 4.735e+02 6.263e+02 8.398e+02 1.704e+03, threshold=1.253e+03, percent-clipped=10.0 2023-06-23 03:19:55,688 INFO [train.py:996] (2/4) Epoch 8, batch 17650, loss[loss=0.1946, simple_loss=0.2607, pruned_loss=0.06426, over 21325.00 frames. 
], tot_loss[loss=0.2445, simple_loss=0.3212, pruned_loss=0.0839, over 4274330.40 frames. ], batch size: 211, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:20:07,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1386672.0, ans=0.125 2023-06-23 03:20:29,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1386732.0, ans=0.09899494936611666 2023-06-23 03:20:38,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1386792.0, ans=0.0 2023-06-23 03:21:03,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1386852.0, ans=0.04949747468305833 2023-06-23 03:21:16,355 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:21:27,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.59 vs. limit=6.0 2023-06-23 03:21:36,246 INFO [train.py:996] (2/4) Epoch 8, batch 17700, loss[loss=0.1806, simple_loss=0.2382, pruned_loss=0.06146, over 21509.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3162, pruned_loss=0.08122, over 4262575.36 frames. ], batch size: 212, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:21:40,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1386972.0, ans=0.125 2023-06-23 03:22:17,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1387092.0, ans=0.125 2023-06-23 03:22:29,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1387092.0, ans=0.125 2023-06-23 03:22:50,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1387152.0, ans=0.1 2023-06-23 03:22:53,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1387152.0, ans=0.05 2023-06-23 03:23:11,340 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.033e+02 4.293e+02 5.499e+02 1.006e+03 2.228e+03, threshold=1.100e+03, percent-clipped=12.0 2023-06-23 03:23:17,905 INFO [train.py:996] (2/4) Epoch 8, batch 17750, loss[loss=0.2608, simple_loss=0.3358, pruned_loss=0.09296, over 21988.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3234, pruned_loss=0.08479, over 4265220.06 frames. ], batch size: 317, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:23:37,061 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=22.5 2023-06-23 03:24:00,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1387392.0, ans=0.125 2023-06-23 03:24:18,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1387392.0, ans=0.125 2023-06-23 03:24:40,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. 
limit=22.5 2023-06-23 03:24:46,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1387512.0, ans=0.0 2023-06-23 03:24:58,717 INFO [train.py:996] (2/4) Epoch 8, batch 17800, loss[loss=0.2233, simple_loss=0.292, pruned_loss=0.0773, over 21438.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3234, pruned_loss=0.08484, over 4259125.37 frames. ], batch size: 194, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:25:04,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1387572.0, ans=0.125 2023-06-23 03:25:06,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.18 vs. limit=15.0 2023-06-23 03:25:14,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1387632.0, ans=0.2 2023-06-23 03:25:19,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1387632.0, ans=0.125 2023-06-23 03:25:58,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.08 vs. limit=22.5 2023-06-23 03:26:03,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1387752.0, ans=0.125 2023-06-23 03:26:31,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1387812.0, ans=0.0 2023-06-23 03:26:32,495 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.730e+02 4.467e+02 6.036e+02 8.319e+02 2.220e+03, threshold=1.207e+03, percent-clipped=14.0 2023-06-23 03:26:39,342 INFO [train.py:996] (2/4) Epoch 8, batch 17850, loss[loss=0.273, simple_loss=0.3394, pruned_loss=0.1034, over 21988.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3235, pruned_loss=0.08556, over 4265656.86 frames. ], batch size: 317, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:26:44,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1387872.0, ans=0.0 2023-06-23 03:27:09,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1387932.0, ans=0.07 2023-06-23 03:27:29,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-23 03:27:52,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1388112.0, ans=0.125 2023-06-23 03:28:12,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1388112.0, ans=0.0 2023-06-23 03:28:16,060 INFO [train.py:996] (2/4) Epoch 8, batch 17900, loss[loss=0.2799, simple_loss=0.3448, pruned_loss=0.1075, over 21339.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3287, pruned_loss=0.08715, over 4266472.19 frames. 
], batch size: 549, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:28:34,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1388172.0, ans=0.1 2023-06-23 03:28:47,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1388232.0, ans=0.1 2023-06-23 03:28:47,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=12.0 2023-06-23 03:28:49,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.13 vs. limit=12.0 2023-06-23 03:29:54,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1388412.0, ans=0.2 2023-06-23 03:30:00,905 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.268e+02 4.650e+02 5.974e+02 7.368e+02 1.876e+03, threshold=1.195e+03, percent-clipped=6.0 2023-06-23 03:30:11,643 INFO [train.py:996] (2/4) Epoch 8, batch 17950, loss[loss=0.2152, simple_loss=0.3008, pruned_loss=0.0648, over 21647.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3295, pruned_loss=0.0841, over 4267860.48 frames. ], batch size: 247, lr: 3.71e-03, grad_scale: 16.0 2023-06-23 03:30:43,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1388592.0, ans=0.125 2023-06-23 03:31:03,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.11 vs. limit=10.0 2023-06-23 03:31:08,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1388652.0, ans=0.0 2023-06-23 03:31:09,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-23 03:31:18,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1388712.0, ans=0.035 2023-06-23 03:31:49,993 INFO [train.py:996] (2/4) Epoch 8, batch 18000, loss[loss=0.2214, simple_loss=0.2929, pruned_loss=0.07499, over 21616.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.323, pruned_loss=0.08199, over 4263046.67 frames. ], batch size: 332, lr: 3.71e-03, grad_scale: 32.0 2023-06-23 03:31:49,993 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 03:32:06,874 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2644, simple_loss=0.3593, pruned_loss=0.08473, over 1796401.00 frames. 
2023-06-23 03:32:06,875 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 03:32:23,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1388832.0, ans=0.125 2023-06-23 03:32:31,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1388832.0, ans=0.125 2023-06-23 03:32:36,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1388832.0, ans=0.125 2023-06-23 03:33:37,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1389012.0, ans=0.125 2023-06-23 03:33:43,872 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.172e+02 4.301e+02 6.081e+02 8.972e+02 1.795e+03, threshold=1.216e+03, percent-clipped=14.0 2023-06-23 03:33:46,945 INFO [train.py:996] (2/4) Epoch 8, batch 18050, loss[loss=0.252, simple_loss=0.3159, pruned_loss=0.09407, over 21892.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3178, pruned_loss=0.08116, over 4263101.71 frames. ], batch size: 317, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:34:25,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.49 vs. limit=6.0 2023-06-23 03:34:43,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-23 03:35:26,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1389372.0, ans=0.05 2023-06-23 03:35:28,042 INFO [train.py:996] (2/4) Epoch 8, batch 18100, loss[loss=0.2256, simple_loss=0.2959, pruned_loss=0.07767, over 21196.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3184, pruned_loss=0.08292, over 4266349.69 frames. ], batch size: 176, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:35:30,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1389372.0, ans=0.125 2023-06-23 03:35:34,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1389372.0, ans=0.1 2023-06-23 03:36:39,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1389552.0, ans=0.125 2023-06-23 03:36:47,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.63 vs. limit=12.0 2023-06-23 03:37:04,990 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.338e+02 4.664e+02 6.579e+02 9.782e+02 2.052e+03, threshold=1.316e+03, percent-clipped=11.0 2023-06-23 03:37:06,647 INFO [train.py:996] (2/4) Epoch 8, batch 18150, loss[loss=0.253, simple_loss=0.3283, pruned_loss=0.08882, over 21903.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3189, pruned_loss=0.08259, over 4253505.50 frames. 
], batch size: 373, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:37:15,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1389672.0, ans=0.125 2023-06-23 03:37:38,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1389732.0, ans=10.0 2023-06-23 03:38:43,070 INFO [train.py:996] (2/4) Epoch 8, batch 18200, loss[loss=0.25, simple_loss=0.3041, pruned_loss=0.09799, over 21718.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.314, pruned_loss=0.08272, over 4264910.39 frames. ], batch size: 112, lr: 3.71e-03, grad_scale: 8.0 2023-06-23 03:39:18,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.02 vs. limit=22.5 2023-06-23 03:39:27,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-23 03:39:36,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1390092.0, ans=0.0 2023-06-23 03:39:36,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1390092.0, ans=0.2 2023-06-23 03:40:05,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1390212.0, ans=0.0 2023-06-23 03:40:17,371 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.155e+02 4.842e+02 6.713e+02 9.646e+02 2.158e+03, threshold=1.343e+03, percent-clipped=10.0 2023-06-23 03:40:19,039 INFO [train.py:996] (2/4) Epoch 8, batch 18250, loss[loss=0.2159, simple_loss=0.2835, pruned_loss=0.07411, over 21819.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3066, pruned_loss=0.08054, over 4274273.05 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:40:48,821 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:41:03,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1390392.0, ans=0.1 2023-06-23 03:41:56,235 INFO [train.py:996] (2/4) Epoch 8, batch 18300, loss[loss=0.2176, simple_loss=0.2797, pruned_loss=0.07775, over 21872.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3055, pruned_loss=0.07978, over 4263706.35 frames. ], batch size: 98, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:41:56,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1390572.0, ans=0.125 2023-06-23 03:42:20,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1390632.0, ans=0.125 2023-06-23 03:42:50,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1390752.0, ans=0.2 2023-06-23 03:43:32,354 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.212e+02 4.888e+02 7.173e+02 1.170e+03 2.600e+03, threshold=1.435e+03, percent-clipped=18.0 2023-06-23 03:43:34,032 INFO [train.py:996] (2/4) Epoch 8, batch 18350, loss[loss=0.2222, simple_loss=0.322, pruned_loss=0.06118, over 21398.00 frames. 
], tot_loss[loss=0.236, simple_loss=0.3124, pruned_loss=0.07977, over 4260628.63 frames. ], batch size: 211, lr: 3.70e-03, grad_scale: 8.0 2023-06-23 03:43:43,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-23 03:43:47,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1390872.0, ans=0.0 2023-06-23 03:44:23,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1390992.0, ans=0.125 2023-06-23 03:44:31,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1391052.0, ans=0.1 2023-06-23 03:45:12,224 INFO [train.py:996] (2/4) Epoch 8, batch 18400, loss[loss=0.2021, simple_loss=0.2784, pruned_loss=0.06285, over 21556.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3098, pruned_loss=0.07864, over 4259753.63 frames. ], batch size: 195, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:45:25,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1391172.0, ans=0.125 2023-06-23 03:45:50,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1391292.0, ans=0.1 2023-06-23 03:46:37,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1391412.0, ans=0.0 2023-06-23 03:46:46,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.039e+02 4.306e+02 6.034e+02 8.770e+02 2.014e+03, threshold=1.207e+03, percent-clipped=5.0 2023-06-23 03:46:48,007 INFO [train.py:996] (2/4) Epoch 8, batch 18450, loss[loss=0.2034, simple_loss=0.2841, pruned_loss=0.06132, over 21682.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3057, pruned_loss=0.07531, over 4266429.55 frames. ], batch size: 247, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:46:50,000 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 03:46:51,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1391472.0, ans=0.125 2023-06-23 03:46:59,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.34 vs. limit=22.5 2023-06-23 03:47:19,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1391592.0, ans=0.125 2023-06-23 03:47:26,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1391592.0, ans=0.2 2023-06-23 03:47:43,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1391652.0, ans=0.0 2023-06-23 03:48:00,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1391652.0, ans=0.0 2023-06-23 03:48:25,035 INFO [train.py:996] (2/4) Epoch 8, batch 18500, loss[loss=0.1825, simple_loss=0.2526, pruned_loss=0.05625, over 17990.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2997, pruned_loss=0.07393, over 4264065.20 frames. 
], batch size: 68, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:48:34,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1391772.0, ans=0.1 2023-06-23 03:49:02,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1391892.0, ans=0.125 2023-06-23 03:49:31,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1391952.0, ans=0.0 2023-06-23 03:50:02,673 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.977e+02 4.135e+02 5.661e+02 7.712e+02 1.457e+03, threshold=1.132e+03, percent-clipped=3.0 2023-06-23 03:50:04,105 INFO [train.py:996] (2/4) Epoch 8, batch 18550, loss[loss=0.2243, simple_loss=0.2875, pruned_loss=0.0806, over 21532.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2969, pruned_loss=0.07316, over 4260366.84 frames. ], batch size: 414, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:50:12,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1392072.0, ans=0.0 2023-06-23 03:51:43,198 INFO [train.py:996] (2/4) Epoch 8, batch 18600, loss[loss=0.2012, simple_loss=0.2647, pruned_loss=0.06881, over 20104.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2963, pruned_loss=0.07384, over 4265458.62 frames. ], batch size: 703, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:51:46,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1392372.0, ans=0.1 2023-06-23 03:52:00,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1392432.0, ans=0.2 2023-06-23 03:52:03,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1392432.0, ans=0.1 2023-06-23 03:52:07,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-23 03:52:08,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1392432.0, ans=0.04949747468305833 2023-06-23 03:52:52,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.01 vs. limit=15.0 2023-06-23 03:53:17,922 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.135e+02 5.174e+02 7.950e+02 1.061e+03 1.906e+03, threshold=1.590e+03, percent-clipped=19.0 2023-06-23 03:53:19,647 INFO [train.py:996] (2/4) Epoch 8, batch 18650, loss[loss=0.2076, simple_loss=0.2805, pruned_loss=0.06739, over 21814.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.295, pruned_loss=0.07412, over 4258061.26 frames. ], batch size: 352, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:54:31,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.66 vs. limit=15.0 2023-06-23 03:54:34,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1392912.0, ans=0.125 2023-06-23 03:54:56,234 INFO [train.py:996] (2/4) Epoch 8, batch 18700, loss[loss=0.2083, simple_loss=0.2807, pruned_loss=0.06795, over 21739.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2925, pruned_loss=0.07525, over 4265727.20 frames. 
], batch size: 112, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:55:10,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1393032.0, ans=0.2 2023-06-23 03:55:27,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1393092.0, ans=0.0 2023-06-23 03:55:58,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1393152.0, ans=0.125 2023-06-23 03:56:31,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.84 vs. limit=6.0 2023-06-23 03:56:32,193 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.347e+02 4.104e+02 5.076e+02 6.613e+02 1.727e+03, threshold=1.015e+03, percent-clipped=1.0 2023-06-23 03:56:33,802 INFO [train.py:996] (2/4) Epoch 8, batch 18750, loss[loss=0.2418, simple_loss=0.3157, pruned_loss=0.08395, over 21634.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2965, pruned_loss=0.07803, over 4258924.03 frames. ], batch size: 230, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 03:56:47,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1393332.0, ans=0.125 2023-06-23 03:57:07,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1393392.0, ans=0.125 2023-06-23 03:58:04,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1393512.0, ans=0.1 2023-06-23 03:58:11,913 INFO [train.py:996] (2/4) Epoch 8, batch 18800, loss[loss=0.1835, simple_loss=0.2404, pruned_loss=0.06331, over 16246.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3017, pruned_loss=0.07917, over 4244500.70 frames. ], batch size: 60, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 03:58:19,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=10.53 vs. limit=10.0 2023-06-23 03:59:34,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.93 vs. limit=6.0 2023-06-23 03:59:48,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.636e+02 4.595e+02 6.296e+02 8.883e+02 2.093e+03, threshold=1.259e+03, percent-clipped=21.0 2023-06-23 03:59:50,061 INFO [train.py:996] (2/4) Epoch 8, batch 18850, loss[loss=0.2399, simple_loss=0.3073, pruned_loss=0.08625, over 21498.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2963, pruned_loss=0.07462, over 4240145.48 frames. ], batch size: 442, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:00:06,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=22.5 2023-06-23 04:01:21,380 INFO [train.py:996] (2/4) Epoch 8, batch 18900, loss[loss=0.2173, simple_loss=0.2792, pruned_loss=0.07775, over 21845.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2942, pruned_loss=0.07587, over 4254841.03 frames. 
], batch size: 373, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:01:29,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1394172.0, ans=0.2 2023-06-23 04:01:48,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1394232.0, ans=0.0 2023-06-23 04:01:53,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1394232.0, ans=0.125 2023-06-23 04:02:58,737 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.232e+02 4.494e+02 5.345e+02 6.718e+02 1.434e+03, threshold=1.069e+03, percent-clipped=2.0 2023-06-23 04:03:00,404 INFO [train.py:996] (2/4) Epoch 8, batch 18950, loss[loss=0.2547, simple_loss=0.3468, pruned_loss=0.0813, over 21733.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2947, pruned_loss=0.07749, over 4264630.84 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:03:07,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1394472.0, ans=0.035 2023-06-23 04:03:16,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1394532.0, ans=0.1 2023-06-23 04:03:16,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-23 04:03:31,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1394592.0, ans=0.0 2023-06-23 04:03:47,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1394592.0, ans=0.125 2023-06-23 04:03:54,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1394592.0, ans=0.125 2023-06-23 04:04:02,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.whiten.whitening_limit, batch_count=1394652.0, ans=12.0 2023-06-23 04:04:31,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1394712.0, ans=0.125 2023-06-23 04:04:39,989 INFO [train.py:996] (2/4) Epoch 8, batch 19000, loss[loss=0.2811, simple_loss=0.3515, pruned_loss=0.1054, over 21750.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3053, pruned_loss=0.07935, over 4270801.55 frames. ], batch size: 332, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:04:45,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. 
limit=15.0 2023-06-23 04:05:04,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1394832.0, ans=0.0 2023-06-23 04:05:07,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1394832.0, ans=0.0 2023-06-23 04:05:13,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1394832.0, ans=0.125 2023-06-23 04:06:11,613 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.757e+02 5.109e+02 8.049e+02 1.091e+03 2.389e+03, threshold=1.610e+03, percent-clipped=25.0 2023-06-23 04:06:13,345 INFO [train.py:996] (2/4) Epoch 8, batch 19050, loss[loss=0.2471, simple_loss=0.3007, pruned_loss=0.09676, over 21348.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3104, pruned_loss=0.08295, over 4275228.74 frames. ], batch size: 176, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:06:18,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1395072.0, ans=0.125 2023-06-23 04:06:42,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1395132.0, ans=0.125 2023-06-23 04:06:45,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=22.5 2023-06-23 04:07:53,275 INFO [train.py:996] (2/4) Epoch 8, batch 19100, loss[loss=0.2274, simple_loss=0.2897, pruned_loss=0.08256, over 21780.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3096, pruned_loss=0.08387, over 4272327.09 frames. ], batch size: 371, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:08:18,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1395432.0, ans=0.125 2023-06-23 04:08:23,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1395432.0, ans=0.0 2023-06-23 04:08:43,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1395492.0, ans=0.1 2023-06-23 04:09:21,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1395612.0, ans=0.1 2023-06-23 04:09:25,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1395612.0, ans=0.125 2023-06-23 04:09:33,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.382e+02 4.590e+02 5.879e+02 8.375e+02 2.097e+03, threshold=1.176e+03, percent-clipped=3.0 2023-06-23 04:09:34,834 INFO [train.py:996] (2/4) Epoch 8, batch 19150, loss[loss=0.2378, simple_loss=0.3266, pruned_loss=0.07449, over 20871.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3112, pruned_loss=0.08483, over 4264222.51 frames. ], batch size: 608, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:09:38,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1395672.0, ans=0.05 2023-06-23 04:10:05,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.14 vs. 
limit=22.5 2023-06-23 04:10:20,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1395792.0, ans=0.04949747468305833 2023-06-23 04:10:34,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1395792.0, ans=0.2 2023-06-23 04:11:19,391 INFO [train.py:996] (2/4) Epoch 8, batch 19200, loss[loss=0.2853, simple_loss=0.3952, pruned_loss=0.08766, over 21224.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3232, pruned_loss=0.0861, over 4269848.60 frames. ], batch size: 549, lr: 3.70e-03, grad_scale: 32.0 2023-06-23 04:11:58,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1396092.0, ans=0.1 2023-06-23 04:12:50,651 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.764e+02 4.861e+02 7.058e+02 9.743e+02 2.046e+03, threshold=1.412e+03, percent-clipped=16.0 2023-06-23 04:12:50,677 INFO [train.py:996] (2/4) Epoch 8, batch 19250, loss[loss=0.2047, simple_loss=0.2969, pruned_loss=0.0562, over 21418.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3221, pruned_loss=0.08073, over 4266182.59 frames. ], batch size: 548, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:13:18,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1396332.0, ans=0.125 2023-06-23 04:13:22,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396332.0, ans=0.1 2023-06-23 04:13:38,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1396392.0, ans=0.125 2023-06-23 04:13:44,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1396392.0, ans=0.5 2023-06-23 04:14:29,804 INFO [train.py:996] (2/4) Epoch 8, batch 19300, loss[loss=0.2296, simple_loss=0.2999, pruned_loss=0.07966, over 21627.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3204, pruned_loss=0.08084, over 4269242.47 frames. ], batch size: 195, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:14:38,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1396572.0, ans=0.125 2023-06-23 04:14:51,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1396572.0, ans=0.1 2023-06-23 04:15:18,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-23 04:15:47,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1396752.0, ans=0.0 2023-06-23 04:16:14,256 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.848e+02 5.029e+02 6.832e+02 8.768e+02 1.869e+03, threshold=1.366e+03, percent-clipped=8.0 2023-06-23 04:16:14,276 INFO [train.py:996] (2/4) Epoch 8, batch 19350, loss[loss=0.2228, simple_loss=0.31, pruned_loss=0.06781, over 21851.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3139, pruned_loss=0.07672, over 4272452.06 frames. 
], batch size: 373, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:16:48,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1396932.0, ans=0.125 2023-06-23 04:17:17,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1397052.0, ans=0.2 2023-06-23 04:17:26,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1397052.0, ans=0.1 2023-06-23 04:17:36,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1397112.0, ans=0.1 2023-06-23 04:17:54,536 INFO [train.py:996] (2/4) Epoch 8, batch 19400, loss[loss=0.2004, simple_loss=0.2821, pruned_loss=0.05934, over 21803.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3109, pruned_loss=0.07547, over 4268412.12 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:18:19,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1397232.0, ans=0.0 2023-06-23 04:18:22,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1397232.0, ans=0.125 2023-06-23 04:18:29,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1397232.0, ans=0.0 2023-06-23 04:19:09,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1397352.0, ans=0.035 2023-06-23 04:19:28,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1397412.0, ans=0.125 2023-06-23 04:19:38,372 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.184e+02 4.448e+02 5.769e+02 7.543e+02 1.139e+03, threshold=1.154e+03, percent-clipped=0.0 2023-06-23 04:19:38,392 INFO [train.py:996] (2/4) Epoch 8, batch 19450, loss[loss=0.2162, simple_loss=0.2838, pruned_loss=0.07431, over 21733.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3088, pruned_loss=0.07756, over 4278934.66 frames. ], batch size: 298, lr: 3.70e-03, grad_scale: 16.0 2023-06-23 04:20:05,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1397532.0, ans=0.125 2023-06-23 04:20:20,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1397592.0, ans=0.1 2023-06-23 04:20:24,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1397592.0, ans=0.0 2023-06-23 04:21:16,472 INFO [train.py:996] (2/4) Epoch 8, batch 19500, loss[loss=0.223, simple_loss=0.304, pruned_loss=0.07101, over 21580.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.305, pruned_loss=0.07884, over 4282163.10 frames. ], batch size: 389, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:21:39,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1397832.0, ans=15.0 2023-06-23 04:22:01,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. 
limit=15.0 2023-06-23 04:22:23,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.55 vs. limit=10.0 2023-06-23 04:22:52,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1398012.0, ans=0.5 2023-06-23 04:22:54,692 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.009e+02 6.847e+02 1.109e+03 2.464e+03, threshold=1.369e+03, percent-clipped=22.0 2023-06-23 04:22:54,713 INFO [train.py:996] (2/4) Epoch 8, batch 19550, loss[loss=0.2049, simple_loss=0.3, pruned_loss=0.05489, over 21845.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2997, pruned_loss=0.07702, over 4275808.53 frames. ], batch size: 371, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:23:18,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1398132.0, ans=0.2 2023-06-23 04:24:02,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1398252.0, ans=0.125 2023-06-23 04:24:02,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-23 04:24:30,120 INFO [train.py:996] (2/4) Epoch 8, batch 19600, loss[loss=0.2839, simple_loss=0.3394, pruned_loss=0.1142, over 21319.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3014, pruned_loss=0.07784, over 4286477.74 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:25:15,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1398492.0, ans=0.125 2023-06-23 04:25:28,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1398552.0, ans=0.2 2023-06-23 04:25:49,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1398612.0, ans=0.125 2023-06-23 04:25:52,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.23 vs. limit=15.0 2023-06-23 04:25:58,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1398612.0, ans=0.125 2023-06-23 04:26:04,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1398612.0, ans=0.0 2023-06-23 04:26:08,759 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.522e+02 4.617e+02 5.615e+02 7.940e+02 2.383e+03, threshold=1.123e+03, percent-clipped=6.0 2023-06-23 04:26:08,783 INFO [train.py:996] (2/4) Epoch 8, batch 19650, loss[loss=0.228, simple_loss=0.3358, pruned_loss=0.0601, over 19879.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3071, pruned_loss=0.08198, over 4286734.58 frames. 
], batch size: 702, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:26:19,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1398672.0, ans=0.07 2023-06-23 04:26:24,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1398672.0, ans=0.2 2023-06-23 04:26:31,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1398732.0, ans=0.125 2023-06-23 04:27:24,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1398852.0, ans=0.125 2023-06-23 04:27:41,542 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.62 vs. limit=10.0 2023-06-23 04:27:42,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1398912.0, ans=0.125 2023-06-23 04:27:55,184 INFO [train.py:996] (2/4) Epoch 8, batch 19700, loss[loss=0.2319, simple_loss=0.3251, pruned_loss=0.06937, over 21725.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3095, pruned_loss=0.08255, over 4281221.80 frames. ], batch size: 352, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:29:06,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1399152.0, ans=0.125 2023-06-23 04:29:31,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1399212.0, ans=0.125 2023-06-23 04:29:34,733 INFO [train.py:996] (2/4) Epoch 8, batch 19750, loss[loss=0.2678, simple_loss=0.35, pruned_loss=0.09283, over 21595.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3189, pruned_loss=0.08373, over 4278693.71 frames. ], batch size: 230, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:29:36,357 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.270e+02 5.158e+02 7.198e+02 1.115e+03 3.431e+03, threshold=1.440e+03, percent-clipped=24.0 2023-06-23 04:30:26,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-06-23 04:30:45,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.80 vs. limit=8.0 2023-06-23 04:31:12,826 INFO [train.py:996] (2/4) Epoch 8, batch 19800, loss[loss=0.2422, simple_loss=0.3248, pruned_loss=0.07979, over 21512.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3194, pruned_loss=0.08436, over 4277222.25 frames. 
], batch size: 471, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:31:29,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1399572.0, ans=0.125 2023-06-23 04:31:52,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1399632.0, ans=0.125 2023-06-23 04:31:57,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1399692.0, ans=0.2 2023-06-23 04:32:09,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1399692.0, ans=0.2 2023-06-23 04:32:51,962 INFO [train.py:996] (2/4) Epoch 8, batch 19850, loss[loss=0.2224, simple_loss=0.3159, pruned_loss=0.06443, over 21759.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3119, pruned_loss=0.07966, over 4275664.78 frames. ], batch size: 332, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:32:53,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.027e+02 4.926e+02 6.065e+02 9.192e+02 2.099e+03, threshold=1.213e+03, percent-clipped=4.0 2023-06-23 04:32:57,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-23 04:34:28,510 INFO [train.py:996] (2/4) Epoch 8, batch 19900, loss[loss=0.2261, simple_loss=0.2987, pruned_loss=0.07681, over 21720.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3111, pruned_loss=0.07644, over 4276560.27 frames. ], batch size: 316, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:34:32,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1400172.0, ans=0.125 2023-06-23 04:34:36,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1400172.0, ans=0.125 2023-06-23 04:35:48,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1400412.0, ans=0.0 2023-06-23 04:36:08,889 INFO [train.py:996] (2/4) Epoch 8, batch 19950, loss[loss=0.1945, simple_loss=0.2592, pruned_loss=0.06483, over 21338.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3052, pruned_loss=0.07647, over 4275038.74 frames. ], batch size: 131, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:36:10,386 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.980e+02 4.068e+02 6.013e+02 8.874e+02 2.224e+03, threshold=1.203e+03, percent-clipped=12.0 2023-06-23 04:36:58,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1400592.0, ans=0.05 2023-06-23 04:37:00,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1400592.0, ans=0.125 2023-06-23 04:37:42,767 INFO [train.py:996] (2/4) Epoch 8, batch 20000, loss[loss=0.2533, simple_loss=0.3234, pruned_loss=0.09157, over 21847.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3064, pruned_loss=0.07794, over 4262885.62 frames. 
], batch size: 332, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:38:33,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1400892.0, ans=0.125 2023-06-23 04:38:48,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1400952.0, ans=0.1 2023-06-23 04:38:49,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1400952.0, ans=0.0 2023-06-23 04:39:05,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1401012.0, ans=0.025 2023-06-23 04:39:15,698 INFO [train.py:996] (2/4) Epoch 8, batch 20050, loss[loss=0.2374, simple_loss=0.306, pruned_loss=0.08442, over 21298.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3081, pruned_loss=0.07997, over 4269985.41 frames. ], batch size: 143, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:39:18,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.252e+02 4.577e+02 6.319e+02 8.281e+02 1.487e+03, threshold=1.264e+03, percent-clipped=6.0 2023-06-23 04:39:23,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1401072.0, ans=0.125 2023-06-23 04:40:46,892 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 04:40:54,564 INFO [train.py:996] (2/4) Epoch 8, batch 20100, loss[loss=0.2542, simple_loss=0.323, pruned_loss=0.0927, over 21394.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3099, pruned_loss=0.0816, over 4278076.68 frames. ], batch size: 159, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:41:04,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1401372.0, ans=0.04949747468305833 2023-06-23 04:42:47,131 INFO [train.py:996] (2/4) Epoch 8, batch 20150, loss[loss=0.2843, simple_loss=0.3632, pruned_loss=0.1026, over 21541.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3223, pruned_loss=0.0862, over 4280580.23 frames. ], batch size: 414, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:42:50,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.397e+02 4.538e+02 5.704e+02 8.156e+02 2.453e+03, threshold=1.141e+03, percent-clipped=8.0 2023-06-23 04:43:09,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1401732.0, ans=0.125 2023-06-23 04:43:26,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.29 vs. limit=15.0 2023-06-23 04:43:38,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1401792.0, ans=0.125 2023-06-23 04:44:00,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1401852.0, ans=0.125 2023-06-23 04:44:01,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1401852.0, ans=0.0 2023-06-23 04:44:24,613 INFO [train.py:996] (2/4) Epoch 8, batch 20200, loss[loss=0.2323, simple_loss=0.3155, pruned_loss=0.07449, over 21277.00 frames. ], tot_loss[loss=0.2531, simple_loss=0.3275, pruned_loss=0.08933, over 4276612.09 frames. 
], batch size: 176, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:44:41,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.96 vs. limit=15.0 2023-06-23 04:44:50,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1402032.0, ans=0.125 2023-06-23 04:45:25,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1402152.0, ans=0.0 2023-06-23 04:45:28,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1402152.0, ans=0.0 2023-06-23 04:45:58,109 INFO [train.py:996] (2/4) Epoch 8, batch 20250, loss[loss=0.2331, simple_loss=0.3073, pruned_loss=0.07946, over 21665.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3287, pruned_loss=0.08751, over 4271768.54 frames. ], batch size: 230, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:45:59,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-23 04:46:01,464 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 5.077e+02 7.177e+02 9.506e+02 2.179e+03, threshold=1.435e+03, percent-clipped=12.0 2023-06-23 04:46:08,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1402272.0, ans=0.0 2023-06-23 04:46:39,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1402392.0, ans=0.2 2023-06-23 04:47:07,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1402452.0, ans=0.125 2023-06-23 04:47:36,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-23 04:47:37,005 INFO [train.py:996] (2/4) Epoch 8, batch 20300, loss[loss=0.2284, simple_loss=0.3048, pruned_loss=0.07595, over 21263.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3258, pruned_loss=0.08423, over 4269303.80 frames. ], batch size: 176, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:48:53,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=15.0 2023-06-23 04:49:04,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1402812.0, ans=0.125 2023-06-23 04:49:10,166 INFO [train.py:996] (2/4) Epoch 8, batch 20350, loss[loss=0.2513, simple_loss=0.3204, pruned_loss=0.09111, over 21810.00 frames. ], tot_loss[loss=0.249, simple_loss=0.327, pruned_loss=0.08552, over 4268416.06 frames. 
], batch size: 124, lr: 3.69e-03, grad_scale: 16.0 2023-06-23 04:49:13,371 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.458e+02 4.961e+02 7.570e+02 1.006e+03 1.715e+03, threshold=1.514e+03, percent-clipped=7.0 2023-06-23 04:49:33,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1402932.0, ans=0.1 2023-06-23 04:50:02,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1403052.0, ans=0.2 2023-06-23 04:50:25,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1403052.0, ans=0.0 2023-06-23 04:50:35,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1403112.0, ans=0.2 2023-06-23 04:50:44,165 INFO [train.py:996] (2/4) Epoch 8, batch 20400, loss[loss=0.2672, simple_loss=0.3395, pruned_loss=0.09743, over 21788.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3291, pruned_loss=0.08807, over 4260264.73 frames. ], batch size: 332, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:50:44,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1403172.0, ans=0.125 2023-06-23 04:51:35,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1403352.0, ans=0.1 2023-06-23 04:51:49,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1403352.0, ans=0.125 2023-06-23 04:52:17,035 INFO [train.py:996] (2/4) Epoch 8, batch 20450, loss[loss=0.2657, simple_loss=0.3276, pruned_loss=0.1019, over 21510.00 frames. ], tot_loss[loss=0.2567, simple_loss=0.331, pruned_loss=0.09117, over 4261351.51 frames. ], batch size: 194, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:52:19,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1403472.0, ans=0.125 2023-06-23 04:52:20,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.597e+02 5.065e+02 6.621e+02 9.433e+02 1.870e+03, threshold=1.324e+03, percent-clipped=2.0 2023-06-23 04:52:26,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1403472.0, ans=0.0 2023-06-23 04:52:59,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.91 vs. limit=10.0 2023-06-23 04:53:21,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=22.5 2023-06-23 04:53:30,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1403652.0, ans=0.125 2023-06-23 04:53:54,341 INFO [train.py:996] (2/4) Epoch 8, batch 20500, loss[loss=0.2044, simple_loss=0.2741, pruned_loss=0.06738, over 21691.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.325, pruned_loss=0.09084, over 4259372.12 frames. ], batch size: 247, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:54:01,335 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. 
limit=15.0 2023-06-23 04:54:16,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1403832.0, ans=0.1 2023-06-23 04:54:42,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1403892.0, ans=0.0 2023-06-23 04:55:14,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1404012.0, ans=0.0 2023-06-23 04:55:28,228 INFO [train.py:996] (2/4) Epoch 8, batch 20550, loss[loss=0.2663, simple_loss=0.3462, pruned_loss=0.0932, over 21573.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3167, pruned_loss=0.08862, over 4249656.89 frames. ], batch size: 389, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:55:31,311 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.998e+02 4.304e+02 5.833e+02 8.675e+02 1.439e+03, threshold=1.167e+03, percent-clipped=3.0 2023-06-23 04:55:35,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1404072.0, ans=0.125 2023-06-23 04:56:26,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1404192.0, ans=0.09899494936611666 2023-06-23 04:57:06,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404372.0, ans=0.1 2023-06-23 04:57:07,603 INFO [train.py:996] (2/4) Epoch 8, batch 20600, loss[loss=0.2525, simple_loss=0.3181, pruned_loss=0.09345, over 21031.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3185, pruned_loss=0.08546, over 4247190.45 frames. ], batch size: 607, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:57:27,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1404432.0, ans=0.125 2023-06-23 04:57:36,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1404432.0, ans=0.1 2023-06-23 04:57:46,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1404492.0, ans=0.0 2023-06-23 04:58:09,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-06-23 04:58:46,044 INFO [train.py:996] (2/4) Epoch 8, batch 20650, loss[loss=0.2032, simple_loss=0.2675, pruned_loss=0.06945, over 21593.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3148, pruned_loss=0.08565, over 4256143.68 frames. ], batch size: 231, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 04:58:49,134 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.528e+02 4.605e+02 7.657e+02 1.188e+03 2.326e+03, threshold=1.531e+03, percent-clipped=25.0 2023-06-23 04:59:05,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1404732.0, ans=0.1 2023-06-23 05:00:26,461 INFO [train.py:996] (2/4) Epoch 8, batch 20700, loss[loss=0.2884, simple_loss=0.3598, pruned_loss=0.1085, over 21590.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.307, pruned_loss=0.08234, over 4260001.68 frames. ], batch size: 441, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 05:02:07,882 INFO [train.py:996] (2/4) Epoch 8, batch 20750, loss[loss=0.3586, simple_loss=0.4426, pruned_loss=0.1373, over 21511.00 frames. 
], tot_loss[loss=0.2367, simple_loss=0.3102, pruned_loss=0.08166, over 4259316.64 frames. ], batch size: 471, lr: 3.69e-03, grad_scale: 32.0 2023-06-23 05:02:11,636 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.166e+02 4.120e+02 5.727e+02 9.009e+02 2.135e+03, threshold=1.145e+03, percent-clipped=5.0 2023-06-23 05:02:31,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1405272.0, ans=0.2 2023-06-23 05:02:37,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1405332.0, ans=0.07 2023-06-23 05:02:45,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1405332.0, ans=0.125 2023-06-23 05:03:03,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1405392.0, ans=0.125 2023-06-23 05:03:03,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1405392.0, ans=0.2 2023-06-23 05:03:24,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1405452.0, ans=0.0 2023-06-23 05:03:47,501 INFO [train.py:996] (2/4) Epoch 8, batch 20800, loss[loss=0.2175, simple_loss=0.281, pruned_loss=0.07699, over 19944.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3139, pruned_loss=0.08261, over 4263782.74 frames. ], batch size: 703, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:04:36,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.30 vs. limit=15.0 2023-06-23 05:05:00,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1405752.0, ans=0.125 2023-06-23 05:05:20,434 INFO [train.py:996] (2/4) Epoch 8, batch 20850, loss[loss=0.2143, simple_loss=0.2871, pruned_loss=0.07072, over 21776.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3072, pruned_loss=0.08084, over 4261681.52 frames. ], batch size: 414, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:05:28,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.968e+02 4.772e+02 9.225e+02 1.220e+03 2.670e+03, threshold=1.845e+03, percent-clipped=33.0 2023-06-23 05:05:38,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1405872.0, ans=0.125 2023-06-23 05:06:36,661 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 05:06:38,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1406052.0, ans=0.04949747468305833 2023-06-23 05:06:45,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1406112.0, ans=0.0 2023-06-23 05:06:57,348 INFO [train.py:996] (2/4) Epoch 8, batch 20900, loss[loss=0.2765, simple_loss=0.3497, pruned_loss=0.1017, over 21931.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.309, pruned_loss=0.08262, over 4271184.59 frames. 
], batch size: 373, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:07:05,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1406172.0, ans=0.09899494936611666 2023-06-23 05:07:35,841 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-23 05:08:33,363 INFO [train.py:996] (2/4) Epoch 8, batch 20950, loss[loss=0.1895, simple_loss=0.2657, pruned_loss=0.05668, over 21904.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.304, pruned_loss=0.07864, over 4264012.27 frames. ], batch size: 98, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:08:36,653 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.769e+02 4.522e+02 5.963e+02 9.256e+02 1.585e+03, threshold=1.193e+03, percent-clipped=0.0 2023-06-23 05:08:37,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1406472.0, ans=0.0 2023-06-23 05:09:48,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1406652.0, ans=0.1 2023-06-23 05:10:11,179 INFO [train.py:996] (2/4) Epoch 8, batch 21000, loss[loss=0.237, simple_loss=0.3086, pruned_loss=0.08272, over 21222.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3035, pruned_loss=0.07956, over 4267781.07 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:10:11,180 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 05:10:21,609 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.2.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.2657, 4.2359, 2.2779, 2.5530], device='cuda:2') 2023-06-23 05:10:27,253 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2634, simple_loss=0.3611, pruned_loss=0.08288, over 1796401.00 frames. 2023-06-23 05:10:27,254 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 05:10:31,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.12 vs. limit=15.0 2023-06-23 05:10:59,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-23 05:11:00,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1406832.0, ans=0.1 2023-06-23 05:11:18,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.35 vs. limit=6.0 2023-06-23 05:11:59,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1407012.0, ans=0.0 2023-06-23 05:12:04,098 INFO [train.py:996] (2/4) Epoch 8, batch 21050, loss[loss=0.228, simple_loss=0.2891, pruned_loss=0.08344, over 21271.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3011, pruned_loss=0.07946, over 4269683.69 frames. 
], batch size: 159, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:12:07,318 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.662e+02 4.992e+02 6.776e+02 1.028e+03 2.055e+03, threshold=1.355e+03, percent-clipped=16.0 2023-06-23 05:13:00,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1407192.0, ans=0.0 2023-06-23 05:13:20,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1407252.0, ans=0.125 2023-06-23 05:13:29,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1407312.0, ans=0.125 2023-06-23 05:13:42,102 INFO [train.py:996] (2/4) Epoch 8, batch 21100, loss[loss=0.2218, simple_loss=0.287, pruned_loss=0.0783, over 21525.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2955, pruned_loss=0.07793, over 4254810.57 frames. ], batch size: 391, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:13:54,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=15.0 2023-06-23 05:13:55,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.44 vs. limit=15.0 2023-06-23 05:14:28,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1407492.0, ans=0.125 2023-06-23 05:14:34,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1407492.0, ans=0.125 2023-06-23 05:15:04,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1407612.0, ans=0.07 2023-06-23 05:15:14,414 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.52 vs. limit=15.0 2023-06-23 05:15:15,043 INFO [train.py:996] (2/4) Epoch 8, batch 21150, loss[loss=0.204, simple_loss=0.2663, pruned_loss=0.07085, over 21746.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.292, pruned_loss=0.07787, over 4258118.04 frames. ], batch size: 300, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:15:17,990 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.999e+02 4.608e+02 5.910e+02 9.200e+02 1.578e+03, threshold=1.182e+03, percent-clipped=4.0 2023-06-23 05:16:44,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.81 vs. limit=5.0 2023-06-23 05:16:54,276 INFO [train.py:996] (2/4) Epoch 8, batch 21200, loss[loss=0.2308, simple_loss=0.2987, pruned_loss=0.08141, over 21967.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2891, pruned_loss=0.07724, over 4258183.03 frames. 
], batch size: 103, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:17:21,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1407972.0, ans=0.125 2023-06-23 05:17:53,094 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 05:17:54,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1408092.0, ans=0.1 2023-06-23 05:18:32,261 INFO [train.py:996] (2/4) Epoch 8, batch 21250, loss[loss=0.2404, simple_loss=0.2938, pruned_loss=0.0935, over 21555.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2882, pruned_loss=0.07784, over 4258541.55 frames. ], batch size: 391, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:18:32,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1408272.0, ans=0.125 2023-06-23 05:18:41,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.312e+02 4.434e+02 5.437e+02 7.242e+02 2.137e+03, threshold=1.087e+03, percent-clipped=7.0 2023-06-23 05:19:06,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1408332.0, ans=0.125 2023-06-23 05:19:48,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1408452.0, ans=0.125 2023-06-23 05:20:05,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1408512.0, ans=0.125 2023-06-23 05:20:11,398 INFO [train.py:996] (2/4) Epoch 8, batch 21300, loss[loss=0.2466, simple_loss=0.315, pruned_loss=0.08912, over 21490.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.296, pruned_loss=0.08026, over 4262238.25 frames. ], batch size: 212, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:20:18,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-23 05:21:00,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1408692.0, ans=0.125 2023-06-23 05:21:29,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1408752.0, ans=0.2 2023-06-23 05:21:31,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1408752.0, ans=0.2 2023-06-23 05:21:54,418 INFO [train.py:996] (2/4) Epoch 8, batch 21350, loss[loss=0.205, simple_loss=0.3002, pruned_loss=0.05494, over 21766.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3, pruned_loss=0.08092, over 4270135.17 frames. ], batch size: 332, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:22:10,264 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.040e+02 5.053e+02 6.684e+02 9.217e+02 2.330e+03, threshold=1.337e+03, percent-clipped=18.0 2023-06-23 05:22:59,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1409052.0, ans=0.125 2023-06-23 05:23:09,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.54 vs. 
limit=22.5 2023-06-23 05:23:38,546 INFO [train.py:996] (2/4) Epoch 8, batch 21400, loss[loss=0.2901, simple_loss=0.3653, pruned_loss=0.1075, over 21379.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3035, pruned_loss=0.08127, over 4270787.73 frames. ], batch size: 471, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:23:49,219 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.13 vs. limit=15.0 2023-06-23 05:23:53,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1409172.0, ans=0.125 2023-06-23 05:24:08,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1409232.0, ans=0.125 2023-06-23 05:24:18,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1409232.0, ans=0.125 2023-06-23 05:25:22,957 INFO [train.py:996] (2/4) Epoch 8, batch 21450, loss[loss=0.322, simple_loss=0.3641, pruned_loss=0.14, over 21715.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3067, pruned_loss=0.08191, over 4270787.98 frames. ], batch size: 507, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:25:28,992 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.984e+02 4.393e+02 5.335e+02 6.741e+02 1.398e+03, threshold=1.067e+03, percent-clipped=1.0 2023-06-23 05:25:43,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1409532.0, ans=0.125 2023-06-23 05:25:53,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1409532.0, ans=0.125 2023-06-23 05:27:01,221 INFO [train.py:996] (2/4) Epoch 8, batch 21500, loss[loss=0.2524, simple_loss=0.3047, pruned_loss=0.1, over 21284.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3061, pruned_loss=0.08343, over 4267693.33 frames. ], batch size: 159, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:28:13,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1410012.0, ans=0.0 2023-06-23 05:28:20,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1410012.0, ans=0.0 2023-06-23 05:28:39,686 INFO [train.py:996] (2/4) Epoch 8, batch 21550, loss[loss=0.2385, simple_loss=0.3556, pruned_loss=0.06069, over 19655.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.2982, pruned_loss=0.08012, over 4256908.54 frames. ], batch size: 703, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:28:46,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.014e+02 4.565e+02 6.143e+02 8.904e+02 1.889e+03, threshold=1.229e+03, percent-clipped=13.0 2023-06-23 05:29:01,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1410132.0, ans=0.125 2023-06-23 05:29:22,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1410192.0, ans=0.125 2023-06-23 05:29:43,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1410252.0, ans=0.1 2023-06-23 05:30:01,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.91 vs. 
limit=6.0 2023-06-23 05:30:03,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1410312.0, ans=0.125 2023-06-23 05:30:25,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1410372.0, ans=0.125 2023-06-23 05:30:26,674 INFO [train.py:996] (2/4) Epoch 8, batch 21600, loss[loss=0.2278, simple_loss=0.319, pruned_loss=0.0683, over 21579.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2943, pruned_loss=0.07844, over 4249814.85 frames. ], batch size: 441, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:30:45,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=15.0 2023-06-23 05:30:51,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-23 05:30:52,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1410432.0, ans=0.0 2023-06-23 05:31:42,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1410612.0, ans=0.125 2023-06-23 05:31:43,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1410612.0, ans=0.2 2023-06-23 05:32:05,250 INFO [train.py:996] (2/4) Epoch 8, batch 21650, loss[loss=0.283, simple_loss=0.3759, pruned_loss=0.09501, over 21652.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2979, pruned_loss=0.0763, over 4247619.51 frames. ], batch size: 441, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:32:08,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1410672.0, ans=0.0 2023-06-23 05:32:10,915 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.286e+02 5.401e+02 7.635e+02 1.107e+03 2.032e+03, threshold=1.527e+03, percent-clipped=20.0 2023-06-23 05:32:45,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1410792.0, ans=0.2 2023-06-23 05:33:24,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1410912.0, ans=0.125 2023-06-23 05:33:36,428 INFO [train.py:996] (2/4) Epoch 8, batch 21700, loss[loss=0.2029, simple_loss=0.2776, pruned_loss=0.06411, over 21457.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2971, pruned_loss=0.07485, over 4249225.96 frames. ], batch size: 211, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:33:59,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1411032.0, ans=0.1 2023-06-23 05:34:13,734 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. 
limit=15.0 2023-06-23 05:34:19,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1411092.0, ans=0.1 2023-06-23 05:34:58,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1411212.0, ans=0.0 2023-06-23 05:35:14,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1411272.0, ans=0.0 2023-06-23 05:35:15,274 INFO [train.py:996] (2/4) Epoch 8, batch 21750, loss[loss=0.2271, simple_loss=0.2932, pruned_loss=0.08045, over 21726.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.294, pruned_loss=0.07507, over 4247339.99 frames. ], batch size: 124, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:35:27,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.075e+02 4.568e+02 6.230e+02 8.144e+02 2.277e+03, threshold=1.246e+03, percent-clipped=1.0 2023-06-23 05:35:42,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-23 05:36:01,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1411392.0, ans=0.125 2023-06-23 05:36:01,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1411392.0, ans=0.125 2023-06-23 05:36:15,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-23 05:36:37,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1411512.0, ans=22.5 2023-06-23 05:36:38,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1411512.0, ans=0.0 2023-06-23 05:36:50,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1411512.0, ans=0.125 2023-06-23 05:37:01,103 INFO [train.py:996] (2/4) Epoch 8, batch 21800, loss[loss=0.2043, simple_loss=0.2774, pruned_loss=0.06563, over 21566.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.293, pruned_loss=0.07618, over 4235488.30 frames. ], batch size: 391, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:37:06,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1411572.0, ans=0.125 2023-06-23 05:37:09,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1411572.0, ans=0.125 2023-06-23 05:37:17,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1411632.0, ans=0.0 2023-06-23 05:37:53,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1411752.0, ans=0.2 2023-06-23 05:37:59,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1411752.0, ans=0.2 2023-06-23 05:38:39,314 INFO [train.py:996] (2/4) Epoch 8, batch 21850, loss[loss=0.2285, simple_loss=0.3038, pruned_loss=0.07654, over 21273.00 frames. 
], tot_loss[loss=0.2244, simple_loss=0.2963, pruned_loss=0.07622, over 4244037.55 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:38:40,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.71 vs. limit=12.0 2023-06-23 05:38:47,503 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.413e+02 4.593e+02 6.628e+02 8.915e+02 2.617e+03, threshold=1.326e+03, percent-clipped=11.0 2023-06-23 05:38:55,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=12.0 2023-06-23 05:38:55,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1411932.0, ans=0.0 2023-06-23 05:39:11,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1411992.0, ans=0.1 2023-06-23 05:40:01,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1412112.0, ans=0.0 2023-06-23 05:40:12,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1412112.0, ans=0.125 2023-06-23 05:40:20,478 INFO [train.py:996] (2/4) Epoch 8, batch 21900, loss[loss=0.267, simple_loss=0.3314, pruned_loss=0.1013, over 21834.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3008, pruned_loss=0.07764, over 4251862.10 frames. ], batch size: 441, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:40:25,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1412172.0, ans=0.125 2023-06-23 05:40:30,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1412172.0, ans=0.125 2023-06-23 05:40:46,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5 2023-06-23 05:41:04,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1412292.0, ans=0.125 2023-06-23 05:42:00,029 INFO [train.py:996] (2/4) Epoch 8, batch 21950, loss[loss=0.1277, simple_loss=0.2021, pruned_loss=0.02663, over 21203.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2962, pruned_loss=0.07748, over 4251552.29 frames. ], batch size: 176, lr: 3.68e-03, grad_scale: 16.0 2023-06-23 05:42:07,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.901e+02 4.723e+02 6.314e+02 7.880e+02 1.650e+03, threshold=1.263e+03, percent-clipped=2.0 2023-06-23 05:42:35,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1412592.0, ans=0.125 2023-06-23 05:42:42,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1412592.0, ans=0.0 2023-06-23 05:43:16,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. 
limit=15.0 2023-06-23 05:43:26,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1412712.0, ans=0.0 2023-06-23 05:43:40,021 INFO [train.py:996] (2/4) Epoch 8, batch 22000, loss[loss=0.2273, simple_loss=0.2922, pruned_loss=0.08126, over 21846.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2904, pruned_loss=0.07475, over 4259224.27 frames. ], batch size: 372, lr: 3.68e-03, grad_scale: 32.0 2023-06-23 05:43:43,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1412772.0, ans=0.2 2023-06-23 05:43:43,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1412772.0, ans=0.125 2023-06-23 05:44:04,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1412832.0, ans=0.1 2023-06-23 05:44:33,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-23 05:45:21,157 INFO [train.py:996] (2/4) Epoch 8, batch 22050, loss[loss=0.2747, simple_loss=0.3571, pruned_loss=0.09618, over 21278.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2939, pruned_loss=0.07504, over 4257502.64 frames. ], batch size: 549, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:45:33,013 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.844e+02 4.843e+02 7.365e+02 1.302e+03 3.775e+03, threshold=1.473e+03, percent-clipped=26.0 2023-06-23 05:45:55,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1413192.0, ans=0.0 2023-06-23 05:47:02,683 INFO [train.py:996] (2/4) Epoch 8, batch 22100, loss[loss=0.3368, simple_loss=0.3927, pruned_loss=0.1405, over 21791.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3035, pruned_loss=0.08013, over 4255507.58 frames. ], batch size: 441, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:47:59,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.00 vs. limit=15.0 2023-06-23 05:48:39,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.44 vs. limit=10.0 2023-06-23 05:48:40,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1413672.0, ans=0.125 2023-06-23 05:48:41,573 INFO [train.py:996] (2/4) Epoch 8, batch 22150, loss[loss=0.2344, simple_loss=0.3082, pruned_loss=0.08029, over 21734.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3079, pruned_loss=0.08209, over 4266200.35 frames. 
], batch size: 389, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:48:41,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1413672.0, ans=0.0 2023-06-23 05:48:43,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1413672.0, ans=0.1 2023-06-23 05:48:52,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.178e+02 4.829e+02 6.848e+02 1.021e+03 2.130e+03, threshold=1.370e+03, percent-clipped=6.0 2023-06-23 05:49:26,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1413792.0, ans=0.2 2023-06-23 05:49:57,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1413852.0, ans=0.125 2023-06-23 05:49:59,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1413852.0, ans=0.2 2023-06-23 05:50:10,082 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 05:50:11,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1413912.0, ans=0.125 2023-06-23 05:50:21,010 INFO [train.py:996] (2/4) Epoch 8, batch 22200, loss[loss=0.2613, simple_loss=0.3521, pruned_loss=0.08527, over 21846.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3117, pruned_loss=0.08359, over 4272763.01 frames. ], batch size: 371, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:50:29,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1413972.0, ans=0.1 2023-06-23 05:51:47,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1414212.0, ans=0.125 2023-06-23 05:51:53,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=22.5 2023-06-23 05:51:58,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1414212.0, ans=0.5 2023-06-23 05:52:00,973 INFO [train.py:996] (2/4) Epoch 8, batch 22250, loss[loss=0.2585, simple_loss=0.3212, pruned_loss=0.09789, over 20189.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3159, pruned_loss=0.08501, over 4277779.66 frames. 
], batch size: 702, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:52:04,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1414272.0, ans=0.125 2023-06-23 05:52:12,835 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.556e+02 5.033e+02 6.376e+02 9.699e+02 1.847e+03, threshold=1.275e+03, percent-clipped=11.0 2023-06-23 05:52:14,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1414272.0, ans=0.2 2023-06-23 05:52:21,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1414332.0, ans=0.1 2023-06-23 05:53:37,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1414512.0, ans=0.0 2023-06-23 05:53:40,266 INFO [train.py:996] (2/4) Epoch 8, batch 22300, loss[loss=0.2699, simple_loss=0.3502, pruned_loss=0.09485, over 20798.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3175, pruned_loss=0.08687, over 4277379.27 frames. ], batch size: 607, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:54:25,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1414692.0, ans=0.02 2023-06-23 05:54:25,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-23 05:55:12,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1414872.0, ans=0.125 2023-06-23 05:55:13,962 INFO [train.py:996] (2/4) Epoch 8, batch 22350, loss[loss=0.2491, simple_loss=0.3091, pruned_loss=0.09459, over 21329.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3149, pruned_loss=0.08663, over 4285934.81 frames. ], batch size: 143, lr: 3.67e-03, grad_scale: 8.0 2023-06-23 05:55:19,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1414872.0, ans=0.125 2023-06-23 05:55:25,667 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.473e+02 4.765e+02 6.117e+02 7.891e+02 1.509e+03, threshold=1.223e+03, percent-clipped=2.0 2023-06-23 05:55:31,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1414932.0, ans=0.025 2023-06-23 05:55:53,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1414992.0, ans=0.125 2023-06-23 05:56:48,469 INFO [train.py:996] (2/4) Epoch 8, batch 22400, loss[loss=0.2298, simple_loss=0.2841, pruned_loss=0.08773, over 21552.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3111, pruned_loss=0.08268, over 4288338.69 frames. 
], batch size: 230, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 05:56:53,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1415172.0, ans=0.125 2023-06-23 05:57:01,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1415172.0, ans=0.2 2023-06-23 05:58:11,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1415412.0, ans=0.125 2023-06-23 05:58:11,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1415412.0, ans=0.2 2023-06-23 05:58:24,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1415412.0, ans=0.125 2023-06-23 05:58:26,817 INFO [train.py:996] (2/4) Epoch 8, batch 22450, loss[loss=0.2167, simple_loss=0.2825, pruned_loss=0.07542, over 21779.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3058, pruned_loss=0.08227, over 4284730.03 frames. ], batch size: 317, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 05:58:37,905 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 3.949e+02 5.140e+02 7.263e+02 1.360e+03, threshold=1.028e+03, percent-clipped=2.0 2023-06-23 05:59:54,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1415712.0, ans=0.2 2023-06-23 06:00:06,792 INFO [train.py:996] (2/4) Epoch 8, batch 22500, loss[loss=0.2321, simple_loss=0.3208, pruned_loss=0.07171, over 21378.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3008, pruned_loss=0.08204, over 4286387.49 frames. ], batch size: 194, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:00:34,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1415832.0, ans=0.125 2023-06-23 06:00:40,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.08 vs. limit=15.0 2023-06-23 06:00:53,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1415892.0, ans=0.0 2023-06-23 06:01:01,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1415892.0, ans=0.2 2023-06-23 06:01:04,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-23 06:01:22,283 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-06-23 06:01:47,295 INFO [train.py:996] (2/4) Epoch 8, batch 22550, loss[loss=0.2514, simple_loss=0.3259, pruned_loss=0.08843, over 21890.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3057, pruned_loss=0.08339, over 4291461.23 frames. 
], batch size: 107, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:02:04,071 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.380e+02 5.264e+02 6.977e+02 1.047e+03 2.151e+03, threshold=1.395e+03, percent-clipped=25.0 2023-06-23 06:02:37,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1416192.0, ans=0.5 2023-06-23 06:03:02,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.01 vs. limit=15.0 2023-06-23 06:03:05,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.51 vs. limit=10.0 2023-06-23 06:03:18,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1416312.0, ans=0.125 2023-06-23 06:03:29,262 INFO [train.py:996] (2/4) Epoch 8, batch 22600, loss[loss=0.2329, simple_loss=0.3171, pruned_loss=0.07433, over 21674.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3093, pruned_loss=0.08301, over 4287848.28 frames. ], batch size: 389, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:04:14,577 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.48 vs. limit=10.0 2023-06-23 06:04:25,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1416552.0, ans=0.125 2023-06-23 06:04:30,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1416552.0, ans=0.125 2023-06-23 06:05:05,342 INFO [train.py:996] (2/4) Epoch 8, batch 22650, loss[loss=0.3366, simple_loss=0.4107, pruned_loss=0.1312, over 21417.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3069, pruned_loss=0.0825, over 4282867.31 frames. ], batch size: 507, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:05:21,123 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.542e+02 6.131e+02 9.012e+02 1.354e+03 2.560e+03, threshold=1.802e+03, percent-clipped=24.0 2023-06-23 06:05:43,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1416732.0, ans=0.2 2023-06-23 06:05:51,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1416792.0, ans=0.5 2023-06-23 06:06:29,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1416912.0, ans=0.0 2023-06-23 06:06:36,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1416972.0, ans=0.2 2023-06-23 06:06:37,786 INFO [train.py:996] (2/4) Epoch 8, batch 22700, loss[loss=0.2253, simple_loss=0.2945, pruned_loss=0.07803, over 20021.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3023, pruned_loss=0.08257, over 4276944.14 frames. ], batch size: 703, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:07:24,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1417092.0, ans=0.125 2023-06-23 06:07:59,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.51 vs. 
limit=15.0 2023-06-23 06:08:08,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1417212.0, ans=0.1 2023-06-23 06:08:16,114 INFO [train.py:996] (2/4) Epoch 8, batch 22750, loss[loss=0.2663, simple_loss=0.3352, pruned_loss=0.09867, over 21697.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3037, pruned_loss=0.08457, over 4279086.91 frames. ], batch size: 351, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:08:31,923 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.804e+02 6.420e+02 9.928e+02 2.099e+03, threshold=1.284e+03, percent-clipped=4.0 2023-06-23 06:08:33,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1417272.0, ans=0.02 2023-06-23 06:08:48,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.00 vs. limit=15.0 2023-06-23 06:09:19,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=15.0 2023-06-23 06:09:54,197 INFO [train.py:996] (2/4) Epoch 8, batch 22800, loss[loss=0.2151, simple_loss=0.287, pruned_loss=0.07159, over 21865.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3074, pruned_loss=0.08658, over 4275033.55 frames. ], batch size: 333, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:10:09,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1417572.0, ans=0.0 2023-06-23 06:10:20,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1417632.0, ans=0.125 2023-06-23 06:10:40,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1417692.0, ans=0.1 2023-06-23 06:10:54,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1417752.0, ans=0.5 2023-06-23 06:11:32,450 INFO [train.py:996] (2/4) Epoch 8, batch 22850, loss[loss=0.2088, simple_loss=0.2661, pruned_loss=0.07569, over 21485.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3035, pruned_loss=0.08571, over 4271012.09 frames. ], batch size: 441, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:11:49,366 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.448e+02 5.341e+02 7.317e+02 9.622e+02 1.873e+03, threshold=1.463e+03, percent-clipped=13.0 2023-06-23 06:12:09,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1417932.0, ans=0.125 2023-06-23 06:12:57,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1418112.0, ans=0.125 2023-06-23 06:13:07,125 INFO [train.py:996] (2/4) Epoch 8, batch 22900, loss[loss=0.284, simple_loss=0.3874, pruned_loss=0.09026, over 21617.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3051, pruned_loss=0.08437, over 4280085.26 frames. 
], batch size: 441, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:13:48,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1418292.0, ans=0.125 2023-06-23 06:14:08,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=15.0 2023-06-23 06:14:16,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1418352.0, ans=0.0 2023-06-23 06:14:29,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-23 06:14:34,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1418412.0, ans=0.125 2023-06-23 06:14:56,811 INFO [train.py:996] (2/4) Epoch 8, batch 22950, loss[loss=0.2851, simple_loss=0.414, pruned_loss=0.07814, over 21617.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3183, pruned_loss=0.08354, over 4274935.00 frames. ], batch size: 389, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:15:05,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1418472.0, ans=0.2 2023-06-23 06:15:10,122 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.029e+02 4.953e+02 7.269e+02 1.039e+03 2.026e+03, threshold=1.454e+03, percent-clipped=12.0 2023-06-23 06:15:23,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1418532.0, ans=0.1 2023-06-23 06:15:45,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1418652.0, ans=0.025 2023-06-23 06:16:08,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1418652.0, ans=0.125 2023-06-23 06:16:22,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1418712.0, ans=0.5 2023-06-23 06:16:36,816 INFO [train.py:996] (2/4) Epoch 8, batch 23000, loss[loss=0.2372, simple_loss=0.3056, pruned_loss=0.08436, over 21916.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3171, pruned_loss=0.0814, over 4279337.22 frames. ], batch size: 333, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:16:37,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1418772.0, ans=0.0 2023-06-23 06:16:54,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1418832.0, ans=0.2 2023-06-23 06:18:12,362 INFO [train.py:996] (2/4) Epoch 8, batch 23050, loss[loss=0.2234, simple_loss=0.3017, pruned_loss=0.07256, over 21693.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3179, pruned_loss=0.08302, over 4281474.45 frames. 
], batch size: 298, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:18:25,318 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.192e+02 4.592e+02 5.368e+02 6.927e+02 1.540e+03, threshold=1.074e+03, percent-clipped=1.0 2023-06-23 06:18:51,864 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:19:06,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.77 vs. limit=10.0 2023-06-23 06:19:34,650 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:19:47,023 INFO [train.py:996] (2/4) Epoch 8, batch 23100, loss[loss=0.1915, simple_loss=0.2553, pruned_loss=0.06387, over 21524.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3124, pruned_loss=0.08302, over 4281623.81 frames. ], batch size: 212, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:20:00,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419372.0, ans=0.1 2023-06-23 06:20:50,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1419552.0, ans=0.0 2023-06-23 06:21:21,794 INFO [train.py:996] (2/4) Epoch 8, batch 23150, loss[loss=0.2353, simple_loss=0.2996, pruned_loss=0.08546, over 21928.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3068, pruned_loss=0.08258, over 4288845.74 frames. ], batch size: 316, lr: 3.67e-03, grad_scale: 16.0 2023-06-23 06:21:34,635 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.444e+02 4.721e+02 6.329e+02 9.421e+02 1.968e+03, threshold=1.266e+03, percent-clipped=20.0 2023-06-23 06:21:49,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1419732.0, ans=0.1 2023-06-23 06:22:13,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1419792.0, ans=0.2 2023-06-23 06:22:21,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1419852.0, ans=0.125 2023-06-23 06:22:29,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1419852.0, ans=0.0 2023-06-23 06:22:41,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1419852.0, ans=0.2 2023-06-23 06:22:41,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1419852.0, ans=0.0 2023-06-23 06:22:59,178 INFO [train.py:996] (2/4) Epoch 8, batch 23200, loss[loss=0.2512, simple_loss=0.3179, pruned_loss=0.0923, over 21790.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3063, pruned_loss=0.08389, over 4294303.15 frames. 
], batch size: 112, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:24:06,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1420152.0, ans=0.125 2023-06-23 06:24:24,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1420212.0, ans=0.0 2023-06-23 06:24:37,810 INFO [train.py:996] (2/4) Epoch 8, batch 23250, loss[loss=0.2283, simple_loss=0.2986, pruned_loss=0.079, over 21534.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3071, pruned_loss=0.08497, over 4296392.08 frames. ], batch size: 131, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:24:50,325 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.546e+02 4.969e+02 6.559e+02 1.052e+03 2.390e+03, threshold=1.312e+03, percent-clipped=18.0 2023-06-23 06:25:07,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1420332.0, ans=0.0 2023-06-23 06:25:16,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1420392.0, ans=0.0 2023-06-23 06:25:26,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1420392.0, ans=0.0 2023-06-23 06:25:26,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1420392.0, ans=0.125 2023-06-23 06:25:39,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1420452.0, ans=0.1 2023-06-23 06:25:57,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1420452.0, ans=0.0 2023-06-23 06:26:18,060 INFO [train.py:996] (2/4) Epoch 8, batch 23300, loss[loss=0.2613, simple_loss=0.3613, pruned_loss=0.08062, over 21786.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3148, pruned_loss=0.08682, over 4299545.58 frames. ], batch size: 247, lr: 3.67e-03, grad_scale: 32.0 2023-06-23 06:26:21,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1420572.0, ans=0.0 2023-06-23 06:26:59,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.39 vs. limit=10.0 2023-06-23 06:27:06,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1420692.0, ans=0.0 2023-06-23 06:27:27,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1420752.0, ans=0.07 2023-06-23 06:27:40,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1420812.0, ans=0.125 2023-06-23 06:27:50,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1420812.0, ans=0.0 2023-06-23 06:27:58,438 INFO [train.py:996] (2/4) Epoch 8, batch 23350, loss[loss=0.182, simple_loss=0.2698, pruned_loss=0.04713, over 21838.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3198, pruned_loss=0.08559, over 4297407.46 frames. 
], batch size: 317, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:28:18,033 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.315e+02 4.912e+02 6.155e+02 8.820e+02 1.771e+03, threshold=1.231e+03, percent-clipped=5.0 2023-06-23 06:29:16,996 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:29:22,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1421112.0, ans=0.125 2023-06-23 06:29:37,109 INFO [train.py:996] (2/4) Epoch 8, batch 23400, loss[loss=0.213, simple_loss=0.3007, pruned_loss=0.06263, over 20890.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3137, pruned_loss=0.08176, over 4289888.00 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:31:09,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1421412.0, ans=0.1 2023-06-23 06:31:20,225 INFO [train.py:996] (2/4) Epoch 8, batch 23450, loss[loss=0.2604, simple_loss=0.3418, pruned_loss=0.08952, over 21516.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3139, pruned_loss=0.08419, over 4292220.77 frames. ], batch size: 131, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:31:25,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1421472.0, ans=0.125 2023-06-23 06:31:38,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.252e+02 4.296e+02 5.237e+02 7.563e+02 1.579e+03, threshold=1.047e+03, percent-clipped=8.0 2023-06-23 06:31:43,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1421532.0, ans=0.125 2023-06-23 06:32:32,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-06-23 06:32:58,359 INFO [train.py:996] (2/4) Epoch 8, batch 23500, loss[loss=0.2525, simple_loss=0.3808, pruned_loss=0.06206, over 19764.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3145, pruned_loss=0.08575, over 4291374.86 frames. ], batch size: 702, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:33:21,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1421832.0, ans=0.0 2023-06-23 06:33:37,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1421832.0, ans=0.125 2023-06-23 06:33:41,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1421892.0, ans=10.0 2023-06-23 06:34:00,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1421952.0, ans=0.0 2023-06-23 06:34:22,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1422012.0, ans=0.2 2023-06-23 06:34:35,782 INFO [train.py:996] (2/4) Epoch 8, batch 23550, loss[loss=0.2339, simple_loss=0.2812, pruned_loss=0.09328, over 21363.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3101, pruned_loss=0.08551, over 4294212.00 frames. 
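
[editor's note] Each Clipping_scale entry reports five gradient-norm statistics (minimum, 25%, median, 75%, maximum over a recent window), a clipping threshold, and the percentage of recent steps whose norm exceeded it. In every such entry in this stretch the threshold equals Clipping_scale times the median, e.g. 2.0 * 6.155e+02 = 1.231e+03 in the entry at 06:28:18 above. The sketch below reproduces that bookkeeping over a window of norms; the window itself and the lognormal stand-in data are illustrative assumptions.

# Hedged sketch of the "grad-norm quartiles ... threshold ... percent-clipped" fields.
import numpy as np

def clipping_stats(recent_norms, clipping_scale=2.0):
    q = np.quantile(recent_norms, [0.0, 0.25, 0.5, 0.75, 1.0])  # min/quartiles/max
    threshold = clipping_scale * q[2]                           # scale times the median
    pct_clipped = 100.0 * np.mean(np.asarray(recent_norms) > threshold)
    return q, threshold, pct_clipped

norms = np.random.lognormal(mean=6.3, sigma=0.35, size=1000)    # stand-in grad norms
quartiles, thr, pct = clipping_stats(norms)
print(quartiles, thr, pct)   # five statistics, threshold = 2 * median, percent clipped
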
], batch size: 508, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 06:34:54,226 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.322e+02 4.998e+02 7.038e+02 9.548e+02 2.153e+03, threshold=1.408e+03, percent-clipped=14.0 2023-06-23 06:35:27,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.76 vs. limit=15.0 2023-06-23 06:35:33,682 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.25 vs. limit=10.0 2023-06-23 06:35:38,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1422252.0, ans=0.2 2023-06-23 06:36:18,152 INFO [train.py:996] (2/4) Epoch 8, batch 23600, loss[loss=0.2619, simple_loss=0.3344, pruned_loss=0.0947, over 21279.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3088, pruned_loss=0.08531, over 4277630.26 frames. ], batch size: 159, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:36:54,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1422432.0, ans=0.2 2023-06-23 06:37:03,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.88 vs. limit=15.0 2023-06-23 06:37:10,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1422492.0, ans=0.0 2023-06-23 06:37:25,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1422552.0, ans=0.0 2023-06-23 06:37:58,153 INFO [train.py:996] (2/4) Epoch 8, batch 23650, loss[loss=0.2679, simple_loss=0.3443, pruned_loss=0.09572, over 21718.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3081, pruned_loss=0.08271, over 4276856.77 frames. ], batch size: 441, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:37:58,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1422672.0, ans=0.2 2023-06-23 06:38:22,826 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.610e+02 4.602e+02 5.917e+02 8.221e+02 1.589e+03, threshold=1.183e+03, percent-clipped=3.0 2023-06-23 06:38:42,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1422792.0, ans=0.1 2023-06-23 06:38:49,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1422792.0, ans=0.125 2023-06-23 06:39:08,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1422852.0, ans=0.125 2023-06-23 06:39:24,298 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:39:48,535 INFO [train.py:996] (2/4) Epoch 8, batch 23700, loss[loss=0.2215, simple_loss=0.3076, pruned_loss=0.06765, over 21600.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.31, pruned_loss=0.08186, over 4277866.69 frames. 
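
[editor's note] The ScheduledFloat entries record regularization hyperparameters (dropout_p, the various *_skip_rate values, balancer probabilities, bypass scale_min values, and so on) whose current value is looked up from a schedule indexed by batch_count; by this point in training the skip-rate entries above all read ans=0.0 while the dropout_p entries read ans=0.1. The function below is a sketch of a generic piecewise-linear schedule; the breakpoints are illustrative and this is not the actual scaling.py API.

# Hedged sketch: a value that follows a piecewise-linear schedule over batch_count.
def scheduled_float(batch_count, points):
    """points: list of (batch_count, value) pairs, sorted by batch_count."""
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            t = (batch_count - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

# e.g. a skip rate that starts at 0.2 and decays to 0.0 (breakpoints assumed)
conv_skip_rate = [(0, 0.2), (4000, 0.05), (16000, 0.0)]
print(scheduled_float(2000, conv_skip_rate))       # 0.125, mid-way through decay
print(scheduled_float(1_419_372, conv_skip_rate))  # 0.0, matching the ans=0.0 entries above
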
], batch size: 414, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:41:16,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1423212.0, ans=0.125 2023-06-23 06:41:19,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1423212.0, ans=0.95 2023-06-23 06:41:28,734 INFO [train.py:996] (2/4) Epoch 8, batch 23750, loss[loss=0.2237, simple_loss=0.3021, pruned_loss=0.07263, over 21342.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3132, pruned_loss=0.08232, over 4280793.57 frames. ], batch size: 176, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:41:37,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423272.0, ans=0.1 2023-06-23 06:41:40,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1423272.0, ans=0.125 2023-06-23 06:41:42,893 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.157e+02 4.173e+02 5.450e+02 7.281e+02 1.269e+03, threshold=1.090e+03, percent-clipped=1.0 2023-06-23 06:42:08,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-23 06:42:13,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1423392.0, ans=0.1 2023-06-23 06:42:22,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1423392.0, ans=0.0 2023-06-23 06:42:43,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.01 vs. limit=10.0 2023-06-23 06:42:51,712 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:43:07,498 INFO [train.py:996] (2/4) Epoch 8, batch 23800, loss[loss=0.2474, simple_loss=0.3289, pruned_loss=0.08295, over 21613.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.315, pruned_loss=0.08182, over 4273446.53 frames. ], batch size: 263, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:43:12,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1423572.0, ans=0.125 2023-06-23 06:43:39,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1423632.0, ans=0.0 2023-06-23 06:44:10,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1423692.0, ans=0.125 2023-06-23 06:44:45,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1423812.0, ans=0.0 2023-06-23 06:44:47,936 INFO [train.py:996] (2/4) Epoch 8, batch 23850, loss[loss=0.2368, simple_loss=0.3149, pruned_loss=0.07937, over 21569.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3252, pruned_loss=0.08428, over 4275926.09 frames. 
], batch size: 389, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:45:04,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1423872.0, ans=0.0 2023-06-23 06:45:07,739 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.341e+02 5.290e+02 6.961e+02 9.016e+02 2.497e+03, threshold=1.392e+03, percent-clipped=15.0 2023-06-23 06:45:43,561 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=12.0 2023-06-23 06:46:33,155 INFO [train.py:996] (2/4) Epoch 8, batch 23900, loss[loss=0.2406, simple_loss=0.327, pruned_loss=0.07713, over 20658.00 frames. ], tot_loss[loss=0.2533, simple_loss=0.3323, pruned_loss=0.08713, over 4271474.49 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:47:52,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1424412.0, ans=0.1 2023-06-23 06:48:06,324 INFO [train.py:996] (2/4) Epoch 8, batch 23950, loss[loss=0.2402, simple_loss=0.3128, pruned_loss=0.08381, over 21669.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3264, pruned_loss=0.0869, over 4261344.96 frames. ], batch size: 332, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:48:07,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.92 vs. limit=12.0 2023-06-23 06:48:14,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1424472.0, ans=0.125 2023-06-23 06:48:25,218 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.604e+02 5.747e+02 7.946e+02 1.092e+03 1.988e+03, threshold=1.589e+03, percent-clipped=11.0 2023-06-23 06:48:27,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1424532.0, ans=0.125 2023-06-23 06:48:49,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1424532.0, ans=0.0 2023-06-23 06:48:51,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1424592.0, ans=0.125 2023-06-23 06:49:45,375 INFO [train.py:996] (2/4) Epoch 8, batch 24000, loss[loss=0.2396, simple_loss=0.3403, pruned_loss=0.0695, over 19900.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3278, pruned_loss=0.09012, over 4256060.43 frames. ], batch size: 703, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:49:45,375 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 06:50:04,126 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2639, simple_loss=0.3603, pruned_loss=0.08376, over 1796401.00 frames. 2023-06-23 06:50:04,126 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 06:51:36,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1425012.0, ans=0.125 2023-06-23 06:51:43,400 INFO [train.py:996] (2/4) Epoch 8, batch 24050, loss[loss=0.226, simple_loss=0.3112, pruned_loss=0.07041, over 21771.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3291, pruned_loss=0.09065, over 4264260.84 frames. 
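
[editor's note] The block at batch 24000 above interleaves a validation pass (loss=0.2639 over 1796401 frames) with a peak-memory report (24834MB). The function below is a minimal sketch of such a pass, not the actual train.py code: it averages the loss over the dev set without gradients and reads the peak CUDA allocation; model, dev_loader and compute_loss are placeholders for this run's real objects.

# Hedged sketch of the periodic "Computing validation loss ... Maximum memory" step.
import torch

@torch.no_grad()
def run_validation(model, dev_loader, compute_loss, device="cuda:2"):
    model.eval()
    loss_sum, frame_sum = 0.0, 0.0
    for batch in dev_loader:
        loss, num_frames = compute_loss(model, batch)  # summed loss, frames in batch
        loss_sum += float(loss)
        frame_sum += num_frames
    model.train()
    peak_mb = torch.cuda.max_memory_allocated(device) // (1024 * 1024)
    return loss_sum / frame_sum, peak_mb
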
], batch size: 332, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:52:08,104 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.279e+02 4.739e+02 5.574e+02 8.138e+02 1.478e+03, threshold=1.115e+03, percent-clipped=0.0 2023-06-23 06:53:16,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1425312.0, ans=0.125 2023-06-23 06:53:28,942 INFO [train.py:996] (2/4) Epoch 8, batch 24100, loss[loss=0.2166, simple_loss=0.3095, pruned_loss=0.06187, over 20790.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3283, pruned_loss=0.08872, over 4266305.73 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:53:43,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-23 06:54:03,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1425492.0, ans=0.125 2023-06-23 06:54:25,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1425552.0, ans=0.0 2023-06-23 06:54:57,776 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 06:54:59,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1425612.0, ans=0.1 2023-06-23 06:55:07,247 INFO [train.py:996] (2/4) Epoch 8, batch 24150, loss[loss=0.2935, simple_loss=0.35, pruned_loss=0.1184, over 21726.00 frames. ], tot_loss[loss=0.2545, simple_loss=0.3278, pruned_loss=0.09057, over 4273239.83 frames. ], batch size: 389, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:55:19,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1425672.0, ans=0.2 2023-06-23 06:55:22,323 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.625e+02 4.851e+02 6.515e+02 9.296e+02 1.728e+03, threshold=1.303e+03, percent-clipped=14.0 2023-06-23 06:55:24,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.74 vs. limit=12.0 2023-06-23 06:55:32,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1425732.0, ans=0.0 2023-06-23 06:56:43,549 INFO [train.py:996] (2/4) Epoch 8, batch 24200, loss[loss=0.2193, simple_loss=0.3, pruned_loss=0.06928, over 21682.00 frames. ], tot_loss[loss=0.2561, simple_loss=0.3293, pruned_loss=0.09145, over 4272794.08 frames. ], batch size: 247, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:57:22,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1426092.0, ans=0.025 2023-06-23 06:58:08,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.42 vs. limit=15.0 2023-06-23 06:58:25,332 INFO [train.py:996] (2/4) Epoch 8, batch 24250, loss[loss=0.2119, simple_loss=0.3348, pruned_loss=0.04452, over 20786.00 frames. ], tot_loss[loss=0.2498, simple_loss=0.3275, pruned_loss=0.08608, over 4269948.60 frames. 
], batch size: 607, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 06:58:44,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.260e+02 4.495e+02 7.277e+02 1.167e+03 2.451e+03, threshold=1.455e+03, percent-clipped=16.0 2023-06-23 07:00:04,111 INFO [train.py:996] (2/4) Epoch 8, batch 24300, loss[loss=0.2312, simple_loss=0.3458, pruned_loss=0.05833, over 20832.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3218, pruned_loss=0.08032, over 4265410.63 frames. ], batch size: 607, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:00:04,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1426572.0, ans=0.2 2023-06-23 07:01:24,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.20 vs. limit=15.0 2023-06-23 07:01:33,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1426812.0, ans=0.1 2023-06-23 07:01:46,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1426872.0, ans=0.125 2023-06-23 07:01:47,230 INFO [train.py:996] (2/4) Epoch 8, batch 24350, loss[loss=0.2643, simple_loss=0.3384, pruned_loss=0.09512, over 21828.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3152, pruned_loss=0.07786, over 4266406.61 frames. ], batch size: 298, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:02:03,691 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.048e+02 4.784e+02 6.670e+02 9.592e+02 1.817e+03, threshold=1.334e+03, percent-clipped=7.0 2023-06-23 07:02:41,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.22 vs. limit=10.0 2023-06-23 07:03:27,453 INFO [train.py:996] (2/4) Epoch 8, batch 24400, loss[loss=0.2347, simple_loss=0.3091, pruned_loss=0.08017, over 21588.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3189, pruned_loss=0.0812, over 4269798.87 frames. ], batch size: 263, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:03:30,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1427172.0, ans=0.2 2023-06-23 07:03:39,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1427172.0, ans=0.0 2023-06-23 07:04:26,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1427292.0, ans=0.0 2023-06-23 07:04:27,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1427292.0, ans=0.0 2023-06-23 07:04:32,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1427352.0, ans=0.0 2023-06-23 07:04:32,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1427352.0, ans=0.0 2023-06-23 07:05:07,060 INFO [train.py:996] (2/4) Epoch 8, batch 24450, loss[loss=0.3199, simple_loss=0.3991, pruned_loss=0.1204, over 21461.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3224, pruned_loss=0.08367, over 4273564.20 frames. 
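
[editor's note] The learning rate drifts from 3.67e-03 down to 3.64e-03 across this stretch. An Eden-style schedule of the form below, using the base LR and schedule constants from this run's configuration dump earlier in the log (assumed here to be base_lr=0.045, lr_batches=7500, lr_epochs=1.5), reproduces values in that range; the cumulative step count plugged in is a rough guess, not a number taken from the log.

# Hedged sketch of an Eden-style LR schedule consistent with the logged lr values.
def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Roughly 210k cumulative optimizer steps by this point in epoch 8 (assumed):
print(eden_lr(0.045, batch=210_000, epoch=7.9))   # ~3.7e-03, in the logged range
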
], batch size: 507, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:05:07,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1427472.0, ans=0.0 2023-06-23 07:05:07,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1427472.0, ans=0.125 2023-06-23 07:05:23,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.696e+02 5.464e+02 7.459e+02 1.124e+03 2.090e+03, threshold=1.492e+03, percent-clipped=14.0 2023-06-23 07:05:36,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1427532.0, ans=0.1 2023-06-23 07:05:50,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1427592.0, ans=0.125 2023-06-23 07:06:23,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1427652.0, ans=0.015 2023-06-23 07:06:34,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1427712.0, ans=0.125 2023-06-23 07:06:44,603 INFO [train.py:996] (2/4) Epoch 8, batch 24500, loss[loss=0.2212, simple_loss=0.2968, pruned_loss=0.07276, over 21850.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3229, pruned_loss=0.08394, over 4276215.81 frames. ], batch size: 282, lr: 3.66e-03, grad_scale: 32.0 2023-06-23 07:07:24,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1427892.0, ans=0.1 2023-06-23 07:07:26,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2023-06-23 07:07:35,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1427892.0, ans=0.0 2023-06-23 07:08:24,393 INFO [train.py:996] (2/4) Epoch 8, batch 24550, loss[loss=0.2412, simple_loss=0.3095, pruned_loss=0.08643, over 21835.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3237, pruned_loss=0.08562, over 4273053.37 frames. ], batch size: 247, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:08:24,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1428072.0, ans=0.125 2023-06-23 07:08:50,838 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.336e+02 4.677e+02 6.091e+02 7.782e+02 1.609e+03, threshold=1.218e+03, percent-clipped=3.0 2023-06-23 07:08:54,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1428132.0, ans=0.0 2023-06-23 07:09:47,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=22.5 2023-06-23 07:10:02,236 INFO [train.py:996] (2/4) Epoch 8, batch 24600, loss[loss=0.1896, simple_loss=0.2513, pruned_loss=0.06393, over 21969.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3201, pruned_loss=0.08679, over 4269274.53 frames. 
], batch size: 103, lr: 3.66e-03, grad_scale: 16.0 2023-06-23 07:10:15,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1428372.0, ans=0.0 2023-06-23 07:10:20,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1428372.0, ans=0.0 2023-06-23 07:10:56,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1428492.0, ans=0.1 2023-06-23 07:11:14,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1428552.0, ans=0.0 2023-06-23 07:11:38,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1428612.0, ans=0.0 2023-06-23 07:11:38,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1428612.0, ans=0.125 2023-06-23 07:11:38,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1428612.0, ans=0.05 2023-06-23 07:11:39,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1428672.0, ans=0.125 2023-06-23 07:11:40,891 INFO [train.py:996] (2/4) Epoch 8, batch 24650, loss[loss=0.2023, simple_loss=0.2672, pruned_loss=0.06872, over 21299.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3116, pruned_loss=0.08509, over 4270903.15 frames. ], batch size: 131, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:12:13,320 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.435e+02 5.561e+02 8.132e+02 1.139e+03 1.963e+03, threshold=1.626e+03, percent-clipped=16.0 2023-06-23 07:12:51,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1428852.0, ans=0.125 2023-06-23 07:13:19,836 INFO [train.py:996] (2/4) Epoch 8, batch 24700, loss[loss=0.2357, simple_loss=0.2982, pruned_loss=0.08666, over 21551.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3093, pruned_loss=0.08336, over 4254036.70 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:14:19,457 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.59 vs. limit=10.0 2023-06-23 07:14:35,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1429152.0, ans=0.1 2023-06-23 07:14:48,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1429212.0, ans=0.025 2023-06-23 07:14:52,851 INFO [train.py:996] (2/4) Epoch 8, batch 24750, loss[loss=0.2298, simple_loss=0.3639, pruned_loss=0.04783, over 19777.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3044, pruned_loss=0.08109, over 4248386.19 frames. 
], batch size: 702, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:14:54,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1429272.0, ans=0.125 2023-06-23 07:15:19,821 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.105e+02 4.833e+02 6.692e+02 9.106e+02 2.171e+03, threshold=1.338e+03, percent-clipped=2.0 2023-06-23 07:15:23,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1429332.0, ans=0.0 2023-06-23 07:15:49,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1429392.0, ans=0.125 2023-06-23 07:16:31,254 INFO [train.py:996] (2/4) Epoch 8, batch 24800, loss[loss=0.1823, simple_loss=0.2322, pruned_loss=0.06616, over 20829.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2988, pruned_loss=0.08035, over 4255885.51 frames. ], batch size: 609, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:16:52,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1429572.0, ans=0.1 2023-06-23 07:17:14,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1429632.0, ans=0.0 2023-06-23 07:17:34,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1429752.0, ans=0.125 2023-06-23 07:17:51,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.73 vs. limit=5.0 2023-06-23 07:18:04,055 INFO [train.py:996] (2/4) Epoch 8, batch 24850, loss[loss=0.2001, simple_loss=0.2595, pruned_loss=0.07036, over 21290.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.2995, pruned_loss=0.0817, over 4265299.81 frames. ], batch size: 159, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:18:32,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1429932.0, ans=0.0 2023-06-23 07:18:33,264 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.302e+02 4.718e+02 6.141e+02 8.581e+02 1.389e+03, threshold=1.228e+03, percent-clipped=1.0 2023-06-23 07:18:56,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1429992.0, ans=0.0 2023-06-23 07:18:58,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1429992.0, ans=0.09899494936611666 2023-06-23 07:19:13,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1430052.0, ans=0.125 2023-06-23 07:19:15,288 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:19:16,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1430052.0, ans=0.025 2023-06-23 07:19:28,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1430112.0, ans=0.1 2023-06-23 07:19:49,208 INFO [train.py:996] (2/4) Epoch 8, batch 24900, loss[loss=0.2802, simple_loss=0.3475, pruned_loss=0.1065, over 21594.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.302, pruned_loss=0.08207, over 4272058.20 frames. 
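
[editor's note] The Whitening entries compare a per-module "whiteness" metric against a limit (e.g. encoder_embed.convnext.out_whiten above logs metric=3.73 vs. limit=5.0). One plausible reading of that metric, assumed here rather than taken from the verbatim scaling.py code, is the ratio of the mean squared eigenvalue of the feature covariance to its squared mean eigenvalue: it equals 1.0 for a perfectly white (isotropic) covariance and grows as a few directions dominate.

# Hedged sketch of a covariance-whiteness measure like the logged metric.
import torch

def whitening_metric(x: torch.Tensor) -> float:
    # x: (num_frames, num_channels) activations from one module
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    return float((eigs ** 2).mean() / eigs.mean() ** 2)

roughly_white = torch.randn(2000, 256)
lopsided = roughly_white.clone()
lopsided[:, :8] *= 10.0                    # a few channels dominate the variance
print(whitening_metric(roughly_white))     # close to 1
print(whitening_metric(lopsided))          # much larger, like the metrics logged above
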
], batch size: 389, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:20:57,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=13.11 vs. limit=15.0 2023-06-23 07:21:34,303 INFO [train.py:996] (2/4) Epoch 8, batch 24950, loss[loss=0.2333, simple_loss=0.3113, pruned_loss=0.07767, over 20632.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.31, pruned_loss=0.08644, over 4265799.93 frames. ], batch size: 607, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:22:00,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1430532.0, ans=0.125 2023-06-23 07:22:03,158 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.572e+02 4.703e+02 5.868e+02 8.505e+02 2.192e+03, threshold=1.174e+03, percent-clipped=6.0 2023-06-23 07:22:06,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1430532.0, ans=0.0 2023-06-23 07:22:11,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1430532.0, ans=0.5 2023-06-23 07:23:04,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1430712.0, ans=0.0 2023-06-23 07:23:21,889 INFO [train.py:996] (2/4) Epoch 8, batch 25000, loss[loss=0.2495, simple_loss=0.3028, pruned_loss=0.09812, over 21352.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3166, pruned_loss=0.08856, over 4270962.84 frames. ], batch size: 507, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:23:27,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1430772.0, ans=0.1 2023-06-23 07:24:06,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1430892.0, ans=0.125 2023-06-23 07:24:30,862 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 07:24:43,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1431012.0, ans=0.125 2023-06-23 07:24:53,440 INFO [train.py:996] (2/4) Epoch 8, batch 25050, loss[loss=0.2086, simple_loss=0.2732, pruned_loss=0.072, over 21780.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3106, pruned_loss=0.08736, over 4276281.76 frames. 
], batch size: 317, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:25:17,632 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.995e+02 4.487e+02 5.838e+02 7.912e+02 1.332e+03, threshold=1.168e+03, percent-clipped=3.0 2023-06-23 07:25:52,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1431252.0, ans=0.1 2023-06-23 07:26:00,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1431252.0, ans=0.2 2023-06-23 07:26:13,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1431312.0, ans=0.125 2023-06-23 07:26:14,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1431312.0, ans=0.125 2023-06-23 07:26:33,599 INFO [train.py:996] (2/4) Epoch 8, batch 25100, loss[loss=0.2237, simple_loss=0.3096, pruned_loss=0.06891, over 21544.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3047, pruned_loss=0.08602, over 4271039.05 frames. ], batch size: 230, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:27:46,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1431612.0, ans=10.0 2023-06-23 07:28:11,755 INFO [train.py:996] (2/4) Epoch 8, batch 25150, loss[loss=0.2063, simple_loss=0.2941, pruned_loss=0.05929, over 21762.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3084, pruned_loss=0.08328, over 4260984.67 frames. ], batch size: 247, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:28:30,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1431732.0, ans=0.125 2023-06-23 07:28:34,953 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.013e+02 4.444e+02 6.518e+02 1.039e+03 2.142e+03, threshold=1.304e+03, percent-clipped=17.0 2023-06-23 07:28:39,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.41 vs. limit=15.0 2023-06-23 07:28:55,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1431792.0, ans=0.1 2023-06-23 07:28:57,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1431792.0, ans=0.0 2023-06-23 07:29:48,547 INFO [train.py:996] (2/4) Epoch 8, batch 25200, loss[loss=0.2347, simple_loss=0.3304, pruned_loss=0.06947, over 21608.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3085, pruned_loss=0.08157, over 4264319.86 frames. ], batch size: 263, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:30:01,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.48 vs. 
limit=22.5 2023-06-23 07:30:16,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1432032.0, ans=0.2 2023-06-23 07:30:19,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1432032.0, ans=0.125 2023-06-23 07:30:31,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1432092.0, ans=0.125 2023-06-23 07:30:42,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1432152.0, ans=0.125 2023-06-23 07:30:50,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1432152.0, ans=0.0 2023-06-23 07:31:26,046 INFO [train.py:996] (2/4) Epoch 8, batch 25250, loss[loss=0.2323, simple_loss=0.2912, pruned_loss=0.08665, over 21504.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3058, pruned_loss=0.07979, over 4258271.73 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:31:49,876 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.012e+02 4.456e+02 5.347e+02 9.796e+02 2.256e+03, threshold=1.069e+03, percent-clipped=12.0 2023-06-23 07:31:56,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1432332.0, ans=0.125 2023-06-23 07:32:13,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1432392.0, ans=0.0 2023-06-23 07:32:13,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1432392.0, ans=0.125 2023-06-23 07:32:56,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1432512.0, ans=0.125 2023-06-23 07:32:58,683 INFO [train.py:996] (2/4) Epoch 8, batch 25300, loss[loss=0.2999, simple_loss=0.3629, pruned_loss=0.1185, over 21575.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3037, pruned_loss=0.07901, over 4242677.65 frames. ], batch size: 414, lr: 3.65e-03, grad_scale: 32.0 2023-06-23 07:34:12,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1432752.0, ans=0.125 2023-06-23 07:34:37,958 INFO [train.py:996] (2/4) Epoch 8, batch 25350, loss[loss=0.1896, simple_loss=0.2733, pruned_loss=0.05299, over 21716.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3046, pruned_loss=0.07839, over 4253943.83 frames. 
], batch size: 298, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:34:52,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1432872.0, ans=0.0 2023-06-23 07:35:01,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1432932.0, ans=0.1 2023-06-23 07:35:02,941 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.301e+02 4.631e+02 6.587e+02 1.003e+03 1.652e+03, threshold=1.317e+03, percent-clipped=14.0 2023-06-23 07:35:05,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1432932.0, ans=0.2 2023-06-23 07:35:08,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1432932.0, ans=0.1 2023-06-23 07:35:34,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1433052.0, ans=0.2 2023-06-23 07:36:14,608 INFO [train.py:996] (2/4) Epoch 8, batch 25400, loss[loss=0.2474, simple_loss=0.3079, pruned_loss=0.09343, over 21759.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3001, pruned_loss=0.07729, over 4256328.22 frames. ], batch size: 316, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:36:52,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.28 vs. limit=15.0 2023-06-23 07:37:02,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-23 07:37:26,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.01 vs. limit=22.5 2023-06-23 07:37:44,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1433412.0, ans=0.125 2023-06-23 07:37:51,413 INFO [train.py:996] (2/4) Epoch 8, batch 25450, loss[loss=0.2002, simple_loss=0.299, pruned_loss=0.05072, over 21801.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3002, pruned_loss=0.07787, over 4257647.21 frames. ], batch size: 282, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:38:06,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1433472.0, ans=0.125 2023-06-23 07:38:17,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.052e+02 4.125e+02 5.251e+02 6.939e+02 1.396e+03, threshold=1.050e+03, percent-clipped=1.0 2023-06-23 07:38:47,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1433652.0, ans=0.2 2023-06-23 07:39:32,038 INFO [train.py:996] (2/4) Epoch 8, batch 25500, loss[loss=0.3154, simple_loss=0.3833, pruned_loss=0.1238, over 21473.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3007, pruned_loss=0.07497, over 4260434.69 frames. ], batch size: 471, lr: 3.65e-03, grad_scale: 8.0 2023-06-23 07:39:53,879 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.67 vs. 
limit=10.0 2023-06-23 07:40:11,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1433892.0, ans=0.125 2023-06-23 07:40:19,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1433892.0, ans=0.1 2023-06-23 07:40:39,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1433952.0, ans=0.04949747468305833 2023-06-23 07:41:11,148 INFO [train.py:996] (2/4) Epoch 8, batch 25550, loss[loss=0.2091, simple_loss=0.2855, pruned_loss=0.06633, over 21462.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.309, pruned_loss=0.07598, over 4256734.54 frames. ], batch size: 131, lr: 3.65e-03, grad_scale: 8.0 2023-06-23 07:41:29,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1434072.0, ans=0.0 2023-06-23 07:41:38,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.786e+02 4.210e+02 5.304e+02 7.417e+02 2.336e+03, threshold=1.061e+03, percent-clipped=9.0 2023-06-23 07:42:06,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1434192.0, ans=0.04949747468305833 2023-06-23 07:42:06,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-23 07:42:33,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1434312.0, ans=0.125 2023-06-23 07:42:55,551 INFO [train.py:996] (2/4) Epoch 8, batch 25600, loss[loss=0.2467, simple_loss=0.3236, pruned_loss=0.08491, over 21885.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3131, pruned_loss=0.07742, over 4253165.91 frames. ], batch size: 371, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:43:07,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1434372.0, ans=0.1 2023-06-23 07:43:39,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1434492.0, ans=0.125 2023-06-23 07:43:43,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.25 vs. limit=10.0 2023-06-23 07:44:02,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-23 07:44:05,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.08 vs. limit=22.5 2023-06-23 07:44:07,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-23 07:44:11,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1434612.0, ans=0.0 2023-06-23 07:44:33,947 INFO [train.py:996] (2/4) Epoch 8, batch 25650, loss[loss=0.1971, simple_loss=0.2637, pruned_loss=0.06527, over 21653.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3138, pruned_loss=0.0803, over 4260964.31 frames. 
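
[editor's note] The grad_scale field swings between 8.0, 16.0 and 32.0 through this stretch (it drops to 8.0 at batch 25500 and is back at 16.0 by batch 25600). With fp16 training, that pattern is characteristic of a dynamic loss scale that is halved when non-finite gradients appear and grown again after a run of clean steps. The class below is a hedged sketch of that behavior only; the growth interval and factors are illustrative, not the run's actual scaler settings.

# Hedged sketch of a dynamic fp16 loss scale behind the grad_scale values.
class DynamicGradScale:
    def __init__(self, init_scale=16.0, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, grads_finite: bool) -> float:
        if not grads_finite:
            self.scale = max(self.scale / 2.0, 1.0)   # back off on overflow
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.growth_interval:
                self.scale *= 2.0                     # try a larger scale again
                self.good_steps = 0
        return self.scale

scaler = DynamicGradScale()
print(scaler.update(grads_finite=False))   # 8.0, like the drop seen at batch 25500
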
], batch size: 282, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:44:55,720 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.690e+02 5.926e+02 8.067e+02 1.090e+03 2.033e+03, threshold=1.613e+03, percent-clipped=28.0 2023-06-23 07:45:42,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-23 07:45:46,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1434852.0, ans=0.125 2023-06-23 07:46:11,862 INFO [train.py:996] (2/4) Epoch 8, batch 25700, loss[loss=0.2692, simple_loss=0.3213, pruned_loss=0.1085, over 21712.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3102, pruned_loss=0.08118, over 4266001.51 frames. ], batch size: 441, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:46:17,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-23 07:46:48,290 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-23 07:47:14,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1435152.0, ans=0.0 2023-06-23 07:47:18,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1435152.0, ans=0.125 2023-06-23 07:47:19,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1435152.0, ans=0.125 2023-06-23 07:47:46,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1435212.0, ans=0.0 2023-06-23 07:47:52,532 INFO [train.py:996] (2/4) Epoch 8, batch 25750, loss[loss=0.2797, simple_loss=0.3397, pruned_loss=0.1099, over 21810.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3169, pruned_loss=0.0842, over 4271470.02 frames. ], batch size: 118, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:47:54,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1435272.0, ans=0.2 2023-06-23 07:48:16,425 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=22.5 2023-06-23 07:48:17,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1435332.0, ans=0.125 2023-06-23 07:48:17,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1435332.0, ans=0.125 2023-06-23 07:48:25,284 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.333e+02 5.092e+02 6.488e+02 8.589e+02 2.442e+03, threshold=1.298e+03, percent-clipped=2.0 2023-06-23 07:49:00,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. 
limit=10.0 2023-06-23 07:49:08,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1435452.0, ans=0.125 2023-06-23 07:49:21,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1435512.0, ans=0.2 2023-06-23 07:49:24,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1435512.0, ans=0.0 2023-06-23 07:49:38,474 INFO [train.py:996] (2/4) Epoch 8, batch 25800, loss[loss=0.3258, simple_loss=0.3929, pruned_loss=0.1294, over 21769.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3295, pruned_loss=0.08867, over 4269951.50 frames. ], batch size: 441, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:49:57,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-23 07:50:02,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1435632.0, ans=0.0 2023-06-23 07:50:17,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1435632.0, ans=0.125 2023-06-23 07:50:20,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1435692.0, ans=0.1 2023-06-23 07:50:32,687 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.69 vs. limit=12.0 2023-06-23 07:51:22,164 INFO [train.py:996] (2/4) Epoch 8, batch 25850, loss[loss=0.2369, simple_loss=0.3004, pruned_loss=0.08674, over 20291.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3297, pruned_loss=0.08696, over 4270459.38 frames. ], batch size: 707, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:51:37,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1435932.0, ans=0.0 2023-06-23 07:51:45,485 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.179e+02 4.988e+02 6.409e+02 1.000e+03 3.081e+03, threshold=1.282e+03, percent-clipped=14.0 2023-06-23 07:51:47,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1435932.0, ans=0.125 2023-06-23 07:51:58,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1435992.0, ans=0.0 2023-06-23 07:52:00,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1435992.0, ans=0.1 2023-06-23 07:52:21,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1436052.0, ans=0.0 2023-06-23 07:52:25,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1436052.0, ans=0.1 2023-06-23 07:52:57,066 INFO [train.py:996] (2/4) Epoch 8, batch 25900, loss[loss=0.282, simple_loss=0.37, pruned_loss=0.097, over 21804.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3314, pruned_loss=0.08866, over 4275013.86 frames. 
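
[editor's note] The balancer entries scattered through these blocks track per-module activation constraints: min_positive/max_positive bound the fraction of positive activations, min_abs/max_abs bound their typical magnitude, and prob appears to control how often the constraint is enforced. The function below is only a diagnostic sketch under that assumed reading; it measures where a tensor stands relative to such bounds and is not scaling.py's actual gradient-modifying Balancer.

# Hedged diagnostic sketch for balancer-style bounds on activations.
import torch

def balancer_diagnostics(x: torch.Tensor,
                         min_positive=0.05, max_positive=0.95,
                         min_abs=0.2, max_abs=10.0):
    # x: (num_frames, num_channels). Per channel: fraction of positive values,
    # mean absolute value, and a count of channels outside the configured bounds.
    frac_pos = (x > 0).float().mean(dim=0)
    mean_abs = x.abs().mean(dim=0)
    out_of_range = ((frac_pos < min_positive) | (frac_pos > max_positive)
                    | (mean_abs < min_abs) | (mean_abs > max_abs))
    return frac_pos, mean_abs, int(out_of_range.sum())

x = torch.randn(1000, 256).relu()          # ReLU output: about half the values positive
frac_pos, mean_abs, bad = balancer_diagnostics(x)
print(float(frac_pos.mean()), float(mean_abs.mean()), bad)
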
], batch size: 282, lr: 3.65e-03, grad_scale: 16.0 2023-06-23 07:53:03,104 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=15.0 2023-06-23 07:53:10,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1436172.0, ans=0.125 2023-06-23 07:53:20,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1436232.0, ans=0.125 2023-06-23 07:53:27,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1436232.0, ans=0.035 2023-06-23 07:53:34,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1436292.0, ans=0.2 2023-06-23 07:53:40,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1436292.0, ans=0.125 2023-06-23 07:54:01,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.89 vs. limit=15.0 2023-06-23 07:54:20,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1436412.0, ans=0.5 2023-06-23 07:54:36,062 INFO [train.py:996] (2/4) Epoch 8, batch 25950, loss[loss=0.2248, simple_loss=0.3098, pruned_loss=0.06992, over 21678.00 frames. ], tot_loss[loss=0.2583, simple_loss=0.3358, pruned_loss=0.09041, over 4277461.76 frames. ], batch size: 231, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 07:54:36,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1436472.0, ans=0.125 2023-06-23 07:54:42,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1436472.0, ans=0.125 2023-06-23 07:55:03,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.640e+02 4.822e+02 6.504e+02 9.167e+02 2.432e+03, threshold=1.301e+03, percent-clipped=14.0 2023-06-23 07:56:02,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-23 07:56:14,688 INFO [train.py:996] (2/4) Epoch 8, batch 26000, loss[loss=0.2843, simple_loss=0.3562, pruned_loss=0.1061, over 21375.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.335, pruned_loss=0.08878, over 4275931.11 frames. ], batch size: 549, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 07:56:27,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1436772.0, ans=0.125 2023-06-23 07:56:27,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1436772.0, ans=0.1 2023-06-23 07:57:04,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1436892.0, ans=0.0 2023-06-23 07:57:51,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1437072.0, ans=0.125 2023-06-23 07:57:57,827 INFO [train.py:996] (2/4) Epoch 8, batch 26050, loss[loss=0.2611, simple_loss=0.3243, pruned_loss=0.09895, over 21865.00 frames. 
], tot_loss[loss=0.2564, simple_loss=0.333, pruned_loss=0.08988, over 4279758.83 frames. ], batch size: 124, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 07:58:19,765 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 4.589e+02 6.004e+02 7.871e+02 1.709e+03, threshold=1.201e+03, percent-clipped=5.0 2023-06-23 07:59:12,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1437252.0, ans=0.125 2023-06-23 07:59:14,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1437312.0, ans=0.125 2023-06-23 07:59:36,459 INFO [train.py:996] (2/4) Epoch 8, batch 26100, loss[loss=0.2157, simple_loss=0.282, pruned_loss=0.0747, over 21589.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3282, pruned_loss=0.08905, over 4278399.11 frames. ], batch size: 212, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 07:59:40,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1437372.0, ans=15.0 2023-06-23 07:59:54,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1437432.0, ans=0.0 2023-06-23 08:00:25,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1437492.0, ans=0.125 2023-06-23 08:01:16,852 INFO [train.py:996] (2/4) Epoch 8, batch 26150, loss[loss=0.2652, simple_loss=0.3424, pruned_loss=0.09402, over 21437.00 frames. ], tot_loss[loss=0.2522, simple_loss=0.3255, pruned_loss=0.08946, over 4286671.62 frames. ], batch size: 131, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:01:45,497 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.490e+02 4.992e+02 6.219e+02 9.688e+02 1.983e+03, threshold=1.244e+03, percent-clipped=15.0 2023-06-23 08:01:55,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1437732.0, ans=0.125 2023-06-23 08:02:19,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1437852.0, ans=0.125 2023-06-23 08:02:55,504 INFO [train.py:996] (2/4) Epoch 8, batch 26200, loss[loss=0.2579, simple_loss=0.3604, pruned_loss=0.07766, over 21760.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3274, pruned_loss=0.0882, over 4281360.71 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:04:14,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1438212.0, ans=0.125 2023-06-23 08:04:34,802 INFO [train.py:996] (2/4) Epoch 8, batch 26250, loss[loss=0.2365, simple_loss=0.3072, pruned_loss=0.08288, over 21246.00 frames. ], tot_loss[loss=0.2521, simple_loss=0.3298, pruned_loss=0.08719, over 4275068.23 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:05:07,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.52 vs. 
limit=15.0 2023-06-23 08:05:07,705 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 4.875e+02 6.519e+02 1.074e+03 2.423e+03, threshold=1.304e+03, percent-clipped=19.0 2023-06-23 08:05:19,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1438392.0, ans=0.0 2023-06-23 08:05:20,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1438392.0, ans=0.1 2023-06-23 08:05:26,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1438392.0, ans=0.125 2023-06-23 08:05:50,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1438512.0, ans=0.1 2023-06-23 08:06:05,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.67 vs. limit=6.0 2023-06-23 08:06:12,247 INFO [train.py:996] (2/4) Epoch 8, batch 26300, loss[loss=0.2178, simple_loss=0.2787, pruned_loss=0.07841, over 21240.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3275, pruned_loss=0.08825, over 4284974.99 frames. ], batch size: 608, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:06:41,603 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:07:22,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1438752.0, ans=0.0 2023-06-23 08:07:39,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1438812.0, ans=0.125 2023-06-23 08:08:01,847 INFO [train.py:996] (2/4) Epoch 8, batch 26350, loss[loss=0.2789, simple_loss=0.3454, pruned_loss=0.1062, over 21274.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3253, pruned_loss=0.08847, over 4284501.81 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:08:02,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1438872.0, ans=0.1 2023-06-23 08:08:04,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1438872.0, ans=0.125 2023-06-23 08:08:30,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.621e+02 4.985e+02 6.232e+02 7.669e+02 1.189e+03, threshold=1.246e+03, percent-clipped=0.0 2023-06-23 08:09:40,176 INFO [train.py:996] (2/4) Epoch 8, batch 26400, loss[loss=0.2424, simple_loss=0.2919, pruned_loss=0.09639, over 21518.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3203, pruned_loss=0.08905, over 4279557.52 frames. ], batch size: 441, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:09:48,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1439172.0, ans=0.1 2023-06-23 08:09:48,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1439172.0, ans=0.2 2023-06-23 08:10:04,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.64 vs. 
limit=15.0 2023-06-23 08:10:07,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1439232.0, ans=0.0 2023-06-23 08:10:23,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1439292.0, ans=0.1 2023-06-23 08:11:20,558 INFO [train.py:996] (2/4) Epoch 8, batch 26450, loss[loss=0.2373, simple_loss=0.3051, pruned_loss=0.08471, over 21220.00 frames. ], tot_loss[loss=0.2511, simple_loss=0.3231, pruned_loss=0.08952, over 4278478.06 frames. ], batch size: 159, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:11:50,684 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 6.327e+02 8.779e+02 1.313e+03 2.472e+03, threshold=1.756e+03, percent-clipped=25.0 2023-06-23 08:12:30,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1439652.0, ans=0.1 2023-06-23 08:12:53,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.71 vs. limit=15.0 2023-06-23 08:12:56,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1439712.0, ans=0.1 2023-06-23 08:13:01,268 INFO [train.py:996] (2/4) Epoch 8, batch 26500, loss[loss=0.1911, simple_loss=0.2478, pruned_loss=0.06722, over 21751.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.324, pruned_loss=0.08785, over 4270481.72 frames. ], batch size: 124, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:14:29,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1440012.0, ans=0.125 2023-06-23 08:14:39,342 INFO [train.py:996] (2/4) Epoch 8, batch 26550, loss[loss=0.2114, simple_loss=0.304, pruned_loss=0.05934, over 21685.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3235, pruned_loss=0.08579, over 4269896.80 frames. ], batch size: 332, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:15:20,121 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.436e+02 5.263e+02 8.028e+02 1.102e+03 2.204e+03, threshold=1.606e+03, percent-clipped=5.0 2023-06-23 08:15:31,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1440192.0, ans=0.125 2023-06-23 08:16:09,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1440312.0, ans=0.125 2023-06-23 08:16:23,283 INFO [train.py:996] (2/4) Epoch 8, batch 26600, loss[loss=0.2755, simple_loss=0.349, pruned_loss=0.101, over 21556.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3218, pruned_loss=0.08204, over 4272510.27 frames. 
], batch size: 441, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:16:26,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1440372.0, ans=0.125 2023-06-23 08:16:31,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1440372.0, ans=0.125 2023-06-23 08:16:56,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1440432.0, ans=0.1 2023-06-23 08:17:01,406 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:17:11,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1440492.0, ans=0.125 2023-06-23 08:17:20,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1440492.0, ans=0.1 2023-06-23 08:18:02,012 INFO [train.py:996] (2/4) Epoch 8, batch 26650, loss[loss=0.1693, simple_loss=0.2606, pruned_loss=0.03904, over 21667.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3147, pruned_loss=0.08056, over 4255021.66 frames. ], batch size: 391, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:18:36,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.026e+02 4.294e+02 5.616e+02 7.721e+02 1.631e+03, threshold=1.123e+03, percent-clipped=1.0 2023-06-23 08:19:02,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1440852.0, ans=0.09899494936611666 2023-06-23 08:19:19,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1440912.0, ans=0.0 2023-06-23 08:19:39,922 INFO [train.py:996] (2/4) Epoch 8, batch 26700, loss[loss=0.2302, simple_loss=0.3015, pruned_loss=0.0795, over 21931.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3067, pruned_loss=0.0775, over 4254261.94 frames. ], batch size: 333, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:19:46,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1440972.0, ans=0.125 2023-06-23 08:20:57,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=15.0 2023-06-23 08:21:09,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1441212.0, ans=0.125 2023-06-23 08:21:25,362 INFO [train.py:996] (2/4) Epoch 8, batch 26750, loss[loss=0.2659, simple_loss=0.3494, pruned_loss=0.0912, over 21575.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.307, pruned_loss=0.07642, over 4260188.97 frames. 
], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:21:30,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1441272.0, ans=0.125 2023-06-23 08:21:35,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1441272.0, ans=0.125 2023-06-23 08:21:45,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1441332.0, ans=0.125 2023-06-23 08:21:56,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.690e+02 4.314e+02 5.876e+02 8.992e+02 1.662e+03, threshold=1.175e+03, percent-clipped=13.0 2023-06-23 08:22:07,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1441392.0, ans=0.125 2023-06-23 08:22:32,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1441452.0, ans=0.0 2023-06-23 08:22:34,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1441452.0, ans=0.0 2023-06-23 08:23:10,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1441572.0, ans=0.125 2023-06-23 08:23:11,023 INFO [train.py:996] (2/4) Epoch 8, batch 26800, loss[loss=0.2634, simple_loss=0.3352, pruned_loss=0.0958, over 21611.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3118, pruned_loss=0.07866, over 4264643.30 frames. ], batch size: 389, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:23:21,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1441572.0, ans=0.125 2023-06-23 08:24:49,790 INFO [train.py:996] (2/4) Epoch 8, batch 26850, loss[loss=0.1867, simple_loss=0.2593, pruned_loss=0.05706, over 15516.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3133, pruned_loss=0.08157, over 4261318.44 frames. ], batch size: 60, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:24:52,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1441872.0, ans=0.2 2023-06-23 08:24:53,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1441872.0, ans=0.0 2023-06-23 08:25:14,923 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.645e+02 5.050e+02 6.196e+02 9.210e+02 1.737e+03, threshold=1.239e+03, percent-clipped=8.0 2023-06-23 08:25:33,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1441992.0, ans=0.0 2023-06-23 08:25:47,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1442052.0, ans=0.5 2023-06-23 08:25:59,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1442052.0, ans=0.125 2023-06-23 08:26:00,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.91 vs. limit=12.0 2023-06-23 08:26:22,981 INFO [train.py:996] (2/4) Epoch 8, batch 26900, loss[loss=0.2128, simple_loss=0.2682, pruned_loss=0.07871, over 21167.00 frames. 
], tot_loss[loss=0.2332, simple_loss=0.3046, pruned_loss=0.08086, over 4264581.28 frames. ], batch size: 143, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:26:55,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1442232.0, ans=0.125 2023-06-23 08:27:12,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.75 vs. limit=10.0 2023-06-23 08:28:02,543 INFO [train.py:996] (2/4) Epoch 8, batch 26950, loss[loss=0.2737, simple_loss=0.357, pruned_loss=0.09523, over 21713.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3037, pruned_loss=0.08142, over 4260958.30 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:28:33,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.377e+02 4.817e+02 6.890e+02 1.132e+03 2.322e+03, threshold=1.378e+03, percent-clipped=18.0 2023-06-23 08:29:03,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1442652.0, ans=0.125 2023-06-23 08:29:46,677 INFO [train.py:996] (2/4) Epoch 8, batch 27000, loss[loss=0.2253, simple_loss=0.3197, pruned_loss=0.06543, over 21593.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3048, pruned_loss=0.07974, over 4267660.03 frames. ], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:29:46,677 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 08:30:02,866 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.2419, simple_loss=0.3397, pruned_loss=0.07206, over 1796401.00 frames. 2023-06-23 08:30:02,867 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 08:30:27,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1442832.0, ans=0.1 2023-06-23 08:30:59,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1442892.0, ans=0.0 2023-06-23 08:31:42,025 INFO [train.py:996] (2/4) Epoch 8, batch 27050, loss[loss=0.2251, simple_loss=0.3132, pruned_loss=0.06854, over 21798.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3062, pruned_loss=0.07613, over 4267206.46 frames. ], batch size: 247, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:32:02,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1443132.0, ans=0.125 2023-06-23 08:32:18,721 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.931e+02 4.261e+02 5.762e+02 7.370e+02 1.710e+03, threshold=1.152e+03, percent-clipped=3.0 2023-06-23 08:32:33,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=12.0 2023-06-23 08:32:38,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1443192.0, ans=0.125 2023-06-23 08:33:20,864 INFO [train.py:996] (2/4) Epoch 8, batch 27100, loss[loss=0.2353, simple_loss=0.3322, pruned_loss=0.06917, over 21728.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3084, pruned_loss=0.07758, over 4272537.80 frames. 
], batch size: 389, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:34:08,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1443492.0, ans=0.0 2023-06-23 08:34:18,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1443492.0, ans=0.2 2023-06-23 08:34:53,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1443612.0, ans=0.125 2023-06-23 08:35:01,716 INFO [train.py:996] (2/4) Epoch 8, batch 27150, loss[loss=0.2711, simple_loss=0.3626, pruned_loss=0.08983, over 21716.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3192, pruned_loss=0.08093, over 4279755.47 frames. ], batch size: 298, lr: 3.64e-03, grad_scale: 16.0 2023-06-23 08:35:43,344 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.713e+02 7.787e+02 1.225e+03 2.393e+03, threshold=1.557e+03, percent-clipped=28.0 2023-06-23 08:36:31,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1443912.0, ans=0.125 2023-06-23 08:36:46,641 INFO [train.py:996] (2/4) Epoch 8, batch 27200, loss[loss=0.2571, simple_loss=0.334, pruned_loss=0.0901, over 21481.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3253, pruned_loss=0.08368, over 4272452.68 frames. ], batch size: 211, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:37:45,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-23 08:37:53,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1444152.0, ans=0.1 2023-06-23 08:38:36,507 INFO [train.py:996] (2/4) Epoch 8, batch 27250, loss[loss=0.286, simple_loss=0.3737, pruned_loss=0.09911, over 20821.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3301, pruned_loss=0.08876, over 4272125.93 frames. ], batch size: 608, lr: 3.64e-03, grad_scale: 32.0 2023-06-23 08:39:09,368 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.379e+02 5.510e+02 6.974e+02 9.879e+02 1.721e+03, threshold=1.395e+03, percent-clipped=1.0 2023-06-23 08:40:11,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1444512.0, ans=0.1 2023-06-23 08:40:17,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-23 08:40:17,624 INFO [train.py:996] (2/4) Epoch 8, batch 27300, loss[loss=0.2885, simple_loss=0.3604, pruned_loss=0.1083, over 21255.00 frames. ], tot_loss[loss=0.2569, simple_loss=0.3334, pruned_loss=0.09017, over 4278342.79 frames. 
], batch size: 159, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:41:05,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1444692.0, ans=0.0 2023-06-23 08:41:08,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1444692.0, ans=0.0 2023-06-23 08:41:17,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1444692.0, ans=0.1 2023-06-23 08:41:19,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1444752.0, ans=0.125 2023-06-23 08:41:57,258 INFO [train.py:996] (2/4) Epoch 8, batch 27350, loss[loss=0.2714, simple_loss=0.343, pruned_loss=0.09993, over 21646.00 frames. ], tot_loss[loss=0.2593, simple_loss=0.3366, pruned_loss=0.09102, over 4280578.20 frames. ], batch size: 112, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:42:28,128 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.496e+02 4.684e+02 5.886e+02 7.664e+02 1.698e+03, threshold=1.177e+03, percent-clipped=3.0 2023-06-23 08:43:04,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1445052.0, ans=0.125 2023-06-23 08:43:16,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1445052.0, ans=0.125 2023-06-23 08:43:40,291 INFO [train.py:996] (2/4) Epoch 8, batch 27400, loss[loss=0.2283, simple_loss=0.2882, pruned_loss=0.08427, over 21628.00 frames. ], tot_loss[loss=0.2563, simple_loss=0.3317, pruned_loss=0.09047, over 4287124.97 frames. ], batch size: 414, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:44:02,399 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.79 vs. limit=15.0 2023-06-23 08:44:10,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1445232.0, ans=0.0 2023-06-23 08:44:56,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1445352.0, ans=0.125 2023-06-23 08:45:19,301 INFO [train.py:996] (2/4) Epoch 8, batch 27450, loss[loss=0.2468, simple_loss=0.3374, pruned_loss=0.07815, over 21226.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.325, pruned_loss=0.08865, over 4280130.66 frames. ], batch size: 548, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:45:50,316 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.374e+02 5.145e+02 6.858e+02 8.934e+02 1.227e+03, threshold=1.372e+03, percent-clipped=2.0 2023-06-23 08:46:43,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1445712.0, ans=0.1 2023-06-23 08:46:55,690 INFO [train.py:996] (2/4) Epoch 8, batch 27500, loss[loss=0.29, simple_loss=0.3508, pruned_loss=0.1146, over 21445.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3241, pruned_loss=0.08943, over 4285403.98 frames. 
], batch size: 131, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:47:12,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1445832.0, ans=0.0 2023-06-23 08:47:38,217 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:47:58,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1445952.0, ans=0.125 2023-06-23 08:47:59,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1445952.0, ans=0.125 2023-06-23 08:48:18,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1446012.0, ans=0.0 2023-06-23 08:48:31,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1446012.0, ans=0.125 2023-06-23 08:48:34,500 INFO [train.py:996] (2/4) Epoch 8, batch 27550, loss[loss=0.2423, simple_loss=0.3053, pruned_loss=0.08967, over 21495.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3177, pruned_loss=0.08546, over 4291712.71 frames. ], batch size: 389, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:48:38,767 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.92 vs. limit=15.0 2023-06-23 08:48:47,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1446072.0, ans=0.0 2023-06-23 08:49:05,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1446132.0, ans=0.0 2023-06-23 08:49:06,585 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 4.176e+02 5.018e+02 7.145e+02 2.103e+03, threshold=1.004e+03, percent-clipped=5.0 2023-06-23 08:50:07,962 INFO [train.py:996] (2/4) Epoch 8, batch 27600, loss[loss=0.1987, simple_loss=0.2697, pruned_loss=0.0639, over 21758.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3134, pruned_loss=0.08513, over 4274743.37 frames. ], batch size: 300, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:50:13,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1446372.0, ans=0.125 2023-06-23 08:50:16,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1446372.0, ans=0.125 2023-06-23 08:51:45,609 INFO [train.py:996] (2/4) Epoch 8, batch 27650, loss[loss=0.2182, simple_loss=0.2868, pruned_loss=0.0748, over 21217.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.307, pruned_loss=0.0843, over 4274561.93 frames. 
], batch size: 143, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 08:51:47,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1446672.0, ans=0.2 2023-06-23 08:52:19,284 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.399e+02 4.844e+02 6.403e+02 8.598e+02 1.573e+03, threshold=1.281e+03, percent-clipped=18.0 2023-06-23 08:52:22,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1446792.0, ans=0.2 2023-06-23 08:52:47,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1446852.0, ans=0.125 2023-06-23 08:53:06,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1446912.0, ans=0.125 2023-06-23 08:53:21,689 INFO [train.py:996] (2/4) Epoch 8, batch 27700, loss[loss=0.2523, simple_loss=0.3392, pruned_loss=0.08265, over 21696.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3069, pruned_loss=0.08171, over 4265496.59 frames. ], batch size: 298, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:53:49,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1447032.0, ans=0.2 2023-06-23 08:55:00,508 INFO [train.py:996] (2/4) Epoch 8, batch 27750, loss[loss=0.2253, simple_loss=0.3121, pruned_loss=0.06924, over 21851.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3105, pruned_loss=0.08096, over 4270020.84 frames. ], batch size: 316, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:55:05,769 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-06-23 08:55:09,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1447272.0, ans=0.125 2023-06-23 08:55:30,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1447332.0, ans=0.04949747468305833 2023-06-23 08:55:31,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1447332.0, ans=0.125 2023-06-23 08:55:32,818 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 5.055e+02 6.711e+02 8.629e+02 1.749e+03, threshold=1.342e+03, percent-clipped=9.0 2023-06-23 08:55:48,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1447392.0, ans=0.09899494936611666 2023-06-23 08:56:35,725 INFO [train.py:996] (2/4) Epoch 8, batch 27800, loss[loss=0.2384, simple_loss=0.2992, pruned_loss=0.08874, over 21859.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3086, pruned_loss=0.08136, over 4281637.56 frames. 
], batch size: 282, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:56:47,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1447572.0, ans=0.125 2023-06-23 08:56:49,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1447572.0, ans=0.04949747468305833 2023-06-23 08:57:30,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1447692.0, ans=0.2 2023-06-23 08:57:43,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1447752.0, ans=0.0 2023-06-23 08:58:15,606 INFO [train.py:996] (2/4) Epoch 8, batch 27850, loss[loss=0.2545, simple_loss=0.3256, pruned_loss=0.09171, over 21730.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3093, pruned_loss=0.08349, over 4289619.98 frames. ], batch size: 389, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:58:16,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=22.5 2023-06-23 08:58:20,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-23 08:58:38,023 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 08:58:50,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.261e+02 4.347e+02 5.210e+02 6.936e+02 1.592e+03, threshold=1.042e+03, percent-clipped=2.0 2023-06-23 08:59:10,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.69 vs. limit=15.0 2023-06-23 08:59:16,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.13 vs. limit=15.0 2023-06-23 08:59:33,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1448052.0, ans=0.125 2023-06-23 08:59:57,466 INFO [train.py:996] (2/4) Epoch 8, batch 27900, loss[loss=0.2637, simple_loss=0.381, pruned_loss=0.07326, over 21194.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3187, pruned_loss=0.08419, over 4289995.02 frames. ], batch size: 548, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 08:59:58,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1448172.0, ans=0.125 2023-06-23 08:59:59,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1448172.0, ans=10.0 2023-06-23 09:00:27,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1448232.0, ans=0.125 2023-06-23 09:01:34,332 INFO [train.py:996] (2/4) Epoch 8, batch 27950, loss[loss=0.2905, simple_loss=0.3729, pruned_loss=0.104, over 21460.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3175, pruned_loss=0.08026, over 4291173.53 frames. 
], batch size: 471, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:01:54,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1448532.0, ans=0.0 2023-06-23 09:02:02,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-23 09:02:07,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-23 09:02:08,156 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.302e+02 4.594e+02 6.671e+02 9.534e+02 1.876e+03, threshold=1.334e+03, percent-clipped=19.0 2023-06-23 09:02:17,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1448592.0, ans=0.125 2023-06-23 09:03:02,018 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:03:07,996 INFO [train.py:996] (2/4) Epoch 8, batch 28000, loss[loss=0.2225, simple_loss=0.2942, pruned_loss=0.07541, over 21552.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3143, pruned_loss=0.07771, over 4284854.38 frames. ], batch size: 131, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:03:23,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1448772.0, ans=0.125 2023-06-23 09:03:48,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1448892.0, ans=0.0 2023-06-23 09:04:13,315 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:04:16,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1448952.0, ans=0.125 2023-06-23 09:04:51,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1449072.0, ans=0.0 2023-06-23 09:04:51,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1449072.0, ans=0.2 2023-06-23 09:04:52,708 INFO [train.py:996] (2/4) Epoch 8, batch 28050, loss[loss=0.2271, simple_loss=0.2973, pruned_loss=0.07843, over 21777.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3113, pruned_loss=0.07847, over 4289324.09 frames. ], batch size: 298, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:05:12,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1449132.0, ans=0.025 2023-06-23 09:05:19,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1449132.0, ans=0.0 2023-06-23 09:05:26,543 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.011e+02 4.954e+02 6.052e+02 8.048e+02 2.120e+03, threshold=1.210e+03, percent-clipped=2.0 2023-06-23 09:05:40,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.61 vs. 
limit=12.0 2023-06-23 09:06:05,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1449312.0, ans=0.125 2023-06-23 09:06:22,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn1.whiten.whitening_limit, batch_count=1449312.0, ans=22.5 2023-06-23 09:06:27,340 INFO [train.py:996] (2/4) Epoch 8, batch 28100, loss[loss=0.2187, simple_loss=0.2744, pruned_loss=0.08146, over 21285.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3119, pruned_loss=0.07914, over 4281786.56 frames. ], batch size: 159, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:06:50,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1449432.0, ans=0.125 2023-06-23 09:07:08,525 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:07:55,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1449612.0, ans=0.0 2023-06-23 09:08:06,361 INFO [train.py:996] (2/4) Epoch 8, batch 28150, loss[loss=0.2044, simple_loss=0.2672, pruned_loss=0.07082, over 21550.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3065, pruned_loss=0.07942, over 4275945.48 frames. ], batch size: 263, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:08:21,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1449672.0, ans=0.125 2023-06-23 09:08:39,165 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.413e+02 4.985e+02 7.502e+02 1.116e+03 2.390e+03, threshold=1.500e+03, percent-clipped=18.0 2023-06-23 09:08:49,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.71 vs. limit=5.0 2023-06-23 09:09:30,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1449912.0, ans=0.125 2023-06-23 09:09:44,374 INFO [train.py:996] (2/4) Epoch 8, batch 28200, loss[loss=0.2341, simple_loss=0.2915, pruned_loss=0.08829, over 21383.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3041, pruned_loss=0.08132, over 4270447.93 frames. ], batch size: 211, lr: 3.63e-03, grad_scale: 32.0 2023-06-23 09:10:32,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1450092.0, ans=0.2 2023-06-23 09:10:36,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer_na.min_abs, batch_count=1450092.0, ans=0.02 2023-06-23 09:10:36,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1450092.0, ans=0.125 2023-06-23 09:10:40,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1450152.0, ans=0.125 2023-06-23 09:11:27,080 INFO [train.py:996] (2/4) Epoch 8, batch 28250, loss[loss=0.2429, simple_loss=0.308, pruned_loss=0.08895, over 21893.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3085, pruned_loss=0.08439, over 4273264.37 frames. 
], batch size: 317, lr: 3.63e-03, grad_scale: 8.0 2023-06-23 09:12:04,944 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.687e+02 5.319e+02 7.100e+02 8.712e+02 1.908e+03, threshold=1.420e+03, percent-clipped=3.0 2023-06-23 09:12:05,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1450392.0, ans=0.2 2023-06-23 09:12:47,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.85 vs. limit=22.5 2023-06-23 09:13:05,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1450572.0, ans=0.125 2023-06-23 09:13:06,463 INFO [train.py:996] (2/4) Epoch 8, batch 28300, loss[loss=0.2124, simple_loss=0.3169, pruned_loss=0.05393, over 21215.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3063, pruned_loss=0.08175, over 4266712.96 frames. ], batch size: 548, lr: 3.63e-03, grad_scale: 8.0 2023-06-23 09:13:10,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1450572.0, ans=0.125 2023-06-23 09:13:15,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.95 vs. limit=10.0 2023-06-23 09:13:28,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.88 vs. limit=15.0 2023-06-23 09:14:28,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1450812.0, ans=0.125 2023-06-23 09:14:44,999 INFO [train.py:996] (2/4) Epoch 8, batch 28350, loss[loss=0.1981, simple_loss=0.2658, pruned_loss=0.06524, over 21281.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3016, pruned_loss=0.07548, over 4264213.00 frames. ], batch size: 159, lr: 3.63e-03, grad_scale: 8.0 2023-06-23 09:15:21,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.828e+02 5.599e+02 8.860e+02 1.294e+03 2.489e+03, threshold=1.772e+03, percent-clipped=23.0 2023-06-23 09:15:27,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1450992.0, ans=0.125 2023-06-23 09:15:30,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1450992.0, ans=10.0 2023-06-23 09:15:36,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1450992.0, ans=0.2 2023-06-23 09:15:55,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1451052.0, ans=0.0 2023-06-23 09:16:11,414 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.42 vs. limit=15.0 2023-06-23 09:16:23,415 INFO [train.py:996] (2/4) Epoch 8, batch 28400, loss[loss=0.2419, simple_loss=0.3059, pruned_loss=0.08892, over 21450.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.299, pruned_loss=0.07537, over 4264222.13 frames. ], batch size: 211, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:16:44,555 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. 
limit=15.0 2023-06-23 09:17:11,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1451292.0, ans=0.125 2023-06-23 09:18:03,287 INFO [train.py:996] (2/4) Epoch 8, batch 28450, loss[loss=0.2253, simple_loss=0.2996, pruned_loss=0.07546, over 20675.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3039, pruned_loss=0.07934, over 4270831.97 frames. ], batch size: 607, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:18:19,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-23 09:18:20,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1451532.0, ans=0.125 2023-06-23 09:18:23,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1451532.0, ans=0.125 2023-06-23 09:18:41,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.622e+02 7.842e+02 1.295e+03 2.358e+03, threshold=1.568e+03, percent-clipped=7.0 2023-06-23 09:18:43,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1451592.0, ans=0.1 2023-06-23 09:18:57,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.37 vs. limit=10.0 2023-06-23 09:19:39,542 INFO [train.py:996] (2/4) Epoch 8, batch 28500, loss[loss=0.2844, simple_loss=0.3572, pruned_loss=0.1059, over 21811.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3056, pruned_loss=0.08145, over 4277179.17 frames. ], batch size: 118, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:19:59,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1451832.0, ans=0.125 2023-06-23 09:20:16,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1451832.0, ans=0.2 2023-06-23 09:20:22,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.40 vs. limit=6.0 2023-06-23 09:20:28,817 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:20:33,672 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:20:35,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1451892.0, ans=0.125 2023-06-23 09:20:38,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1451892.0, ans=0.125 2023-06-23 09:21:16,165 INFO [train.py:996] (2/4) Epoch 8, batch 28550, loss[loss=0.4143, simple_loss=0.4722, pruned_loss=0.1782, over 21452.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3142, pruned_loss=0.0852, over 4279673.19 frames. 
], batch size: 507, lr: 3.63e-03, grad_scale: 16.0 2023-06-23 09:21:18,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1452072.0, ans=0.1 2023-06-23 09:21:20,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-23 09:22:02,860 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.502e+02 4.663e+02 6.262e+02 9.623e+02 1.798e+03, threshold=1.252e+03, percent-clipped=1.0 2023-06-23 09:22:03,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1452192.0, ans=0.1 2023-06-23 09:23:02,374 INFO [train.py:996] (2/4) Epoch 8, batch 28600, loss[loss=0.2863, simple_loss=0.3496, pruned_loss=0.1116, over 20717.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3218, pruned_loss=0.08792, over 4273979.57 frames. ], batch size: 607, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:23:10,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1452372.0, ans=0.0 2023-06-23 09:23:39,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.90 vs. limit=15.0 2023-06-23 09:23:49,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.60 vs. limit=8.0 2023-06-23 09:24:40,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1452672.0, ans=0.0 2023-06-23 09:24:40,362 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:24:41,370 INFO [train.py:996] (2/4) Epoch 8, batch 28650, loss[loss=0.2111, simple_loss=0.2796, pruned_loss=0.0713, over 21673.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3141, pruned_loss=0.08566, over 4274083.99 frames. ], batch size: 333, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:25:10,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1452732.0, ans=0.125 2023-06-23 09:25:23,235 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.145e+02 4.493e+02 5.758e+02 7.930e+02 1.580e+03, threshold=1.152e+03, percent-clipped=4.0 2023-06-23 09:25:26,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_ff3.min_abs, batch_count=1452792.0, ans=0.2 2023-06-23 09:25:29,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1452792.0, ans=0.2 2023-06-23 09:25:33,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1452792.0, ans=0.2 2023-06-23 09:25:41,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1452852.0, ans=0.125 2023-06-23 09:25:59,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1452912.0, ans=0.125 2023-06-23 09:26:26,161 INFO [train.py:996] (2/4) Epoch 8, batch 28700, loss[loss=0.2361, simple_loss=0.3161, pruned_loss=0.07809, over 21901.00 frames. 
], tot_loss[loss=0.2419, simple_loss=0.3118, pruned_loss=0.08597, over 4278134.44 frames. ], batch size: 372, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:26:30,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.42 vs. limit=10.0 2023-06-23 09:26:39,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=22.5 2023-06-23 09:26:42,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=22.5 2023-06-23 09:26:57,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1453032.0, ans=0.1 2023-06-23 09:27:18,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1453152.0, ans=0.0 2023-06-23 09:27:55,731 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:28:05,309 INFO [train.py:996] (2/4) Epoch 8, batch 28750, loss[loss=0.2735, simple_loss=0.3413, pruned_loss=0.1028, over 21759.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3124, pruned_loss=0.08655, over 4278711.37 frames. ], batch size: 112, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:28:14,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. limit=8.0 2023-06-23 09:28:23,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1453272.0, ans=10.0 2023-06-23 09:28:24,884 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-23 09:28:41,962 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.805e+02 4.984e+02 6.274e+02 9.092e+02 1.737e+03, threshold=1.255e+03, percent-clipped=10.0 2023-06-23 09:29:49,438 INFO [train.py:996] (2/4) Epoch 8, batch 28800, loss[loss=0.2841, simple_loss=0.3556, pruned_loss=0.1063, over 20703.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3162, pruned_loss=0.08662, over 4279506.45 frames. ], batch size: 608, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:29:55,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1453572.0, ans=0.125 2023-06-23 09:30:54,664 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-23 09:31:29,213 INFO [train.py:996] (2/4) Epoch 8, batch 28850, loss[loss=0.2779, simple_loss=0.3352, pruned_loss=0.1103, over 21546.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3181, pruned_loss=0.08791, over 4276789.25 frames. ], batch size: 471, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:31:50,051 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.39 vs. 
limit=6.0 2023-06-23 09:32:00,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1453992.0, ans=0.0 2023-06-23 09:32:03,477 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.517e+02 4.921e+02 6.393e+02 7.769e+02 1.909e+03, threshold=1.279e+03, percent-clipped=3.0 2023-06-23 09:32:54,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-23 09:33:11,071 INFO [train.py:996] (2/4) Epoch 8, batch 28900, loss[loss=0.2957, simple_loss=0.3677, pruned_loss=0.1118, over 21520.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3218, pruned_loss=0.09047, over 4275202.25 frames. ], batch size: 471, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:33:14,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1454172.0, ans=0.125 2023-06-23 09:33:20,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1454172.0, ans=0.0 2023-06-23 09:34:47,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1454412.0, ans=0.0 2023-06-23 09:34:52,374 INFO [train.py:996] (2/4) Epoch 8, batch 28950, loss[loss=0.2143, simple_loss=0.3108, pruned_loss=0.05889, over 21837.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3265, pruned_loss=0.09077, over 4270982.68 frames. ], batch size: 316, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:35:41,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 4.837e+02 6.969e+02 9.888e+02 2.996e+03, threshold=1.394e+03, percent-clipped=10.0 2023-06-23 09:35:50,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0 2023-06-23 09:36:13,236 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=22.5 2023-06-23 09:36:32,714 INFO [train.py:996] (2/4) Epoch 8, batch 29000, loss[loss=0.2762, simple_loss=0.3412, pruned_loss=0.1056, over 21999.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3301, pruned_loss=0.08998, over 4266953.71 frames. ], batch size: 317, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:36:54,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.53 vs. limit=6.0 2023-06-23 09:37:27,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1454892.0, ans=0.2 2023-06-23 09:37:40,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1454952.0, ans=0.2 2023-06-23 09:37:42,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-23 09:38:10,314 INFO [train.py:996] (2/4) Epoch 8, batch 29050, loss[loss=0.2354, simple_loss=0.3059, pruned_loss=0.08248, over 21526.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3276, pruned_loss=0.09064, over 4275754.33 frames. 
], batch size: 131, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:38:38,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1455132.0, ans=0.125 2023-06-23 09:39:01,855 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.533e+02 4.892e+02 6.390e+02 8.567e+02 1.270e+03, threshold=1.278e+03, percent-clipped=0.0 2023-06-23 09:39:48,117 INFO [train.py:996] (2/4) Epoch 8, batch 29100, loss[loss=0.269, simple_loss=0.307, pruned_loss=0.1155, over 21429.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3189, pruned_loss=0.08881, over 4269745.29 frames. ], batch size: 509, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:40:11,963 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 09:40:41,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1455492.0, ans=0.1 2023-06-23 09:40:42,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1455492.0, ans=0.1 2023-06-23 09:41:26,816 INFO [train.py:996] (2/4) Epoch 8, batch 29150, loss[loss=0.2309, simple_loss=0.2965, pruned_loss=0.08268, over 21257.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3182, pruned_loss=0.08722, over 4271666.75 frames. ], batch size: 159, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:41:27,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1455672.0, ans=0.125 2023-06-23 09:42:14,897 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.362e+02 4.552e+02 5.999e+02 9.339e+02 2.396e+03, threshold=1.200e+03, percent-clipped=6.0 2023-06-23 09:42:21,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1455792.0, ans=0.0 2023-06-23 09:42:21,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1455792.0, ans=0.0 2023-06-23 09:42:39,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1455852.0, ans=0.07 2023-06-23 09:42:40,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1455852.0, ans=0.2 2023-06-23 09:43:01,018 INFO [train.py:996] (2/4) Epoch 8, batch 29200, loss[loss=0.2242, simple_loss=0.2855, pruned_loss=0.08147, over 21557.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3121, pruned_loss=0.08527, over 4270151.27 frames. 
], batch size: 414, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:43:15,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1455972.0, ans=0.2 2023-06-23 09:43:35,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1456032.0, ans=0.0 2023-06-23 09:44:14,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1456152.0, ans=0.125 2023-06-23 09:44:29,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1456212.0, ans=0.2 2023-06-23 09:44:50,161 INFO [train.py:996] (2/4) Epoch 8, batch 29250, loss[loss=0.2076, simple_loss=0.2783, pruned_loss=0.06839, over 21773.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3099, pruned_loss=0.08243, over 4278374.18 frames. ], batch size: 118, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:44:59,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456272.0, ans=0.1 2023-06-23 09:45:13,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0 2023-06-23 09:45:24,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1456332.0, ans=0.125 2023-06-23 09:45:30,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2023-06-23 09:45:31,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1456392.0, ans=0.125 2023-06-23 09:45:33,749 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.540e+02 4.714e+02 6.019e+02 9.609e+02 2.170e+03, threshold=1.204e+03, percent-clipped=18.0 2023-06-23 09:45:35,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1456392.0, ans=0.1 2023-06-23 09:45:57,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-23 09:46:33,647 INFO [train.py:996] (2/4) Epoch 8, batch 29300, loss[loss=0.2575, simple_loss=0.3092, pruned_loss=0.103, over 21345.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3106, pruned_loss=0.08139, over 4270728.91 frames. ], batch size: 507, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:46:34,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-23 09:47:18,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1456692.0, ans=0.0 2023-06-23 09:47:56,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1456812.0, ans=0.1 2023-06-23 09:48:17,853 INFO [train.py:996] (2/4) Epoch 8, batch 29350, loss[loss=0.2211, simple_loss=0.2895, pruned_loss=0.07634, over 21248.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.306, pruned_loss=0.08012, over 4272307.93 frames. 
], batch size: 144, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:48:52,796 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.303e+02 4.822e+02 6.215e+02 9.294e+02 1.604e+03, threshold=1.243e+03, percent-clipped=12.0 2023-06-23 09:48:54,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1456992.0, ans=0.125 2023-06-23 09:48:57,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-23 09:49:11,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0 2023-06-23 09:49:56,821 INFO [train.py:996] (2/4) Epoch 8, batch 29400, loss[loss=0.2073, simple_loss=0.2774, pruned_loss=0.06863, over 21663.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3056, pruned_loss=0.0785, over 4267022.06 frames. ], batch size: 247, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:50:33,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1457292.0, ans=0.5 2023-06-23 09:51:02,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1457352.0, ans=0.125 2023-06-23 09:51:35,725 INFO [train.py:996] (2/4) Epoch 8, batch 29450, loss[loss=0.2959, simple_loss=0.3573, pruned_loss=0.1172, over 21723.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3027, pruned_loss=0.07747, over 4268328.25 frames. ], batch size: 351, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:51:52,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1457532.0, ans=0.0 2023-06-23 09:52:11,400 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.254e+02 6.189e+02 1.171e+03 1.650e+03 2.483e+03, threshold=2.343e+03, percent-clipped=48.0 2023-06-23 09:53:09,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1457712.0, ans=0.125 2023-06-23 09:53:14,223 INFO [train.py:996] (2/4) Epoch 8, batch 29500, loss[loss=0.2196, simple_loss=0.284, pruned_loss=0.0776, over 21340.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3087, pruned_loss=0.08148, over 4270820.50 frames. ], batch size: 176, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:53:42,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1457832.0, ans=0.125 2023-06-23 09:53:58,381 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-23 09:54:07,569 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.49 vs. limit=15.0 2023-06-23 09:54:43,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1458012.0, ans=0.2 2023-06-23 09:54:52,624 INFO [train.py:996] (2/4) Epoch 8, batch 29550, loss[loss=0.2646, simple_loss=0.3352, pruned_loss=0.09697, over 21874.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3086, pruned_loss=0.08336, over 4280071.01 frames. 
], batch size: 107, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 09:55:17,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1458132.0, ans=0.125 2023-06-23 09:55:17,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1458132.0, ans=0.2 2023-06-23 09:55:21,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.72 vs. limit=6.0 2023-06-23 09:55:28,667 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.866e+02 5.921e+02 7.933e+02 1.124e+03 2.184e+03, threshold=1.587e+03, percent-clipped=0.0 2023-06-23 09:55:39,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1458192.0, ans=0.1 2023-06-23 09:56:28,481 INFO [train.py:996] (2/4) Epoch 8, batch 29600, loss[loss=0.2509, simple_loss=0.3397, pruned_loss=0.08106, over 21665.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3148, pruned_loss=0.08564, over 4277292.89 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:56:39,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1458372.0, ans=10.0 2023-06-23 09:56:50,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1458432.0, ans=15.0 2023-06-23 09:58:02,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1458612.0, ans=0.2 2023-06-23 09:58:06,713 INFO [train.py:996] (2/4) Epoch 8, batch 29650, loss[loss=0.2204, simple_loss=0.2944, pruned_loss=0.07318, over 21117.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.312, pruned_loss=0.08234, over 4279813.69 frames. ], batch size: 608, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 09:58:46,216 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.053e+02 5.356e+02 7.008e+02 1.123e+03 3.687e+03, threshold=1.402e+03, percent-clipped=10.0 2023-06-23 09:59:14,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1458852.0, ans=0.125 2023-06-23 09:59:20,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1458852.0, ans=0.0 2023-06-23 09:59:25,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1458852.0, ans=0.125 2023-06-23 09:59:29,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1458912.0, ans=0.2 2023-06-23 09:59:30,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=22.5 2023-06-23 09:59:40,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.62 vs. limit=15.0 2023-06-23 09:59:45,969 INFO [train.py:996] (2/4) Epoch 8, batch 29700, loss[loss=0.2597, simple_loss=0.3723, pruned_loss=0.07352, over 19809.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3121, pruned_loss=0.08225, over 4279969.15 frames. 
], batch size: 702, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 10:00:08,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1459032.0, ans=0.125 2023-06-23 10:00:13,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1459032.0, ans=0.125 2023-06-23 10:00:15,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1459032.0, ans=0.125 2023-06-23 10:01:10,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1459212.0, ans=0.125 2023-06-23 10:01:19,981 INFO [train.py:996] (2/4) Epoch 8, batch 29750, loss[loss=0.256, simple_loss=0.3134, pruned_loss=0.09931, over 21437.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.318, pruned_loss=0.08256, over 4277426.86 frames. ], batch size: 144, lr: 3.62e-03, grad_scale: 32.0 2023-06-23 10:01:22,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.37 vs. limit=10.0 2023-06-23 10:01:39,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1459332.0, ans=0.125 2023-06-23 10:02:05,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.467e+02 4.705e+02 6.606e+02 1.008e+03 2.185e+03, threshold=1.321e+03, percent-clipped=11.0 2023-06-23 10:02:49,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1459512.0, ans=0.125 2023-06-23 10:02:58,144 INFO [train.py:996] (2/4) Epoch 8, batch 29800, loss[loss=0.2548, simple_loss=0.319, pruned_loss=0.09531, over 21366.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3197, pruned_loss=0.08394, over 4289576.49 frames. ], batch size: 159, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 10:03:12,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1459632.0, ans=0.07 2023-06-23 10:03:14,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-23 10:03:57,008 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-23 10:04:30,506 INFO [train.py:996] (2/4) Epoch 8, batch 29850, loss[loss=0.2134, simple_loss=0.2828, pruned_loss=0.07196, over 21830.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.315, pruned_loss=0.08099, over 4292179.34 frames. ], batch size: 247, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 10:04:40,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. 
limit=6.0 2023-06-23 10:05:16,479 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.132e+02 5.162e+02 6.765e+02 9.039e+02 1.623e+03, threshold=1.353e+03, percent-clipped=3.0 2023-06-23 10:05:28,233 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1459992.0, ans=0.125 2023-06-23 10:06:04,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1460112.0, ans=0.125 2023-06-23 10:06:08,505 INFO [train.py:996] (2/4) Epoch 8, batch 29900, loss[loss=0.2814, simple_loss=0.3429, pruned_loss=0.1099, over 21593.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3142, pruned_loss=0.08238, over 4293630.26 frames. ], batch size: 263, lr: 3.62e-03, grad_scale: 16.0 2023-06-23 10:06:47,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1460232.0, ans=0.1 2023-06-23 10:07:16,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1460352.0, ans=0.125 2023-06-23 10:07:42,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1460412.0, ans=0.125 2023-06-23 10:07:48,912 INFO [train.py:996] (2/4) Epoch 8, batch 29950, loss[loss=0.256, simple_loss=0.3317, pruned_loss=0.09009, over 21334.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3183, pruned_loss=0.08672, over 4287481.98 frames. ], batch size: 143, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:08:44,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1460592.0, ans=0.125 2023-06-23 10:08:45,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.472e+02 5.581e+02 7.305e+02 1.013e+03 2.177e+03, threshold=1.461e+03, percent-clipped=7.0 2023-06-23 10:08:50,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1460592.0, ans=0.1 2023-06-23 10:08:58,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1460652.0, ans=0.09899494936611666 2023-06-23 10:09:34,031 INFO [train.py:996] (2/4) Epoch 8, batch 30000, loss[loss=0.2205, simple_loss=0.3208, pruned_loss=0.06009, over 21617.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3202, pruned_loss=0.08713, over 4283829.99 frames. ], batch size: 414, lr: 3.61e-03, grad_scale: 32.0 2023-06-23 10:09:34,032 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 10:09:54,205 INFO [train.py:1028] (2/4) Epoch 8, validation: loss=0.244, simple_loss=0.3443, pruned_loss=0.07188, over 1796401.00 frames. 
2023-06-23 10:09:54,206 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 10:10:28,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1460832.0, ans=0.125 2023-06-23 10:10:37,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1460892.0, ans=0.0 2023-06-23 10:10:37,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1460892.0, ans=0.0 2023-06-23 10:11:46,626 INFO [train.py:996] (2/4) Epoch 8, batch 30050, loss[loss=0.356, simple_loss=0.4466, pruned_loss=0.1328, over 21388.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3245, pruned_loss=0.08372, over 4279385.15 frames. ], batch size: 507, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:12:25,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.441e+02 4.815e+02 7.305e+02 9.705e+02 3.214e+03, threshold=1.461e+03, percent-clipped=9.0 2023-06-23 10:12:26,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1461192.0, ans=0.125 2023-06-23 10:12:37,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1461192.0, ans=0.125 2023-06-23 10:13:15,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1461312.0, ans=0.125 2023-06-23 10:13:26,310 INFO [train.py:996] (2/4) Epoch 8, batch 30100, loss[loss=0.2116, simple_loss=0.2769, pruned_loss=0.07315, over 21489.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3219, pruned_loss=0.08295, over 4280663.79 frames. ], batch size: 212, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:13:37,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1461372.0, ans=0.125 2023-06-23 10:13:44,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1461432.0, ans=0.125 2023-06-23 10:13:55,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1461432.0, ans=0.2 2023-06-23 10:13:55,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1461432.0, ans=0.0 2023-06-23 10:14:51,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1461612.0, ans=0.2 2023-06-23 10:15:05,461 INFO [train.py:996] (2/4) Epoch 8, batch 30150, loss[loss=0.2751, simple_loss=0.3362, pruned_loss=0.107, over 21789.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3182, pruned_loss=0.08469, over 4277579.74 frames. ], batch size: 441, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:15:59,653 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.389e+02 4.498e+02 5.519e+02 7.633e+02 1.440e+03, threshold=1.104e+03, percent-clipped=0.0 2023-06-23 10:16:12,170 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.74 vs. 
limit=12.0 2023-06-23 10:16:29,681 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:16:47,017 INFO [train.py:996] (2/4) Epoch 8, batch 30200, loss[loss=0.2799, simple_loss=0.3682, pruned_loss=0.09586, over 21362.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3212, pruned_loss=0.0833, over 4280959.41 frames. ], batch size: 507, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:16:51,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.81 vs. limit=6.0 2023-06-23 10:18:27,128 INFO [train.py:996] (2/4) Epoch 8, batch 30250, loss[loss=0.3722, simple_loss=0.4485, pruned_loss=0.1479, over 21509.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3286, pruned_loss=0.08558, over 4275399.46 frames. ], batch size: 471, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:18:40,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1462272.0, ans=0.2 2023-06-23 10:19:20,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1462392.0, ans=0.95 2023-06-23 10:19:25,063 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.543e+02 5.739e+02 8.338e+02 1.276e+03 3.132e+03, threshold=1.668e+03, percent-clipped=33.0 2023-06-23 10:19:35,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1462452.0, ans=0.07 2023-06-23 10:19:45,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-23 10:19:48,189 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-23 10:19:57,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1462512.0, ans=0.125 2023-06-23 10:20:00,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1462512.0, ans=0.125 2023-06-23 10:20:10,514 INFO [train.py:996] (2/4) Epoch 8, batch 30300, loss[loss=0.2106, simple_loss=0.2747, pruned_loss=0.07321, over 21397.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3265, pruned_loss=0.08528, over 4277885.80 frames. ], batch size: 389, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:20:10,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1462572.0, ans=0.1 2023-06-23 10:20:17,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1462572.0, ans=0.0 2023-06-23 10:21:11,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1462692.0, ans=0.04949747468305833 2023-06-23 10:21:50,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1462812.0, ans=0.125 2023-06-23 10:22:08,467 INFO [train.py:996] (2/4) Epoch 8, batch 30350, loss[loss=0.2656, simple_loss=0.3488, pruned_loss=0.09117, over 21731.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.326, pruned_loss=0.0869, over 4270083.25 frames. 
], batch size: 298, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:22:14,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1462872.0, ans=0.2 2023-06-23 10:22:18,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-23 10:22:20,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1462872.0, ans=0.1 2023-06-23 10:22:38,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1462992.0, ans=0.0 2023-06-23 10:22:43,063 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.418e+02 5.113e+02 8.159e+02 1.296e+03 2.782e+03, threshold=1.632e+03, percent-clipped=10.0 2023-06-23 10:22:48,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1462992.0, ans=0.0 2023-06-23 10:23:12,302 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-23 10:23:17,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1463112.0, ans=0.125 2023-06-23 10:23:21,507 INFO [train.py:996] (2/4) Epoch 8, batch 30400, loss[loss=0.2038, simple_loss=0.2606, pruned_loss=0.07352, over 20268.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3183, pruned_loss=0.08443, over 4257763.84 frames. ], batch size: 703, lr: 3.61e-03, grad_scale: 32.0 2023-06-23 10:23:36,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1463232.0, ans=0.2 2023-06-23 10:23:42,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1463232.0, ans=0.0 2023-06-23 10:24:45,423 INFO [train.py:996] (2/4) Epoch 8, batch 30450, loss[loss=0.2777, simple_loss=0.4028, pruned_loss=0.07631, over 19830.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3203, pruned_loss=0.08388, over 4198498.28 frames. ], batch size: 702, lr: 3.61e-03, grad_scale: 16.0 2023-06-23 10:24:47,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1463472.0, ans=0.0 2023-06-23 10:25:18,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=15.0 2023-06-23 10:25:24,739 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.983e+02 7.886e+02 1.299e+03 2.180e+03 7.301e+03, threshold=2.598e+03, percent-clipped=35.0 2023-06-23 10:27:25,954 INFO [train.py:996] (2/4) Epoch 9, batch 0, loss[loss=0.2185, simple_loss=0.2897, pruned_loss=0.07363, over 21623.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2897, pruned_loss=0.07363, over 21623.00 frames. ], batch size: 298, lr: 3.39e-03, grad_scale: 32.0 2023-06-23 10:27:25,954 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 10:27:41,477 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2407, simple_loss=0.3498, pruned_loss=0.06579, over 1796401.00 frames. 
2023-06-23 10:27:41,477 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 10:27:52,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1463742.0, ans=0.125 2023-06-23 10:28:35,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1463862.0, ans=0.125 2023-06-23 10:29:21,243 INFO [train.py:996] (2/4) Epoch 9, batch 50, loss[loss=0.235, simple_loss=0.3179, pruned_loss=0.07606, over 21760.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3289, pruned_loss=0.08629, over 972295.33 frames. ], batch size: 247, lr: 3.39e-03, grad_scale: 32.0 2023-06-23 10:29:21,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1464042.0, ans=0.0 2023-06-23 10:29:41,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1464102.0, ans=0.05 2023-06-23 10:30:05,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1464162.0, ans=0.2 2023-06-23 10:30:22,212 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.385e+02 5.823e+02 9.334e+02 1.610e+03 5.016e+03, threshold=1.867e+03, percent-clipped=15.0 2023-06-23 10:30:23,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.72 vs. limit=22.5 2023-06-23 10:30:58,861 INFO [train.py:996] (2/4) Epoch 9, batch 100, loss[loss=0.2687, simple_loss=0.362, pruned_loss=0.08772, over 21638.00 frames. ], tot_loss[loss=0.2605, simple_loss=0.3424, pruned_loss=0.08927, over 1693859.09 frames. ], batch size: 414, lr: 3.39e-03, grad_scale: 32.0 2023-06-23 10:30:59,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-23 10:31:00,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1464342.0, ans=0.0 2023-06-23 10:32:23,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-23 10:32:33,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1464582.0, ans=0.125 2023-06-23 10:32:35,503 INFO [train.py:996] (2/4) Epoch 9, batch 150, loss[loss=0.2301, simple_loss=0.3067, pruned_loss=0.07675, over 21908.00 frames. ], tot_loss[loss=0.2594, simple_loss=0.3418, pruned_loss=0.08857, over 2260138.32 frames. 
], batch size: 316, lr: 3.39e-03, grad_scale: 16.0 2023-06-23 10:33:15,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1464762.0, ans=0.1 2023-06-23 10:33:18,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1464762.0, ans=0.1 2023-06-23 10:33:38,727 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.525e+02 5.241e+02 6.632e+02 9.762e+02 2.000e+03, threshold=1.326e+03, percent-clipped=1.0 2023-06-23 10:34:08,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1464882.0, ans=0.1 2023-06-23 10:34:09,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1464882.0, ans=0.0 2023-06-23 10:34:12,640 INFO [train.py:996] (2/4) Epoch 9, batch 200, loss[loss=0.2274, simple_loss=0.2939, pruned_loss=0.08044, over 21696.00 frames. ], tot_loss[loss=0.2554, simple_loss=0.3366, pruned_loss=0.08714, over 2688724.18 frames. ], batch size: 230, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:34:28,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1465002.0, ans=0.2 2023-06-23 10:34:28,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1465002.0, ans=0.0 2023-06-23 10:34:34,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1465002.0, ans=0.1 2023-06-23 10:34:43,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1465002.0, ans=10.0 2023-06-23 10:34:46,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1465062.0, ans=0.125 2023-06-23 10:35:17,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1465122.0, ans=0.5 2023-06-23 10:35:50,359 INFO [train.py:996] (2/4) Epoch 9, batch 250, loss[loss=0.2381, simple_loss=0.298, pruned_loss=0.08909, over 21502.00 frames. ], tot_loss[loss=0.2544, simple_loss=0.3334, pruned_loss=0.08768, over 3040868.89 frames. ], batch size: 194, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:36:45,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1465362.0, ans=0.1 2023-06-23 10:36:55,168 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 4.793e+02 6.542e+02 9.601e+02 1.948e+03, threshold=1.308e+03, percent-clipped=7.0 2023-06-23 10:37:00,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1465422.0, ans=0.125 2023-06-23 10:37:14,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1465482.0, ans=0.0 2023-06-23 10:37:29,652 INFO [train.py:996] (2/4) Epoch 9, batch 300, loss[loss=0.3135, simple_loss=0.4264, pruned_loss=0.1003, over 19785.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3282, pruned_loss=0.08649, over 3308844.45 frames. 
], batch size: 702, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:38:36,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1465722.0, ans=0.95 2023-06-23 10:38:39,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1465722.0, ans=6.0 2023-06-23 10:38:48,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1465722.0, ans=0.1 2023-06-23 10:39:11,145 INFO [train.py:996] (2/4) Epoch 9, batch 350, loss[loss=0.207, simple_loss=0.2717, pruned_loss=0.07112, over 21213.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3207, pruned_loss=0.08324, over 3515306.48 frames. ], batch size: 144, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:40:01,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.36 vs. limit=10.0 2023-06-23 10:40:16,456 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.272e+02 5.700e+02 8.166e+02 1.374e+03 3.481e+03, threshold=1.633e+03, percent-clipped=26.0 2023-06-23 10:40:33,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1466022.0, ans=0.2 2023-06-23 10:40:41,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1466082.0, ans=0.125 2023-06-23 10:40:52,631 INFO [train.py:996] (2/4) Epoch 9, batch 400, loss[loss=0.1787, simple_loss=0.2444, pruned_loss=0.05653, over 21206.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3131, pruned_loss=0.0819, over 3676841.60 frames. ], batch size: 159, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:41:07,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1466202.0, ans=0.04949747468305833 2023-06-23 10:41:07,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1466202.0, ans=0.0 2023-06-23 10:42:19,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1466382.0, ans=0.125 2023-06-23 10:42:35,009 INFO [train.py:996] (2/4) Epoch 9, batch 450, loss[loss=0.2587, simple_loss=0.3332, pruned_loss=0.09217, over 21981.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3112, pruned_loss=0.08018, over 3812680.16 frames. ], batch size: 113, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:42:39,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1466442.0, ans=0.1 2023-06-23 10:42:43,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.77 vs. limit=10.0 2023-06-23 10:43:40,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.611e+02 7.104e+02 9.718e+02 1.338e+03 3.704e+03, threshold=1.944e+03, percent-clipped=17.0 2023-06-23 10:44:09,225 INFO [train.py:996] (2/4) Epoch 9, batch 500, loss[loss=0.2016, simple_loss=0.2946, pruned_loss=0.05426, over 21691.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.313, pruned_loss=0.07924, over 3921788.93 frames. 
], batch size: 298, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:44:36,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1466802.0, ans=0.2 2023-06-23 10:45:47,791 INFO [train.py:996] (2/4) Epoch 9, batch 550, loss[loss=0.2768, simple_loss=0.3675, pruned_loss=0.09311, over 21761.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3179, pruned_loss=0.07855, over 3999751.32 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:46:10,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1467102.0, ans=0.125 2023-06-23 10:46:53,871 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.317e+02 4.598e+02 6.514e+02 1.038e+03 2.454e+03, threshold=1.303e+03, percent-clipped=6.0 2023-06-23 10:47:22,483 INFO [train.py:996] (2/4) Epoch 9, batch 600, loss[loss=0.2671, simple_loss=0.3236, pruned_loss=0.1053, over 21813.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3188, pruned_loss=0.07917, over 4070425.75 frames. ], batch size: 508, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:47:27,646 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 10:48:12,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1467462.0, ans=0.0 2023-06-23 10:48:59,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1467642.0, ans=0.125 2023-06-23 10:49:00,646 INFO [train.py:996] (2/4) Epoch 9, batch 650, loss[loss=0.2145, simple_loss=0.2702, pruned_loss=0.07939, over 20065.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3198, pruned_loss=0.07993, over 4116370.79 frames. ], batch size: 704, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:49:37,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1467702.0, ans=0.125 2023-06-23 10:50:04,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1467822.0, ans=0.0 2023-06-23 10:50:06,996 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.298e+02 4.773e+02 6.710e+02 1.032e+03 2.196e+03, threshold=1.342e+03, percent-clipped=13.0 2023-06-23 10:50:18,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1467882.0, ans=0.95 2023-06-23 10:50:27,043 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-23 10:50:36,052 INFO [train.py:996] (2/4) Epoch 9, batch 700, loss[loss=0.2561, simple_loss=0.3332, pruned_loss=0.08951, over 21499.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3165, pruned_loss=0.07932, over 4150573.45 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:51:27,922 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.31 vs. limit=10.0 2023-06-23 10:51:33,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. 
limit=15.0 2023-06-23 10:52:09,588 INFO [train.py:996] (2/4) Epoch 9, batch 750, loss[loss=0.2308, simple_loss=0.2955, pruned_loss=0.083, over 21840.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.316, pruned_loss=0.08057, over 4185261.30 frames. ], batch size: 118, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:53:15,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.543e+02 5.132e+02 8.530e+02 1.237e+03 2.839e+03, threshold=1.706e+03, percent-clipped=17.0 2023-06-23 10:53:29,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.83 vs. limit=10.0 2023-06-23 10:53:44,249 INFO [train.py:996] (2/4) Epoch 9, batch 800, loss[loss=0.2304, simple_loss=0.2879, pruned_loss=0.08648, over 21464.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3128, pruned_loss=0.08133, over 4206289.36 frames. ], batch size: 195, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:53:51,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1468542.0, ans=0.0 2023-06-23 10:53:55,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.92 vs. limit=22.5 2023-06-23 10:54:39,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.45 vs. limit=6.0 2023-06-23 10:54:52,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-23 10:55:03,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1468782.0, ans=0.0 2023-06-23 10:55:13,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1468782.0, ans=0.125 2023-06-23 10:55:19,760 INFO [train.py:996] (2/4) Epoch 9, batch 850, loss[loss=0.2731, simple_loss=0.3292, pruned_loss=0.1084, over 21605.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.312, pruned_loss=0.08151, over 4222852.01 frames. ], batch size: 471, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 10:55:57,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1468902.0, ans=0.1 2023-06-23 10:56:10,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1468962.0, ans=0.0 2023-06-23 10:56:25,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1468962.0, ans=0.0 2023-06-23 10:56:30,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1469022.0, ans=0.125 2023-06-23 10:56:31,575 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.626e+02 5.922e+02 9.431e+02 1.406e+03 2.564e+03, threshold=1.886e+03, percent-clipped=15.0 2023-06-23 10:56:44,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1469082.0, ans=0.0 2023-06-23 10:57:05,020 INFO [train.py:996] (2/4) Epoch 9, batch 900, loss[loss=0.2182, simple_loss=0.2713, pruned_loss=0.08254, over 21304.00 frames. 
], tot_loss[loss=0.2368, simple_loss=0.3108, pruned_loss=0.08137, over 4235521.07 frames. ], batch size: 160, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:57:16,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1469142.0, ans=0.125 2023-06-23 10:57:59,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1469262.0, ans=0.0 2023-06-23 10:58:45,615 INFO [train.py:996] (2/4) Epoch 9, batch 950, loss[loss=0.2064, simple_loss=0.2682, pruned_loss=0.07229, over 21328.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3099, pruned_loss=0.08052, over 4248150.57 frames. ], batch size: 143, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 10:59:47,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1469622.0, ans=0.0 2023-06-23 10:59:53,223 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.620e+02 5.465e+02 8.299e+02 1.252e+03 2.692e+03, threshold=1.660e+03, percent-clipped=4.0 2023-06-23 11:00:04,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1469682.0, ans=0.2 2023-06-23 11:00:25,500 INFO [train.py:996] (2/4) Epoch 9, batch 1000, loss[loss=0.2196, simple_loss=0.2904, pruned_loss=0.07441, over 21377.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3104, pruned_loss=0.08117, over 4258135.22 frames. ], batch size: 159, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:01:01,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1469802.0, ans=0.125 2023-06-23 11:01:19,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1469862.0, ans=0.0 2023-06-23 11:02:11,531 INFO [train.py:996] (2/4) Epoch 9, batch 1050, loss[loss=0.237, simple_loss=0.3006, pruned_loss=0.08675, over 21642.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3105, pruned_loss=0.08123, over 4264343.73 frames. ], batch size: 263, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:03:15,043 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.334e+02 4.887e+02 6.781e+02 8.518e+02 2.404e+03, threshold=1.356e+03, percent-clipped=1.0 2023-06-23 11:03:58,862 INFO [train.py:996] (2/4) Epoch 9, batch 1100, loss[loss=0.2561, simple_loss=0.3423, pruned_loss=0.08499, over 21693.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3119, pruned_loss=0.08095, over 4268376.58 frames. ], batch size: 351, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:04:00,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.58 vs. limit=15.0 2023-06-23 11:04:02,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1470342.0, ans=0.125 2023-06-23 11:04:41,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1470462.0, ans=0.0 2023-06-23 11:04:46,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1470462.0, ans=0.125 2023-06-23 11:04:59,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.42 vs. 
limit=15.0 2023-06-23 11:05:43,342 INFO [train.py:996] (2/4) Epoch 9, batch 1150, loss[loss=0.1675, simple_loss=0.2534, pruned_loss=0.04085, over 21416.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3122, pruned_loss=0.08133, over 4276534.78 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:06:43,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.848e+02 5.352e+02 7.597e+02 1.030e+03 2.056e+03, threshold=1.519e+03, percent-clipped=12.0 2023-06-23 11:07:02,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1470822.0, ans=0.1 2023-06-23 11:07:25,427 INFO [train.py:996] (2/4) Epoch 9, batch 1200, loss[loss=0.2096, simple_loss=0.2881, pruned_loss=0.06554, over 21583.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3142, pruned_loss=0.08249, over 4279142.77 frames. ], batch size: 230, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:07:29,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1470942.0, ans=0.0 2023-06-23 11:09:01,910 INFO [train.py:996] (2/4) Epoch 9, batch 1250, loss[loss=0.2228, simple_loss=0.2976, pruned_loss=0.07395, over 21260.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3146, pruned_loss=0.08228, over 4284882.52 frames. ], batch size: 159, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:09:22,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1471302.0, ans=0.0 2023-06-23 11:09:31,990 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-23 11:10:01,577 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.789e+02 4.897e+02 6.693e+02 9.449e+02 1.847e+03, threshold=1.339e+03, percent-clipped=0.0 2023-06-23 11:10:41,428 INFO [train.py:996] (2/4) Epoch 9, batch 1300, loss[loss=0.2542, simple_loss=0.3378, pruned_loss=0.08534, over 21833.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3154, pruned_loss=0.08235, over 4286838.21 frames. ], batch size: 316, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:10:41,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1471542.0, ans=0.125 2023-06-23 11:11:05,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1471602.0, ans=0.5 2023-06-23 11:11:34,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1471722.0, ans=0.2 2023-06-23 11:11:42,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1471722.0, ans=0.125 2023-06-23 11:12:14,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1471782.0, ans=0.125 2023-06-23 11:12:15,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1471842.0, ans=0.2 2023-06-23 11:12:16,939 INFO [train.py:996] (2/4) Epoch 9, batch 1350, loss[loss=0.2463, simple_loss=0.3167, pruned_loss=0.088, over 21882.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3152, pruned_loss=0.08247, over 4283548.19 frames. 
], batch size: 391, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:12:21,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-23 11:12:27,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1471842.0, ans=0.125 2023-06-23 11:13:02,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1471962.0, ans=0.125 2023-06-23 11:13:16,509 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.586e+02 4.783e+02 6.688e+02 9.049e+02 1.938e+03, threshold=1.338e+03, percent-clipped=9.0 2023-06-23 11:13:39,666 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.12 vs. limit=15.0 2023-06-23 11:13:57,400 INFO [train.py:996] (2/4) Epoch 9, batch 1400, loss[loss=0.1958, simple_loss=0.2843, pruned_loss=0.0537, over 21506.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3143, pruned_loss=0.08307, over 4291274.25 frames. ], batch size: 211, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:14:04,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1472142.0, ans=0.2 2023-06-23 11:14:23,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.24 vs. limit=6.0 2023-06-23 11:14:36,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.77 vs. limit=22.5 2023-06-23 11:15:35,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-06-23 11:15:39,586 INFO [train.py:996] (2/4) Epoch 9, batch 1450, loss[loss=0.2904, simple_loss=0.353, pruned_loss=0.1139, over 16207.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3151, pruned_loss=0.0841, over 4292248.78 frames. ], batch size: 61, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:16:44,339 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.412e+02 5.469e+02 7.594e+02 1.040e+03 1.854e+03, threshold=1.519e+03, percent-clipped=12.0 2023-06-23 11:17:20,409 INFO [train.py:996] (2/4) Epoch 9, batch 1500, loss[loss=0.2185, simple_loss=0.2886, pruned_loss=0.07415, over 21817.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3149, pruned_loss=0.08505, over 4295929.67 frames. ], batch size: 247, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:18:30,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1472922.0, ans=0.0 2023-06-23 11:19:02,987 INFO [train.py:996] (2/4) Epoch 9, batch 1550, loss[loss=0.2103, simple_loss=0.2959, pruned_loss=0.06232, over 21662.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3125, pruned_loss=0.08323, over 4291458.30 frames. ], batch size: 247, lr: 3.38e-03, grad_scale: 16.0 2023-06-23 11:19:15,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.68 vs. 
limit=12.0 2023-06-23 11:20:14,282 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.435e+02 5.299e+02 6.765e+02 1.096e+03 1.841e+03, threshold=1.353e+03, percent-clipped=3.0 2023-06-23 11:20:18,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1473222.0, ans=0.2 2023-06-23 11:20:40,610 INFO [train.py:996] (2/4) Epoch 9, batch 1600, loss[loss=0.2552, simple_loss=0.3452, pruned_loss=0.08262, over 21815.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3107, pruned_loss=0.0824, over 4286991.72 frames. ], batch size: 371, lr: 3.38e-03, grad_scale: 32.0 2023-06-23 11:21:34,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1473462.0, ans=0.125 2023-06-23 11:22:23,015 INFO [train.py:996] (2/4) Epoch 9, batch 1650, loss[loss=0.3039, simple_loss=0.3636, pruned_loss=0.1221, over 21749.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3109, pruned_loss=0.08279, over 4288633.53 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:22:23,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1473642.0, ans=0.1 2023-06-23 11:22:25,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-23 11:22:41,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1473642.0, ans=0.2 2023-06-23 11:22:47,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1473702.0, ans=0.125 2023-06-23 11:22:49,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1473702.0, ans=0.125 2023-06-23 11:23:41,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 5.690e+02 7.589e+02 1.047e+03 2.202e+03, threshold=1.518e+03, percent-clipped=10.0 2023-06-23 11:23:41,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1473822.0, ans=0.125 2023-06-23 11:24:04,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.38 vs. limit=15.0 2023-06-23 11:24:05,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1473942.0, ans=0.125 2023-06-23 11:24:06,461 INFO [train.py:996] (2/4) Epoch 9, batch 1700, loss[loss=0.2673, simple_loss=0.3265, pruned_loss=0.104, over 21804.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3158, pruned_loss=0.08507, over 4289194.93 frames. 
], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:24:24,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1473942.0, ans=0.125 2023-06-23 11:24:24,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1473942.0, ans=0.0 2023-06-23 11:24:24,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1473942.0, ans=0.0 2023-06-23 11:24:56,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.01 vs. limit=12.0 2023-06-23 11:25:05,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=22.5 2023-06-23 11:25:12,765 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.43 vs. limit=22.5 2023-06-23 11:25:54,625 INFO [train.py:996] (2/4) Epoch 9, batch 1750, loss[loss=0.2308, simple_loss=0.2949, pruned_loss=0.08339, over 21697.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3149, pruned_loss=0.08308, over 4280931.39 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:25:57,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-23 11:26:02,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1474242.0, ans=0.125 2023-06-23 11:26:45,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1474362.0, ans=0.07 2023-06-23 11:26:48,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1474362.0, ans=0.5 2023-06-23 11:26:49,119 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-23 11:27:13,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.110e+02 6.517e+02 8.827e+02 1.421e+03 2.550e+03, threshold=1.765e+03, percent-clipped=23.0 2023-06-23 11:27:20,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1474482.0, ans=0.125 2023-06-23 11:27:37,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1474542.0, ans=0.2 2023-06-23 11:27:43,580 INFO [train.py:996] (2/4) Epoch 9, batch 1800, loss[loss=0.236, simple_loss=0.3344, pruned_loss=0.0688, over 21687.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3136, pruned_loss=0.0808, over 4279784.80 frames. ], batch size: 298, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:27:59,518 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.68 vs. limit=15.0 2023-06-23 11:28:05,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. 
limit=15.0 2023-06-23 11:28:31,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1474662.0, ans=0.125 2023-06-23 11:28:35,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=12.0 2023-06-23 11:28:54,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1474722.0, ans=0.2 2023-06-23 11:29:04,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1474782.0, ans=0.1 2023-06-23 11:29:10,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1474782.0, ans=0.0 2023-06-23 11:29:18,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1474842.0, ans=0.1 2023-06-23 11:29:25,346 INFO [train.py:996] (2/4) Epoch 9, batch 1850, loss[loss=0.2624, simple_loss=0.3342, pruned_loss=0.09527, over 20696.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3154, pruned_loss=0.07925, over 4274778.52 frames. ], batch size: 607, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:29:50,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1474902.0, ans=0.125 2023-06-23 11:29:52,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.51 vs. limit=10.0 2023-06-23 11:29:56,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1474902.0, ans=0.0 2023-06-23 11:30:10,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1474962.0, ans=0.125 2023-06-23 11:30:13,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1474962.0, ans=0.0 2023-06-23 11:30:28,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-23 11:30:37,024 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.456e+02 5.587e+02 8.108e+02 1.184e+03 2.810e+03, threshold=1.622e+03, percent-clipped=5.0 2023-06-23 11:30:45,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1475082.0, ans=0.125 2023-06-23 11:30:45,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1475082.0, ans=0.125 2023-06-23 11:30:50,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1475082.0, ans=0.2 2023-06-23 11:30:56,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1475082.0, ans=0.025 2023-06-23 11:31:11,665 INFO [train.py:996] (2/4) Epoch 9, batch 1900, loss[loss=0.2347, simple_loss=0.3215, pruned_loss=0.07392, over 21402.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3153, pruned_loss=0.07977, over 4271816.40 frames. 
], batch size: 211, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:31:38,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1475202.0, ans=0.1 2023-06-23 11:31:41,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1475202.0, ans=0.05 2023-06-23 11:31:43,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1475202.0, ans=0.125 2023-06-23 11:32:02,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1475262.0, ans=0.0 2023-06-23 11:32:18,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1475322.0, ans=0.035 2023-06-23 11:32:21,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1475322.0, ans=0.125 2023-06-23 11:32:58,394 INFO [train.py:996] (2/4) Epoch 9, batch 1950, loss[loss=0.2064, simple_loss=0.28, pruned_loss=0.06643, over 21688.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3105, pruned_loss=0.07816, over 4275884.58 frames. ], batch size: 333, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:34:00,317 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.571e+02 6.151e+02 9.472e+02 1.342e+03 2.834e+03, threshold=1.894e+03, percent-clipped=13.0 2023-06-23 11:34:04,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1475622.0, ans=0.0 2023-06-23 11:34:11,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-23 11:34:33,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1475682.0, ans=0.125 2023-06-23 11:34:36,577 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.45 vs. limit=15.0 2023-06-23 11:34:40,441 INFO [train.py:996] (2/4) Epoch 9, batch 2000, loss[loss=0.2089, simple_loss=0.2844, pruned_loss=0.06665, over 21628.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3067, pruned_loss=0.07736, over 4278312.47 frames. ], batch size: 263, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:34:40,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1475742.0, ans=0.0 2023-06-23 11:34:42,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=15.0 2023-06-23 11:36:16,262 INFO [train.py:996] (2/4) Epoch 9, batch 2050, loss[loss=0.2041, simple_loss=0.2755, pruned_loss=0.06642, over 21333.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3059, pruned_loss=0.07755, over 4278784.71 frames. 
], batch size: 176, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:36:24,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1476042.0, ans=0.125 2023-06-23 11:37:17,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.522e+02 6.873e+02 9.848e+02 2.030e+03, threshold=1.375e+03, percent-clipped=1.0 2023-06-23 11:37:18,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.16 vs. limit=15.0 2023-06-23 11:37:56,776 INFO [train.py:996] (2/4) Epoch 9, batch 2100, loss[loss=0.2455, simple_loss=0.3119, pruned_loss=0.0895, over 21744.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3118, pruned_loss=0.08021, over 4282091.93 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:38:11,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1476402.0, ans=0.125 2023-06-23 11:38:27,407 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.37 vs. limit=15.0 2023-06-23 11:38:55,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.46 vs. limit=22.5 2023-06-23 11:39:20,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1476582.0, ans=0.0 2023-06-23 11:39:27,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1476582.0, ans=0.1 2023-06-23 11:39:35,639 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:39:39,879 INFO [train.py:996] (2/4) Epoch 9, batch 2150, loss[loss=0.2358, simple_loss=0.3143, pruned_loss=0.0786, over 21169.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3113, pruned_loss=0.08015, over 4282283.21 frames. ], batch size: 159, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:39:40,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1476642.0, ans=0.0 2023-06-23 11:39:43,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1476642.0, ans=0.0 2023-06-23 11:40:36,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1476822.0, ans=0.1 2023-06-23 11:40:42,321 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.083e+02 6.104e+02 8.851e+02 1.376e+03 2.645e+03, threshold=1.770e+03, percent-clipped=25.0 2023-06-23 11:41:21,788 INFO [train.py:996] (2/4) Epoch 9, batch 2200, loss[loss=0.2196, simple_loss=0.313, pruned_loss=0.06312, over 21861.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3119, pruned_loss=0.08035, over 4286340.94 frames. 
], batch size: 371, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:41:41,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1477002.0, ans=0.0 2023-06-23 11:41:51,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1477002.0, ans=0.1 2023-06-23 11:43:02,282 INFO [train.py:996] (2/4) Epoch 9, batch 2250, loss[loss=0.1954, simple_loss=0.2613, pruned_loss=0.06476, over 14738.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3094, pruned_loss=0.07835, over 4270971.06 frames. ], batch size: 61, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:43:15,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1477242.0, ans=0.125 2023-06-23 11:44:09,156 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.634e+02 5.493e+02 8.283e+02 1.333e+03 2.509e+03, threshold=1.657e+03, percent-clipped=6.0 2023-06-23 11:44:25,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1477482.0, ans=0.0 2023-06-23 11:44:37,504 INFO [train.py:996] (2/4) Epoch 9, batch 2300, loss[loss=0.2403, simple_loss=0.297, pruned_loss=0.09176, over 21428.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3069, pruned_loss=0.07893, over 4259119.04 frames. ], batch size: 389, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:46:13,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1477782.0, ans=0.0 2023-06-23 11:46:18,219 INFO [train.py:996] (2/4) Epoch 9, batch 2350, loss[loss=0.2037, simple_loss=0.272, pruned_loss=0.06769, over 21743.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3029, pruned_loss=0.07839, over 4260223.83 frames. 
], batch size: 112, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:46:20,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1477842.0, ans=0.125 2023-06-23 11:46:29,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1477842.0, ans=0.125 2023-06-23 11:46:36,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1477842.0, ans=0.0 2023-06-23 11:46:38,418 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 11:47:00,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1477962.0, ans=0.0 2023-06-23 11:47:01,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1477962.0, ans=0.1 2023-06-23 11:47:18,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1478022.0, ans=0.2 2023-06-23 11:47:36,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.626e+02 5.310e+02 7.234e+02 1.027e+03 2.720e+03, threshold=1.447e+03, percent-clipped=6.0 2023-06-23 11:47:53,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1478082.0, ans=0.0 2023-06-23 11:48:06,154 INFO [train.py:996] (2/4) Epoch 9, batch 2400, loss[loss=0.2235, simple_loss=0.2892, pruned_loss=0.07891, over 21690.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3074, pruned_loss=0.08125, over 4266162.47 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:48:27,315 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.97 vs. limit=15.0 2023-06-23 11:49:39,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1478382.0, ans=0.04949747468305833 2023-06-23 11:49:39,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.59 vs. limit=15.0 2023-06-23 11:49:43,571 INFO [train.py:996] (2/4) Epoch 9, batch 2450, loss[loss=0.2181, simple_loss=0.2732, pruned_loss=0.08148, over 21466.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3119, pruned_loss=0.08411, over 4274967.28 frames. ], batch size: 212, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:49:49,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1478442.0, ans=0.0 2023-06-23 11:50:55,446 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.454e+02 8.636e+02 1.143e+03 3.101e+03, threshold=1.727e+03, percent-clipped=10.0 2023-06-23 11:50:58,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-23 11:51:24,310 INFO [train.py:996] (2/4) Epoch 9, batch 2500, loss[loss=0.2371, simple_loss=0.3306, pruned_loss=0.07182, over 21683.00 frames. ], tot_loss[loss=0.24, simple_loss=0.312, pruned_loss=0.08405, over 4263268.53 frames. 
], batch size: 298, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 11:51:52,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1478802.0, ans=0.125 2023-06-23 11:51:58,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1478862.0, ans=0.1 2023-06-23 11:52:05,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1478862.0, ans=0.2 2023-06-23 11:52:10,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1478862.0, ans=0.125 2023-06-23 11:52:48,974 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-23 11:53:05,810 INFO [train.py:996] (2/4) Epoch 9, batch 2550, loss[loss=0.2051, simple_loss=0.3015, pruned_loss=0.05436, over 21605.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3099, pruned_loss=0.08149, over 4255878.21 frames. ], batch size: 230, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:53:06,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1479042.0, ans=0.125 2023-06-23 11:53:12,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1479042.0, ans=0.0 2023-06-23 11:53:20,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1479102.0, ans=0.125 2023-06-23 11:53:37,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1479162.0, ans=15.0 2023-06-23 11:53:48,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1479162.0, ans=0.0 2023-06-23 11:54:19,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.829e+02 7.204e+02 9.571e+02 1.455e+03 2.660e+03, threshold=1.914e+03, percent-clipped=10.0 2023-06-23 11:54:37,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1479282.0, ans=0.125 2023-06-23 11:54:46,747 INFO [train.py:996] (2/4) Epoch 9, batch 2600, loss[loss=0.244, simple_loss=0.317, pruned_loss=0.08551, over 21575.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3124, pruned_loss=0.08245, over 4260349.54 frames. ], batch size: 415, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:55:03,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1479402.0, ans=0.125 2023-06-23 11:55:26,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1479462.0, ans=0.125 2023-06-23 11:55:33,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1479522.0, ans=0.125 2023-06-23 11:56:01,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1479522.0, ans=0.125 2023-06-23 11:56:28,271 INFO [train.py:996] (2/4) Epoch 9, batch 2650, loss[loss=0.2466, simple_loss=0.3291, pruned_loss=0.08209, over 21776.00 frames. 
], tot_loss[loss=0.2404, simple_loss=0.3127, pruned_loss=0.08404, over 4270634.58 frames. ], batch size: 332, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:56:28,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1479642.0, ans=0.0 2023-06-23 11:56:39,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-23 11:56:44,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1479702.0, ans=0.2 2023-06-23 11:56:55,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=22.5 2023-06-23 11:57:08,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1479762.0, ans=0.1 2023-06-23 11:57:37,854 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.783e+02 6.164e+02 7.850e+02 1.193e+03 2.220e+03, threshold=1.570e+03, percent-clipped=3.0 2023-06-23 11:58:05,282 INFO [train.py:996] (2/4) Epoch 9, batch 2700, loss[loss=0.1636, simple_loss=0.2279, pruned_loss=0.04963, over 21345.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3096, pruned_loss=0.08249, over 4269481.99 frames. ], batch size: 159, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:58:41,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1480062.0, ans=0.1 2023-06-23 11:59:19,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1480122.0, ans=0.125 2023-06-23 11:59:23,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1480182.0, ans=0.0 2023-06-23 11:59:25,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1480182.0, ans=0.125 2023-06-23 11:59:43,039 INFO [train.py:996] (2/4) Epoch 9, batch 2750, loss[loss=0.2304, simple_loss=0.3053, pruned_loss=0.0777, over 21249.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3088, pruned_loss=0.08197, over 4270618.51 frames. ], batch size: 143, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 11:59:43,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1480242.0, ans=0.05 2023-06-23 11:59:45,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1480242.0, ans=0.2 2023-06-23 11:59:45,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1480242.0, ans=0.0 2023-06-23 11:59:54,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1480242.0, ans=0.125 2023-06-23 11:59:55,390 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=15.0 2023-06-23 12:00:03,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1480302.0, ans=0.2 2023-06-23 12:00:07,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1480302.0, ans=0.0 2023-06-23 12:00:10,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1480302.0, ans=0.09899494936611666 2023-06-23 12:00:15,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1480362.0, ans=0.125 2023-06-23 12:00:18,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1480362.0, ans=0.125 2023-06-23 12:00:58,180 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.035e+02 5.584e+02 7.738e+02 1.130e+03 2.409e+03, threshold=1.548e+03, percent-clipped=8.0 2023-06-23 12:01:22,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1480482.0, ans=0.125 2023-06-23 12:01:27,090 INFO [train.py:996] (2/4) Epoch 9, batch 2800, loss[loss=0.2899, simple_loss=0.3577, pruned_loss=0.111, over 21766.00 frames. ], tot_loss[loss=0.241, simple_loss=0.314, pruned_loss=0.08401, over 4273639.17 frames. ], batch size: 282, lr: 3.37e-03, grad_scale: 32.0 2023-06-23 12:01:35,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1480542.0, ans=0.125 2023-06-23 12:01:44,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1480602.0, ans=0.125 2023-06-23 12:01:54,696 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=22.5 2023-06-23 12:02:32,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1480722.0, ans=0.125 2023-06-23 12:03:09,783 INFO [train.py:996] (2/4) Epoch 9, batch 2850, loss[loss=0.1673, simple_loss=0.2212, pruned_loss=0.05668, over 21251.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3137, pruned_loss=0.08435, over 4267125.50 frames. ], batch size: 143, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:04:01,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=12.0 2023-06-23 12:04:25,143 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.732e+02 5.955e+02 8.768e+02 1.383e+03 2.997e+03, threshold=1.754e+03, percent-clipped=21.0 2023-06-23 12:04:50,670 INFO [train.py:996] (2/4) Epoch 9, batch 2900, loss[loss=0.258, simple_loss=0.3167, pruned_loss=0.09966, over 21736.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3137, pruned_loss=0.08438, over 4268789.04 frames. 
], batch size: 473, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:05:15,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1481202.0, ans=0.2 2023-06-23 12:06:19,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1481382.0, ans=0.1 2023-06-23 12:06:25,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1481382.0, ans=0.125 2023-06-23 12:06:31,729 INFO [train.py:996] (2/4) Epoch 9, batch 2950, loss[loss=0.2398, simple_loss=0.3121, pruned_loss=0.08377, over 21418.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3163, pruned_loss=0.08456, over 4274925.92 frames. ], batch size: 144, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:06:40,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1481442.0, ans=0.0 2023-06-23 12:06:40,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1481442.0, ans=0.125 2023-06-23 12:06:43,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1481442.0, ans=0.05 2023-06-23 12:06:45,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1481442.0, ans=0.125 2023-06-23 12:06:45,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1481442.0, ans=0.2 2023-06-23 12:07:15,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1481562.0, ans=10.0 2023-06-23 12:07:24,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1481562.0, ans=0.1 2023-06-23 12:07:48,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.698e+02 5.526e+02 7.206e+02 1.005e+03 1.804e+03, threshold=1.441e+03, percent-clipped=1.0 2023-06-23 12:08:08,688 INFO [train.py:996] (2/4) Epoch 9, batch 3000, loss[loss=0.2649, simple_loss=0.3456, pruned_loss=0.09213, over 21444.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3206, pruned_loss=0.0859, over 4282540.77 frames. ], batch size: 131, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:08:08,689 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 12:08:22,882 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.4221, 2.9061, 1.3562, 1.4062], device='cuda:2') 2023-06-23 12:08:24,858 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2522, simple_loss=0.3459, pruned_loss=0.07924, over 1796401.00 frames. 
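The "Computing validation loss ... validation: loss=..., over 1796401.00 frames" entries above report a frame-weighted average of the losses over the whole dev set. The following is only an illustrative sketch of that bookkeeping, not the icefall implementation; the names `dev_batches`, `compute_loss`, and the returned loss keys are hypothetical stand-ins for whatever the training script actually uses.

```python
# Sketch (assumed interface, not icefall's code): accumulate per-batch loss
# sums together with the number of acoustic frames they cover, then divide
# once at the end to get the frame-weighted averages seen in the log.

def validate(dev_batches, compute_loss):
    """Return frame-weighted average losses over the whole dev set."""
    totals = {"loss": 0.0, "simple_loss": 0.0, "pruned_loss": 0.0}
    total_frames = 0
    for batch in dev_batches:
        # compute_loss is assumed to return per-batch loss *sums* plus the
        # number of frames those sums were computed over.
        losses, num_frames = compute_loss(batch)
        for key in totals:
            totals[key] += losses[key]
        total_frames += num_frames
    averages = {key: value / total_frames for key, value in totals.items()}
    return averages, total_frames


if __name__ == "__main__":
    # Toy stand-in for a dev dataloader: (per-batch loss sums, frame count).
    fake_batches = [
        ({"loss": 50.0, "simple_loss": 69.0, "pruned_loss": 16.0}, 200),
        ({"loss": 26.0, "simple_loss": 35.0, "pruned_loss": 8.0}, 100),
    ]
    avg, frames = validate(fake_batches, lambda b: b)
    print(f"validation: loss={avg['loss']:.4f}, "
          f"simple_loss={avg['simple_loss']:.4f}, "
          f"pruned_loss={avg['pruned_loss']:.4f}, over {frames} frames")
```

Run standalone, the toy example prints a line in the same shape as the validation entries in this log; in the real run the averages are computed over the full dev set (here 1796401 frames) rather than over two fake batches.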
2023-06-23 12:08:24,859 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 12:08:25,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1481742.0, ans=0.125 2023-06-23 12:08:25,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1481742.0, ans=0.125 2023-06-23 12:08:27,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1481742.0, ans=0.125 2023-06-23 12:10:09,600 INFO [train.py:996] (2/4) Epoch 9, batch 3050, loss[loss=0.2195, simple_loss=0.2837, pruned_loss=0.07765, over 21782.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3192, pruned_loss=0.08347, over 4278303.90 frames. ], batch size: 112, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:10:25,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1482042.0, ans=0.025 2023-06-23 12:10:47,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1482102.0, ans=0.125 2023-06-23 12:11:07,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.29 vs. limit=10.0 2023-06-23 12:11:15,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1482222.0, ans=0.0 2023-06-23 12:11:24,317 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.590e+02 6.164e+02 8.184e+02 1.174e+03 2.237e+03, threshold=1.637e+03, percent-clipped=13.0 2023-06-23 12:11:40,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1482282.0, ans=0.125 2023-06-23 12:11:44,958 INFO [train.py:996] (2/4) Epoch 9, batch 3100, loss[loss=0.2436, simple_loss=0.3369, pruned_loss=0.0752, over 21602.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3193, pruned_loss=0.08293, over 4270759.10 frames. ], batch size: 441, lr: 3.37e-03, grad_scale: 16.0 2023-06-23 12:12:39,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.57 vs. limit=22.5 2023-06-23 12:12:41,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-23 12:12:55,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1482522.0, ans=0.0 2023-06-23 12:13:06,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1482522.0, ans=0.125 2023-06-23 12:13:36,645 INFO [train.py:996] (2/4) Epoch 9, batch 3150, loss[loss=0.2975, simple_loss=0.3588, pruned_loss=0.1181, over 21327.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.322, pruned_loss=0.08375, over 4264489.15 frames. 
], batch size: 143, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:13:55,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1482642.0, ans=0.04949747468305833 2023-06-23 12:14:03,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1482702.0, ans=0.1 2023-06-23 12:14:13,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1482702.0, ans=15.0 2023-06-23 12:14:47,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.159e+02 5.985e+02 8.535e+02 1.297e+03 2.485e+03, threshold=1.707e+03, percent-clipped=14.0 2023-06-23 12:14:56,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1482882.0, ans=0.125 2023-06-23 12:15:24,468 INFO [train.py:996] (2/4) Epoch 9, batch 3200, loss[loss=0.2049, simple_loss=0.2794, pruned_loss=0.06521, over 21219.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3212, pruned_loss=0.08352, over 4261690.61 frames. ], batch size: 159, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:15:53,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1483002.0, ans=0.125 2023-06-23 12:15:59,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=15.0 2023-06-23 12:16:03,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1483062.0, ans=0.125 2023-06-23 12:16:11,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1483062.0, ans=0.0 2023-06-23 12:16:13,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-23 12:16:47,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1483182.0, ans=0.125 2023-06-23 12:16:59,988 INFO [train.py:996] (2/4) Epoch 9, batch 3250, loss[loss=0.2273, simple_loss=0.2986, pruned_loss=0.078, over 21877.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3217, pruned_loss=0.08554, over 4267355.19 frames. 
], batch size: 372, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:17:02,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1483242.0, ans=0.0 2023-06-23 12:17:08,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1483242.0, ans=0.1 2023-06-23 12:17:09,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1483242.0, ans=0.0 2023-06-23 12:17:23,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1483302.0, ans=0.1 2023-06-23 12:17:46,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1483362.0, ans=0.125 2023-06-23 12:17:56,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1483422.0, ans=0.2 2023-06-23 12:18:08,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1483422.0, ans=0.2 2023-06-23 12:18:19,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.506e+02 4.912e+02 6.771e+02 1.025e+03 2.208e+03, threshold=1.354e+03, percent-clipped=1.0 2023-06-23 12:18:44,453 INFO [train.py:996] (2/4) Epoch 9, batch 3300, loss[loss=0.2577, simple_loss=0.3417, pruned_loss=0.08688, over 21561.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3174, pruned_loss=0.08554, over 4267927.50 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:18:49,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1483542.0, ans=0.1 2023-06-23 12:18:56,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1483542.0, ans=0.035 2023-06-23 12:19:48,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1483722.0, ans=0.2 2023-06-23 12:20:25,115 INFO [train.py:996] (2/4) Epoch 9, batch 3350, loss[loss=0.2655, simple_loss=0.3305, pruned_loss=0.1003, over 21356.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3188, pruned_loss=0.08496, over 4262170.73 frames. ], batch size: 176, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:20:26,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1483842.0, ans=0.0 2023-06-23 12:20:28,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1483842.0, ans=0.125 2023-06-23 12:20:44,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1483902.0, ans=0.1 2023-06-23 12:21:40,404 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.733e+02 6.031e+02 9.696e+02 1.341e+03 2.502e+03, threshold=1.939e+03, percent-clipped=21.0 2023-06-23 12:22:04,013 INFO [train.py:996] (2/4) Epoch 9, batch 3400, loss[loss=0.2897, simple_loss=0.3602, pruned_loss=0.1096, over 21482.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3184, pruned_loss=0.08579, over 4277006.28 frames. 
], batch size: 507, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:22:09,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=15.0 2023-06-23 12:22:31,349 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:23:43,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1484442.0, ans=0.125 2023-06-23 12:23:44,220 INFO [train.py:996] (2/4) Epoch 9, batch 3450, loss[loss=0.2786, simple_loss=0.3443, pruned_loss=0.1064, over 21862.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3123, pruned_loss=0.08457, over 4277446.54 frames. ], batch size: 372, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:23:51,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1484442.0, ans=0.0 2023-06-23 12:24:12,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1484502.0, ans=0.125 2023-06-23 12:24:26,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.35 vs. limit=15.0 2023-06-23 12:24:54,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1484622.0, ans=0.2 2023-06-23 12:25:01,097 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.83 vs. limit=10.0 2023-06-23 12:25:01,672 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.990e+02 5.635e+02 8.079e+02 1.246e+03 2.546e+03, threshold=1.616e+03, percent-clipped=4.0 2023-06-23 12:25:19,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1484742.0, ans=0.0 2023-06-23 12:25:21,096 INFO [train.py:996] (2/4) Epoch 9, batch 3500, loss[loss=0.3002, simple_loss=0.3755, pruned_loss=0.1125, over 21728.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.321, pruned_loss=0.08832, over 4281164.24 frames. ], batch size: 441, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:25:54,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1484802.0, ans=0.2 2023-06-23 12:25:54,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.08 vs. limit=15.0 2023-06-23 12:25:57,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1484862.0, ans=0.0 2023-06-23 12:26:41,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1484982.0, ans=0.125 2023-06-23 12:26:55,736 INFO [train.py:996] (2/4) Epoch 9, batch 3550, loss[loss=0.201, simple_loss=0.291, pruned_loss=0.05548, over 21008.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3245, pruned_loss=0.09017, over 4274210.62 frames. 
], batch size: 608, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:27:39,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1485162.0, ans=0.125 2023-06-23 12:28:07,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1485222.0, ans=0.125 2023-06-23 12:28:10,183 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.810e+02 5.500e+02 7.334e+02 1.032e+03 1.807e+03, threshold=1.467e+03, percent-clipped=3.0 2023-06-23 12:28:15,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1485282.0, ans=0.125 2023-06-23 12:28:15,688 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:28:29,464 INFO [train.py:996] (2/4) Epoch 9, batch 3600, loss[loss=0.2605, simple_loss=0.3208, pruned_loss=0.1001, over 21253.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.3192, pruned_loss=0.08865, over 4268368.53 frames. ], batch size: 143, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:28:57,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1485402.0, ans=0.1 2023-06-23 12:29:13,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=15.0 2023-06-23 12:29:38,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.21 vs. limit=22.5 2023-06-23 12:29:44,819 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.87 vs. limit=6.0 2023-06-23 12:30:06,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=15.0 2023-06-23 12:30:08,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1485582.0, ans=0.125 2023-06-23 12:30:11,160 INFO [train.py:996] (2/4) Epoch 9, batch 3650, loss[loss=0.2831, simple_loss=0.354, pruned_loss=0.1061, over 21420.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3192, pruned_loss=0.08781, over 4267851.54 frames. ], batch size: 471, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:31:21,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1485822.0, ans=0.125 2023-06-23 12:31:22,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-23 12:31:34,330 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.982e+02 5.571e+02 7.883e+02 1.166e+03 2.519e+03, threshold=1.577e+03, percent-clipped=13.0 2023-06-23 12:31:52,183 INFO [train.py:996] (2/4) Epoch 9, batch 3700, loss[loss=0.2624, simple_loss=0.3255, pruned_loss=0.09968, over 21556.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3185, pruned_loss=0.0868, over 4268390.10 frames. 
], batch size: 131, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:32:07,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1485942.0, ans=0.125 2023-06-23 12:32:47,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1486062.0, ans=0.0 2023-06-23 12:32:55,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1486062.0, ans=0.125 2023-06-23 12:32:58,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1486122.0, ans=0.02 2023-06-23 12:33:03,699 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:33:24,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1486182.0, ans=0.125 2023-06-23 12:33:41,722 INFO [train.py:996] (2/4) Epoch 9, batch 3750, loss[loss=0.2973, simple_loss=0.3932, pruned_loss=0.1007, over 19874.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3199, pruned_loss=0.08732, over 4269623.13 frames. ], batch size: 702, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:33:47,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1486242.0, ans=0.2 2023-06-23 12:34:03,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1486302.0, ans=0.125 2023-06-23 12:34:54,204 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.489e+02 5.333e+02 7.689e+02 1.174e+03 2.476e+03, threshold=1.538e+03, percent-clipped=10.0 2023-06-23 12:35:02,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486482.0, ans=0.1 2023-06-23 12:35:08,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1486482.0, ans=0.125 2023-06-23 12:35:08,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1486482.0, ans=0.125 2023-06-23 12:35:22,201 INFO [train.py:996] (2/4) Epoch 9, batch 3800, loss[loss=0.2483, simple_loss=0.3251, pruned_loss=0.08578, over 21738.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3162, pruned_loss=0.08477, over 4272725.68 frames. ], batch size: 351, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:35:25,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1486542.0, ans=0.1 2023-06-23 12:35:29,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.37 vs. limit=22.5 2023-06-23 12:36:02,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. 
limit=15.0 2023-06-23 12:36:20,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1486722.0, ans=0.125 2023-06-23 12:36:38,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1486782.0, ans=0.1 2023-06-23 12:36:56,376 INFO [train.py:996] (2/4) Epoch 9, batch 3850, loss[loss=0.201, simple_loss=0.2595, pruned_loss=0.07123, over 21156.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3135, pruned_loss=0.08551, over 4272758.65 frames. ], batch size: 176, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:36:58,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1486842.0, ans=0.125 2023-06-23 12:37:37,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1486962.0, ans=0.1 2023-06-23 12:38:03,151 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.513e+02 4.770e+02 6.158e+02 8.423e+02 1.897e+03, threshold=1.232e+03, percent-clipped=2.0 2023-06-23 12:38:14,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1487082.0, ans=0.125 2023-06-23 12:38:25,940 INFO [train.py:996] (2/4) Epoch 9, batch 3900, loss[loss=0.2469, simple_loss=0.3042, pruned_loss=0.09482, over 21666.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3091, pruned_loss=0.08494, over 4271377.57 frames. ], batch size: 414, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:39:09,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1487262.0, ans=0.0 2023-06-23 12:39:12,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1487262.0, ans=0.125 2023-06-23 12:39:25,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1487322.0, ans=0.0 2023-06-23 12:39:26,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1487322.0, ans=0.125 2023-06-23 12:39:56,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=1487382.0, ans=15.0 2023-06-23 12:40:09,680 INFO [train.py:996] (2/4) Epoch 9, batch 3950, loss[loss=0.1947, simple_loss=0.2865, pruned_loss=0.05146, over 21800.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3122, pruned_loss=0.08452, over 4279883.00 frames. ], batch size: 371, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:41:00,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1487622.0, ans=0.1 2023-06-23 12:41:16,566 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.422e+02 5.095e+02 8.161e+02 1.017e+03 2.071e+03, threshold=1.632e+03, percent-clipped=17.0 2023-06-23 12:41:29,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1487682.0, ans=0.0 2023-06-23 12:41:34,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1487682.0, ans=0.1 2023-06-23 12:41:49,061 INFO [train.py:996] (2/4) Epoch 9, batch 4000, loss[loss=0.2689, simple_loss=0.3561, pruned_loss=0.0909, over 20744.00 frames. 
], tot_loss[loss=0.234, simple_loss=0.3055, pruned_loss=0.08122, over 4267152.09 frames. ], batch size: 608, lr: 3.36e-03, grad_scale: 32.0 2023-06-23 12:42:28,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1487862.0, ans=0.125 2023-06-23 12:43:11,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1487982.0, ans=0.0 2023-06-23 12:43:29,585 INFO [train.py:996] (2/4) Epoch 9, batch 4050, loss[loss=0.2287, simple_loss=0.2964, pruned_loss=0.08052, over 21358.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3046, pruned_loss=0.07949, over 4271622.38 frames. ], batch size: 176, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:43:29,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1488042.0, ans=0.125 2023-06-23 12:43:57,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.84 vs. limit=10.0 2023-06-23 12:44:26,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1488222.0, ans=0.0 2023-06-23 12:44:46,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1488222.0, ans=0.125 2023-06-23 12:44:48,945 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.596e+02 4.877e+02 6.874e+02 9.034e+02 2.185e+03, threshold=1.375e+03, percent-clipped=7.0 2023-06-23 12:44:57,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1488282.0, ans=0.125 2023-06-23 12:45:09,744 INFO [train.py:996] (2/4) Epoch 9, batch 4100, loss[loss=0.2694, simple_loss=0.3336, pruned_loss=0.1025, over 21939.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3064, pruned_loss=0.07936, over 4280101.44 frames. ], batch size: 107, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:45:44,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1488402.0, ans=0.04949747468305833 2023-06-23 12:46:02,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1488522.0, ans=0.125 2023-06-23 12:46:54,913 INFO [train.py:996] (2/4) Epoch 9, batch 4150, loss[loss=0.1895, simple_loss=0.2911, pruned_loss=0.04397, over 21631.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3059, pruned_loss=0.07622, over 4267349.53 frames. ], batch size: 247, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:47:03,612 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.66 vs. 
limit=15.0 2023-06-23 12:47:16,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1488702.0, ans=0.0 2023-06-23 12:47:22,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1488702.0, ans=0.0 2023-06-23 12:47:29,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1488762.0, ans=0.1 2023-06-23 12:47:58,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1488822.0, ans=0.125 2023-06-23 12:48:10,698 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.593e+02 5.949e+02 7.437e+02 1.328e+03 3.049e+03, threshold=1.487e+03, percent-clipped=21.0 2023-06-23 12:48:32,518 INFO [train.py:996] (2/4) Epoch 9, batch 4200, loss[loss=0.2412, simple_loss=0.3062, pruned_loss=0.08812, over 21481.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3063, pruned_loss=0.07636, over 4255787.73 frames. ], batch size: 212, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:48:36,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1488942.0, ans=0.2 2023-06-23 12:49:23,099 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-23 12:49:25,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1489062.0, ans=0.125 2023-06-23 12:49:46,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-23 12:50:14,810 INFO [train.py:996] (2/4) Epoch 9, batch 4250, loss[loss=0.2597, simple_loss=0.3291, pruned_loss=0.09516, over 21415.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.311, pruned_loss=0.07769, over 4254972.01 frames. ], batch size: 194, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:50:40,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-23 12:51:33,143 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 12:51:36,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1489422.0, ans=0.0 2023-06-23 12:51:43,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.972e+02 6.108e+02 8.578e+02 1.174e+03 2.664e+03, threshold=1.716e+03, percent-clipped=12.0 2023-06-23 12:51:43,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1489482.0, ans=0.2 2023-06-23 12:51:58,832 INFO [train.py:996] (2/4) Epoch 9, batch 4300, loss[loss=0.2131, simple_loss=0.2794, pruned_loss=0.07344, over 21195.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.32, pruned_loss=0.08078, over 4259915.58 frames. ], batch size: 608, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:52:33,208 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.07 vs. 
limit=15.0 2023-06-23 12:53:13,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1489722.0, ans=0.0 2023-06-23 12:53:33,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-23 12:53:40,703 INFO [train.py:996] (2/4) Epoch 9, batch 4350, loss[loss=0.2025, simple_loss=0.274, pruned_loss=0.06553, over 21376.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3197, pruned_loss=0.08017, over 4261255.14 frames. ], batch size: 131, lr: 3.36e-03, grad_scale: 8.0 2023-06-23 12:54:04,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-23 12:54:46,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1490022.0, ans=0.125 2023-06-23 12:55:05,708 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.397e+02 5.616e+02 8.582e+02 1.448e+03 3.184e+03, threshold=1.716e+03, percent-clipped=15.0 2023-06-23 12:55:30,737 INFO [train.py:996] (2/4) Epoch 9, batch 4400, loss[loss=0.2363, simple_loss=0.3254, pruned_loss=0.07354, over 19963.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3143, pruned_loss=0.07948, over 4255226.62 frames. ], batch size: 702, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:55:31,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.67 vs. limit=15.0 2023-06-23 12:55:34,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1490142.0, ans=0.125 2023-06-23 12:57:12,856 INFO [train.py:996] (2/4) Epoch 9, batch 4450, loss[loss=0.2568, simple_loss=0.3319, pruned_loss=0.09087, over 21280.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3224, pruned_loss=0.08192, over 4258513.24 frames. ], batch size: 159, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:58:08,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1490562.0, ans=0.125 2023-06-23 12:58:39,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.863e+02 6.461e+02 1.015e+03 1.659e+03 5.524e+03, threshold=2.029e+03, percent-clipped=20.0 2023-06-23 12:58:54,092 INFO [train.py:996] (2/4) Epoch 9, batch 4500, loss[loss=0.255, simple_loss=0.3209, pruned_loss=0.09453, over 21870.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3255, pruned_loss=0.08454, over 4268033.74 frames. 
], batch size: 124, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 12:58:54,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1490742.0, ans=0.2 2023-06-23 12:59:20,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1490802.0, ans=0.125 2023-06-23 13:00:17,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1490982.0, ans=0.1 2023-06-23 13:00:19,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1490982.0, ans=0.0 2023-06-23 13:00:42,544 INFO [train.py:996] (2/4) Epoch 9, batch 4550, loss[loss=0.3043, simple_loss=0.3679, pruned_loss=0.1203, over 21207.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3269, pruned_loss=0.08476, over 4270597.31 frames. ], batch size: 143, lr: 3.36e-03, grad_scale: 16.0 2023-06-23 13:00:55,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1491042.0, ans=0.125 2023-06-23 13:01:10,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1491102.0, ans=0.125 2023-06-23 13:01:22,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1491162.0, ans=0.0 2023-06-23 13:02:03,736 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 4.940e+02 6.297e+02 8.588e+02 1.962e+03, threshold=1.259e+03, percent-clipped=0.0 2023-06-23 13:02:06,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1491282.0, ans=0.04949747468305833 2023-06-23 13:02:27,867 INFO [train.py:996] (2/4) Epoch 9, batch 4600, loss[loss=0.2472, simple_loss=0.3195, pruned_loss=0.08749, over 21708.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3272, pruned_loss=0.08579, over 4273603.91 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:02:59,273 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:03:34,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.71 vs. limit=15.0 2023-06-23 13:04:01,930 INFO [train.py:996] (2/4) Epoch 9, batch 4650, loss[loss=0.1871, simple_loss=0.2624, pruned_loss=0.05588, over 20158.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3215, pruned_loss=0.08397, over 4276543.55 frames. ], batch size: 703, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:04:26,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1491702.0, ans=0.1 2023-06-23 13:04:45,398 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:05:03,361 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.00 vs. 
limit=15.0 2023-06-23 13:05:16,741 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.483e+02 4.910e+02 6.100e+02 8.336e+02 1.525e+03, threshold=1.220e+03, percent-clipped=3.0 2023-06-23 13:05:35,275 INFO [train.py:996] (2/4) Epoch 9, batch 4700, loss[loss=0.2196, simple_loss=0.2826, pruned_loss=0.07825, over 21784.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3123, pruned_loss=0.08242, over 4278661.93 frames. ], batch size: 317, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:05:39,326 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-23 13:05:47,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.45 vs. limit=15.0 2023-06-23 13:06:32,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1492122.0, ans=0.125 2023-06-23 13:07:07,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1492182.0, ans=0.0 2023-06-23 13:07:13,898 INFO [train.py:996] (2/4) Epoch 9, batch 4750, loss[loss=0.2008, simple_loss=0.2614, pruned_loss=0.07015, over 21594.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3065, pruned_loss=0.08245, over 4274360.94 frames. ], batch size: 298, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:07:29,119 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=12.0 2023-06-23 13:07:30,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1492302.0, ans=0.125 2023-06-23 13:07:33,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1492302.0, ans=0.0 2023-06-23 13:07:33,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1492302.0, ans=0.2 2023-06-23 13:08:20,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1492422.0, ans=0.0 2023-06-23 13:08:34,845 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.263e+02 4.594e+02 6.434e+02 8.978e+02 1.748e+03, threshold=1.287e+03, percent-clipped=12.0 2023-06-23 13:08:53,501 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.55 vs. limit=22.5 2023-06-23 13:08:54,107 INFO [train.py:996] (2/4) Epoch 9, batch 4800, loss[loss=0.2135, simple_loss=0.3105, pruned_loss=0.05825, over 21782.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3062, pruned_loss=0.08248, over 4281165.62 frames. 
], batch size: 351, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:08:57,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1492542.0, ans=0.125 2023-06-23 13:09:17,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1492602.0, ans=0.5 2023-06-23 13:09:32,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1492662.0, ans=0.125 2023-06-23 13:10:32,154 INFO [train.py:996] (2/4) Epoch 9, batch 4850, loss[loss=0.2419, simple_loss=0.3145, pruned_loss=0.08468, over 21855.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3055, pruned_loss=0.08194, over 4280343.53 frames. ], batch size: 124, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:11:04,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1492902.0, ans=0.125 2023-06-23 13:11:18,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1492962.0, ans=0.0 2023-06-23 13:11:38,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1493022.0, ans=0.125 2023-06-23 13:11:39,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=1493022.0, ans=15.0 2023-06-23 13:11:52,596 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.403e+02 5.446e+02 6.914e+02 1.031e+03 2.241e+03, threshold=1.383e+03, percent-clipped=12.0 2023-06-23 13:12:09,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1493142.0, ans=0.0 2023-06-23 13:12:10,720 INFO [train.py:996] (2/4) Epoch 9, batch 4900, loss[loss=0.2225, simple_loss=0.2719, pruned_loss=0.08656, over 20207.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3077, pruned_loss=0.08261, over 4276536.27 frames. ], batch size: 703, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:12:14,346 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:12:28,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1493202.0, ans=0.125 2023-06-23 13:13:08,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.50 vs. limit=15.0 2023-06-23 13:13:46,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.21 vs. limit=22.5 2023-06-23 13:13:50,326 INFO [train.py:996] (2/4) Epoch 9, batch 4950, loss[loss=0.2115, simple_loss=0.3017, pruned_loss=0.06068, over 21788.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3124, pruned_loss=0.08042, over 4283720.00 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:13:58,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1493442.0, ans=0.125 2023-06-23 13:14:19,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.73 vs. 
limit=15.0 2023-06-23 13:15:16,659 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.074e+02 4.885e+02 7.048e+02 1.090e+03 2.586e+03, threshold=1.410e+03, percent-clipped=12.0 2023-06-23 13:15:29,191 INFO [train.py:996] (2/4) Epoch 9, batch 5000, loss[loss=0.2446, simple_loss=0.3093, pruned_loss=0.08994, over 21314.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3104, pruned_loss=0.07695, over 4284588.28 frames. ], batch size: 143, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:15:40,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1493742.0, ans=0.125 2023-06-23 13:15:42,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=16.72 vs. limit=22.5 2023-06-23 13:15:45,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1493802.0, ans=0.125 2023-06-23 13:16:25,244 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:16:58,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-23 13:17:02,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1493982.0, ans=0.1 2023-06-23 13:17:07,592 INFO [train.py:996] (2/4) Epoch 9, batch 5050, loss[loss=0.2519, simple_loss=0.3197, pruned_loss=0.09203, over 21847.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3105, pruned_loss=0.07892, over 4286748.29 frames. ], batch size: 118, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:17:33,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1494102.0, ans=0.125 2023-06-23 13:17:40,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1494102.0, ans=0.0 2023-06-23 13:17:41,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1494102.0, ans=0.0 2023-06-23 13:18:19,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1494222.0, ans=0.125 2023-06-23 13:18:24,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.65 vs. limit=6.0 2023-06-23 13:18:28,087 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.800e+02 5.012e+02 6.560e+02 1.020e+03 2.026e+03, threshold=1.312e+03, percent-clipped=12.0 2023-06-23 13:18:37,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1494282.0, ans=0.125 2023-06-23 13:18:45,147 INFO [train.py:996] (2/4) Epoch 9, batch 5100, loss[loss=0.2052, simple_loss=0.2812, pruned_loss=0.06463, over 21822.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3095, pruned_loss=0.07986, over 4293045.67 frames. 
], batch size: 332, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:18:56,786 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 13:20:00,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1494522.0, ans=0.125 2023-06-23 13:20:25,859 INFO [train.py:996] (2/4) Epoch 9, batch 5150, loss[loss=0.2686, simple_loss=0.3379, pruned_loss=0.09969, over 21732.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3103, pruned_loss=0.08033, over 4287513.29 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:20:33,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1494642.0, ans=0.2 2023-06-23 13:21:53,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1494882.0, ans=0.125 2023-06-23 13:21:54,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.473e+02 5.169e+02 7.852e+02 1.261e+03 2.554e+03, threshold=1.570e+03, percent-clipped=23.0 2023-06-23 13:22:07,026 INFO [train.py:996] (2/4) Epoch 9, batch 5200, loss[loss=0.2294, simple_loss=0.3133, pruned_loss=0.07274, over 21275.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3149, pruned_loss=0.08101, over 4275662.06 frames. ], batch size: 159, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:22:48,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.09 vs. limit=22.5 2023-06-23 13:23:45,409 INFO [train.py:996] (2/4) Epoch 9, batch 5250, loss[loss=0.2313, simple_loss=0.3195, pruned_loss=0.07156, over 21831.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3196, pruned_loss=0.07934, over 4268707.53 frames. ], batch size: 316, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:24:33,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1495362.0, ans=0.125 2023-06-23 13:24:56,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1495422.0, ans=0.1 2023-06-23 13:24:58,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1495422.0, ans=0.2 2023-06-23 13:25:00,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1495422.0, ans=0.125 2023-06-23 13:25:00,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1495422.0, ans=0.125 2023-06-23 13:25:10,469 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.544e+02 5.638e+02 7.794e+02 1.189e+03 2.542e+03, threshold=1.559e+03, percent-clipped=12.0 2023-06-23 13:25:27,937 INFO [train.py:996] (2/4) Epoch 9, batch 5300, loss[loss=0.2238, simple_loss=0.3445, pruned_loss=0.05157, over 19806.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3183, pruned_loss=0.07948, over 4267279.35 frames. 
], batch size: 702, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:25:29,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1495542.0, ans=0.0 2023-06-23 13:25:32,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1495542.0, ans=0.125 2023-06-23 13:25:32,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1495542.0, ans=0.0 2023-06-23 13:26:32,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1495722.0, ans=0.125 2023-06-23 13:26:35,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1495722.0, ans=0.125 2023-06-23 13:27:01,883 INFO [train.py:996] (2/4) Epoch 9, batch 5350, loss[loss=0.2344, simple_loss=0.2946, pruned_loss=0.0871, over 21954.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3161, pruned_loss=0.08117, over 4282568.00 frames. ], batch size: 316, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:27:23,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1495842.0, ans=0.125 2023-06-23 13:27:34,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1495902.0, ans=0.2 2023-06-23 13:28:10,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1496022.0, ans=0.0 2023-06-23 13:28:20,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1496022.0, ans=0.125 2023-06-23 13:28:21,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1496022.0, ans=0.0 2023-06-23 13:28:28,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 6.687e+02 9.729e+02 1.336e+03 3.211e+03, threshold=1.946e+03, percent-clipped=15.0 2023-06-23 13:28:46,028 INFO [train.py:996] (2/4) Epoch 9, batch 5400, loss[loss=0.24, simple_loss=0.3083, pruned_loss=0.08585, over 21885.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3155, pruned_loss=0.08231, over 4283009.37 frames. ], batch size: 414, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:29:15,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1496202.0, ans=0.125 2023-06-23 13:29:23,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1496262.0, ans=0.09899494936611666 2023-06-23 13:30:30,277 INFO [train.py:996] (2/4) Epoch 9, batch 5450, loss[loss=0.2402, simple_loss=0.3317, pruned_loss=0.0743, over 21742.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3145, pruned_loss=0.08036, over 4286764.46 frames. 
], batch size: 247, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:30:38,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1496442.0, ans=0.1 2023-06-23 13:31:24,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1496562.0, ans=0.0 2023-06-23 13:31:48,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1496682.0, ans=0.125 2023-06-23 13:31:48,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1496682.0, ans=0.5 2023-06-23 13:31:53,797 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.097e+02 7.378e+02 1.209e+03 3.523e+03, threshold=1.476e+03, percent-clipped=4.0 2023-06-23 13:32:09,879 INFO [train.py:996] (2/4) Epoch 9, batch 5500, loss[loss=0.1893, simple_loss=0.2718, pruned_loss=0.05345, over 21088.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3184, pruned_loss=0.07731, over 4274359.44 frames. ], batch size: 143, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:32:10,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1496742.0, ans=0.125 2023-06-23 13:33:34,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1496982.0, ans=0.2 2023-06-23 13:33:50,878 INFO [train.py:996] (2/4) Epoch 9, batch 5550, loss[loss=0.2435, simple_loss=0.3333, pruned_loss=0.07688, over 21803.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3178, pruned_loss=0.0748, over 4276704.95 frames. ], batch size: 282, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:34:26,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1497102.0, ans=0.1 2023-06-23 13:35:17,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.428e+02 4.782e+02 7.322e+02 1.097e+03 2.363e+03, threshold=1.464e+03, percent-clipped=11.0 2023-06-23 13:35:38,006 INFO [train.py:996] (2/4) Epoch 9, batch 5600, loss[loss=0.2199, simple_loss=0.3101, pruned_loss=0.06489, over 20119.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3153, pruned_loss=0.0729, over 4272217.80 frames. ], batch size: 702, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:35:55,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1497402.0, ans=0.1 2023-06-23 13:36:19,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=22.5 2023-06-23 13:37:09,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-06-23 13:37:10,928 INFO [train.py:996] (2/4) Epoch 9, batch 5650, loss[loss=0.2276, simple_loss=0.314, pruned_loss=0.07063, over 21256.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3162, pruned_loss=0.07442, over 4274889.40 frames. 
], batch size: 176, lr: 3.35e-03, grad_scale: 32.0 2023-06-23 13:37:39,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1497702.0, ans=0.2 2023-06-23 13:37:49,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1497762.0, ans=0.04949747468305833 2023-06-23 13:38:24,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-23 13:38:34,961 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.803e+02 5.906e+02 7.888e+02 1.290e+03 2.997e+03, threshold=1.578e+03, percent-clipped=20.0 2023-06-23 13:38:46,232 INFO [train.py:996] (2/4) Epoch 9, batch 5700, loss[loss=0.2654, simple_loss=0.3836, pruned_loss=0.07359, over 21203.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3159, pruned_loss=0.0764, over 4279725.24 frames. ], batch size: 548, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:38:49,049 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.06 vs. limit=6.0 2023-06-23 13:39:14,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.09 vs. limit=15.0 2023-06-23 13:39:41,336 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-23 13:40:19,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.05 vs. limit=6.0 2023-06-23 13:40:30,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1498242.0, ans=0.2 2023-06-23 13:40:31,245 INFO [train.py:996] (2/4) Epoch 9, batch 5750, loss[loss=0.2197, simple_loss=0.3111, pruned_loss=0.06409, over 21691.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3105, pruned_loss=0.0737, over 4274580.84 frames. ], batch size: 351, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:40:44,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.71 vs. limit=15.0 2023-06-23 13:41:35,447 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-23 13:41:47,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1498482.0, ans=0.2 2023-06-23 13:41:52,307 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.170e+02 4.651e+02 7.542e+02 1.104e+03 3.145e+03, threshold=1.508e+03, percent-clipped=9.0 2023-06-23 13:42:06,746 INFO [train.py:996] (2/4) Epoch 9, batch 5800, loss[loss=0.2355, simple_loss=0.3403, pruned_loss=0.06532, over 21739.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3101, pruned_loss=0.07277, over 4264659.10 frames. ], batch size: 351, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:42:12,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.26 vs. 
limit=22.5 2023-06-23 13:42:25,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1498542.0, ans=0.125 2023-06-23 13:42:26,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1498602.0, ans=0.1 2023-06-23 13:43:04,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_abs, batch_count=1498662.0, ans=0.5 2023-06-23 13:43:24,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1498722.0, ans=0.125 2023-06-23 13:43:42,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1498782.0, ans=0.125 2023-06-23 13:43:52,433 INFO [train.py:996] (2/4) Epoch 9, batch 5850, loss[loss=0.1579, simple_loss=0.248, pruned_loss=0.0339, over 21421.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3088, pruned_loss=0.06875, over 4265895.23 frames. ], batch size: 131, lr: 3.35e-03, grad_scale: 8.0 2023-06-23 13:43:59,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.81 vs. limit=10.0 2023-06-23 13:44:12,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1498902.0, ans=0.125 2023-06-23 13:44:31,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1498902.0, ans=0.0 2023-06-23 13:45:03,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1499022.0, ans=0.125 2023-06-23 13:45:18,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.120e+02 4.169e+02 5.972e+02 8.890e+02 1.873e+03, threshold=1.194e+03, percent-clipped=6.0 2023-06-23 13:45:31,298 INFO [train.py:996] (2/4) Epoch 9, batch 5900, loss[loss=0.2021, simple_loss=0.2752, pruned_loss=0.06449, over 21704.00 frames. ], tot_loss[loss=0.2138, simple_loss=0.3007, pruned_loss=0.06347, over 4268192.70 frames. ], batch size: 230, lr: 3.35e-03, grad_scale: 8.0 2023-06-23 13:45:41,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1499142.0, ans=0.0 2023-06-23 13:46:23,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-23 13:47:10,314 INFO [train.py:996] (2/4) Epoch 9, batch 5950, loss[loss=0.2159, simple_loss=0.3398, pruned_loss=0.04596, over 19760.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.3002, pruned_loss=0.06637, over 4271125.41 frames. 
], batch size: 703, lr: 3.35e-03, grad_scale: 8.0 2023-06-23 13:47:42,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1499502.0, ans=0.125 2023-06-23 13:48:39,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1499682.0, ans=0.125 2023-06-23 13:48:40,987 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.364e+02 5.592e+02 7.986e+02 1.183e+03 2.385e+03, threshold=1.597e+03, percent-clipped=25.0 2023-06-23 13:48:48,921 INFO [train.py:996] (2/4) Epoch 9, batch 6000, loss[loss=0.2434, simple_loss=0.295, pruned_loss=0.0959, over 21451.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2975, pruned_loss=0.06985, over 4276511.90 frames. ], batch size: 441, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:48:48,922 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 13:49:03,299 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.1367, 2.2906, 1.8381, 2.6208, 1.4756, 2.4895, 2.2194, 2.1791], device='cuda:2') 2023-06-23 13:49:10,223 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2648, simple_loss=0.3557, pruned_loss=0.08691, over 1796401.00 frames. 2023-06-23 13:49:10,223 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 13:50:34,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1499982.0, ans=0.1 2023-06-23 13:50:49,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.33 vs. limit=15.0 2023-06-23 13:50:50,963 INFO [train.py:996] (2/4) Epoch 9, batch 6050, loss[loss=0.2009, simple_loss=0.2641, pruned_loss=0.06883, over 21279.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.2938, pruned_loss=0.07171, over 4271805.92 frames. ], batch size: 176, lr: 3.35e-03, grad_scale: 16.0 2023-06-23 13:51:39,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.23 vs. limit=15.0 2023-06-23 13:51:39,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1500162.0, ans=0.0 2023-06-23 13:51:43,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1500222.0, ans=0.125 2023-06-23 13:52:15,742 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.157e+02 5.140e+02 6.887e+02 9.775e+02 3.553e+03, threshold=1.377e+03, percent-clipped=5.0 2023-06-23 13:52:28,816 INFO [train.py:996] (2/4) Epoch 9, batch 6100, loss[loss=0.2468, simple_loss=0.3132, pruned_loss=0.09022, over 21823.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.292, pruned_loss=0.0706, over 4275461.04 frames. 
], batch size: 282, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:52:37,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1500342.0, ans=0.0 2023-06-23 13:53:24,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1500522.0, ans=0.125 2023-06-23 13:53:29,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1500522.0, ans=0.125 2023-06-23 13:53:34,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1500522.0, ans=15.0 2023-06-23 13:53:40,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1500582.0, ans=0.0 2023-06-23 13:54:01,335 INFO [train.py:996] (2/4) Epoch 9, batch 6150, loss[loss=0.2143, simple_loss=0.2923, pruned_loss=0.06809, over 21483.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2953, pruned_loss=0.07397, over 4286354.19 frames. ], batch size: 212, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:54:38,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1500702.0, ans=0.0 2023-06-23 13:55:06,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1500822.0, ans=0.2 2023-06-23 13:55:32,766 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.545e+02 5.300e+02 7.269e+02 1.178e+03 2.947e+03, threshold=1.454e+03, percent-clipped=13.0 2023-06-23 13:55:46,197 INFO [train.py:996] (2/4) Epoch 9, batch 6200, loss[loss=0.1817, simple_loss=0.2534, pruned_loss=0.05497, over 21187.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2998, pruned_loss=0.07522, over 4283108.92 frames. ], batch size: 143, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:55:48,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.24 vs. limit=15.0 2023-06-23 13:55:51,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1500942.0, ans=0.1 2023-06-23 13:55:59,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1500942.0, ans=0.05 2023-06-23 13:56:49,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1501122.0, ans=0.125 2023-06-23 13:56:51,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1501122.0, ans=0.125 2023-06-23 13:57:25,609 INFO [train.py:996] (2/4) Epoch 9, batch 6250, loss[loss=0.2241, simple_loss=0.2876, pruned_loss=0.08026, over 21212.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3025, pruned_loss=0.07386, over 4279926.07 frames. 
], batch size: 608, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:58:02,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1501362.0, ans=0.0 2023-06-23 13:58:28,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1501422.0, ans=0.1 2023-06-23 13:58:56,444 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.492e+02 5.754e+02 9.579e+02 1.636e+03 2.645e+03, threshold=1.916e+03, percent-clipped=27.0 2023-06-23 13:59:04,188 INFO [train.py:996] (2/4) Epoch 9, batch 6300, loss[loss=0.2544, simple_loss=0.3215, pruned_loss=0.09368, over 21730.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3063, pruned_loss=0.07339, over 4277299.74 frames. ], batch size: 389, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 13:59:23,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1501602.0, ans=0.125 2023-06-23 14:00:01,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1501722.0, ans=0.0 2023-06-23 14:00:45,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1501782.0, ans=0.125 2023-06-23 14:00:49,555 INFO [train.py:996] (2/4) Epoch 9, batch 6350, loss[loss=0.2352, simple_loss=0.3034, pruned_loss=0.0835, over 21455.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3099, pruned_loss=0.07783, over 4280102.73 frames. ], batch size: 548, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:00:58,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1501842.0, ans=0.2 2023-06-23 14:02:03,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1502022.0, ans=0.0 2023-06-23 14:02:23,998 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.408e+02 6.300e+02 8.863e+02 1.224e+03 2.908e+03, threshold=1.773e+03, percent-clipped=5.0 2023-06-23 14:02:27,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1502082.0, ans=0.1 2023-06-23 14:02:32,232 INFO [train.py:996] (2/4) Epoch 9, batch 6400, loss[loss=0.2887, simple_loss=0.3667, pruned_loss=0.1054, over 21542.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3164, pruned_loss=0.08242, over 4271542.85 frames. ], batch size: 414, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:02:32,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1502142.0, ans=0.1 2023-06-23 14:03:10,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1502202.0, ans=0.125 2023-06-23 14:03:45,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1502322.0, ans=0.0 2023-06-23 14:03:47,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1502322.0, ans=0.2 2023-06-23 14:04:07,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.02 vs. 
limit=22.5 2023-06-23 14:04:09,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1502442.0, ans=0.0 2023-06-23 14:04:10,695 INFO [train.py:996] (2/4) Epoch 9, batch 6450, loss[loss=0.2402, simple_loss=0.3093, pruned_loss=0.08558, over 21826.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.318, pruned_loss=0.08139, over 4278133.29 frames. ], batch size: 371, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:04:23,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1502442.0, ans=0.0 2023-06-23 14:04:25,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1502442.0, ans=0.07 2023-06-23 14:04:28,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1502442.0, ans=0.0 2023-06-23 14:04:49,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1502502.0, ans=0.125 2023-06-23 14:05:14,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1502622.0, ans=0.125 2023-06-23 14:05:42,798 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 5.544e+02 7.373e+02 1.174e+03 2.232e+03, threshold=1.475e+03, percent-clipped=4.0 2023-06-23 14:05:46,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1502682.0, ans=0.125 2023-06-23 14:05:51,129 INFO [train.py:996] (2/4) Epoch 9, batch 6500, loss[loss=0.2069, simple_loss=0.2808, pruned_loss=0.06645, over 21398.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3108, pruned_loss=0.07966, over 4282545.23 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:07:01,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1502922.0, ans=0.125 2023-06-23 14:07:10,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.78 vs. limit=15.0 2023-06-23 14:07:17,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1502982.0, ans=0.125 2023-06-23 14:07:35,326 INFO [train.py:996] (2/4) Epoch 9, batch 6550, loss[loss=0.2093, simple_loss=0.3012, pruned_loss=0.05874, over 21609.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3128, pruned_loss=0.07932, over 4278975.59 frames. ], batch size: 263, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:07:35,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1503042.0, ans=0.0 2023-06-23 14:07:35,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1503042.0, ans=0.125 2023-06-23 14:07:35,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1503042.0, ans=0.0 2023-06-23 14:07:59,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. 
limit=6.0 2023-06-23 14:08:05,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1503102.0, ans=0.0 2023-06-23 14:08:39,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1503222.0, ans=0.125 2023-06-23 14:09:02,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.667e+02 5.627e+02 7.547e+02 1.040e+03 2.189e+03, threshold=1.509e+03, percent-clipped=8.0 2023-06-23 14:09:15,127 INFO [train.py:996] (2/4) Epoch 9, batch 6600, loss[loss=0.2019, simple_loss=0.2647, pruned_loss=0.06953, over 21744.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3058, pruned_loss=0.07861, over 4285939.95 frames. ], batch size: 112, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:10:09,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1503462.0, ans=0.1 2023-06-23 14:10:19,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1503522.0, ans=0.125 2023-06-23 14:10:55,766 INFO [train.py:996] (2/4) Epoch 9, batch 6650, loss[loss=0.1849, simple_loss=0.2688, pruned_loss=0.05051, over 21798.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2997, pruned_loss=0.07574, over 4276702.94 frames. ], batch size: 352, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:11:02,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1503642.0, ans=0.125 2023-06-23 14:11:43,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1503762.0, ans=0.125 2023-06-23 14:12:22,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1503882.0, ans=0.025 2023-06-23 14:12:25,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1503882.0, ans=0.0 2023-06-23 14:12:29,794 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 5.915e+02 8.562e+02 1.227e+03 3.234e+03, threshold=1.712e+03, percent-clipped=18.0 2023-06-23 14:12:36,181 INFO [train.py:996] (2/4) Epoch 9, batch 6700, loss[loss=0.2196, simple_loss=0.294, pruned_loss=0.07259, over 21636.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2947, pruned_loss=0.07477, over 4270000.08 frames. ], batch size: 415, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:12:43,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-23 14:12:44,882 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-23 14:12:57,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1504002.0, ans=0.0 2023-06-23 14:13:58,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1504182.0, ans=0.2 2023-06-23 14:14:14,333 INFO [train.py:996] (2/4) Epoch 9, batch 6750, loss[loss=0.2618, simple_loss=0.3163, pruned_loss=0.1037, over 21333.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2935, pruned_loss=0.07545, over 4273041.93 frames. 
], batch size: 159, lr: 3.34e-03, grad_scale: 8.0 2023-06-23 14:14:16,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1504242.0, ans=10.0 2023-06-23 14:15:10,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.12 vs. limit=22.5 2023-06-23 14:15:48,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.896e+02 6.568e+02 9.733e+02 1.340e+03 2.605e+03, threshold=1.947e+03, percent-clipped=12.0 2023-06-23 14:15:53,568 INFO [train.py:996] (2/4) Epoch 9, batch 6800, loss[loss=0.2252, simple_loss=0.2859, pruned_loss=0.08223, over 21494.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2971, pruned_loss=0.07761, over 4260998.37 frames. ], batch size: 389, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:15:58,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1504542.0, ans=0.2 2023-06-23 14:16:18,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1504602.0, ans=0.125 2023-06-23 14:16:42,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.73 vs. limit=22.5 2023-06-23 14:17:18,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1504782.0, ans=0.125 2023-06-23 14:17:32,354 INFO [train.py:996] (2/4) Epoch 9, batch 6850, loss[loss=0.2514, simple_loss=0.3131, pruned_loss=0.09485, over 21764.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2952, pruned_loss=0.07869, over 4273895.92 frames. ], batch size: 441, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:17:35,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1504842.0, ans=0.2 2023-06-23 14:17:36,161 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-23 14:17:43,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1504842.0, ans=0.1 2023-06-23 14:17:48,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1504902.0, ans=0.2 2023-06-23 14:17:52,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1504902.0, ans=0.125 2023-06-23 14:18:08,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1504962.0, ans=0.0 2023-06-23 14:18:23,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1504962.0, ans=0.0 2023-06-23 14:18:30,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.99 vs. 
limit=15.0 2023-06-23 14:18:47,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1505022.0, ans=0.125 2023-06-23 14:18:50,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1505082.0, ans=0.125 2023-06-23 14:19:07,208 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.795e+02 4.770e+02 6.261e+02 9.211e+02 1.923e+03, threshold=1.252e+03, percent-clipped=0.0 2023-06-23 14:19:12,176 INFO [train.py:996] (2/4) Epoch 9, batch 6900, loss[loss=0.2198, simple_loss=0.2874, pruned_loss=0.07608, over 21874.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2956, pruned_loss=0.07764, over 4279289.79 frames. ], batch size: 351, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:20:20,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.38 vs. limit=12.0 2023-06-23 14:20:43,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1505382.0, ans=0.0 2023-06-23 14:20:45,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-23 14:20:46,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1505382.0, ans=0.035 2023-06-23 14:20:51,938 INFO [train.py:996] (2/4) Epoch 9, batch 6950, loss[loss=0.2468, simple_loss=0.3147, pruned_loss=0.08948, over 21626.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2985, pruned_loss=0.07548, over 4282918.09 frames. ], batch size: 263, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:21:05,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1505442.0, ans=0.0 2023-06-23 14:21:11,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1505502.0, ans=0.125 2023-06-23 14:21:16,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1505502.0, ans=0.125 2023-06-23 14:21:23,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1505502.0, ans=0.1 2023-06-23 14:21:48,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1505562.0, ans=0.125 2023-06-23 14:22:13,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1505682.0, ans=0.125 2023-06-23 14:22:15,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1505682.0, ans=0.04949747468305833 2023-06-23 14:22:26,501 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.060e+02 5.568e+02 8.095e+02 1.122e+03 2.896e+03, threshold=1.619e+03, percent-clipped=20.0 2023-06-23 14:22:30,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1505742.0, ans=0.0 2023-06-23 14:22:30,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1505742.0, ans=0.125 2023-06-23 14:22:31,427 INFO [train.py:996] (2/4) Epoch 9, batch 7000, loss[loss=0.2314, simple_loss=0.2836, 
pruned_loss=0.08964, over 21310.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3008, pruned_loss=0.07745, over 4279567.08 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:22:32,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1505742.0, ans=0.125 2023-06-23 14:22:39,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1505742.0, ans=0.0 2023-06-23 14:23:10,297 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:24:16,394 INFO [train.py:996] (2/4) Epoch 9, batch 7050, loss[loss=0.2307, simple_loss=0.3041, pruned_loss=0.07867, over 21159.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2984, pruned_loss=0.07598, over 4276944.81 frames. ], batch size: 607, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:25:23,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1506222.0, ans=0.0 2023-06-23 14:25:50,179 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.514e+02 5.176e+02 7.948e+02 1.176e+03 2.286e+03, threshold=1.590e+03, percent-clipped=9.0 2023-06-23 14:25:55,055 INFO [train.py:996] (2/4) Epoch 9, batch 7100, loss[loss=0.1942, simple_loss=0.2778, pruned_loss=0.05536, over 21681.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3029, pruned_loss=0.0776, over 4272521.00 frames. ], batch size: 298, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:25:56,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.64 vs. limit=10.0 2023-06-23 14:26:00,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.57 vs. limit=15.0 2023-06-23 14:26:14,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1506402.0, ans=0.2 2023-06-23 14:26:56,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1506522.0, ans=0.125 2023-06-23 14:27:20,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1506582.0, ans=15.0 2023-06-23 14:27:35,241 INFO [train.py:996] (2/4) Epoch 9, batch 7150, loss[loss=0.1514, simple_loss=0.2238, pruned_loss=0.03952, over 21282.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.3001, pruned_loss=0.07533, over 4275897.46 frames. ], batch size: 176, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:27:37,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1506642.0, ans=0.125 2023-06-23 14:28:24,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1506762.0, ans=0.0 2023-06-23 14:28:41,339 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-23 14:28:42,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.07 vs. 
limit=15.0 2023-06-23 14:29:04,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=15.0 2023-06-23 14:29:08,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1506882.0, ans=0.1 2023-06-23 14:29:10,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.602e+02 5.826e+02 7.818e+02 1.087e+03 2.405e+03, threshold=1.564e+03, percent-clipped=10.0 2023-06-23 14:29:20,986 INFO [train.py:996] (2/4) Epoch 9, batch 7200, loss[loss=0.2439, simple_loss=0.2984, pruned_loss=0.09466, over 21275.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.303, pruned_loss=0.0783, over 4281326.64 frames. ], batch size: 159, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:29:35,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1506942.0, ans=0.125 2023-06-23 14:30:00,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1507002.0, ans=0.125 2023-06-23 14:30:08,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1507062.0, ans=0.1 2023-06-23 14:30:29,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1507122.0, ans=0.0 2023-06-23 14:31:00,732 INFO [train.py:996] (2/4) Epoch 9, batch 7250, loss[loss=0.2212, simple_loss=0.2806, pruned_loss=0.08086, over 21586.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.2992, pruned_loss=0.07917, over 4287074.32 frames. ], batch size: 298, lr: 3.34e-03, grad_scale: 32.0 2023-06-23 14:31:11,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1507242.0, ans=0.1 2023-06-23 14:31:23,926 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.76 vs. limit=6.0 2023-06-23 14:31:27,318 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=15.0 2023-06-23 14:31:40,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1507302.0, ans=0.2 2023-06-23 14:31:42,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1507362.0, ans=0.1 2023-06-23 14:31:55,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1507362.0, ans=0.04949747468305833 2023-06-23 14:32:22,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1507482.0, ans=0.0 2023-06-23 14:32:37,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.651e+02 4.870e+02 5.610e+02 7.177e+02 1.494e+03, threshold=1.122e+03, percent-clipped=0.0 2023-06-23 14:32:37,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1507482.0, ans=0.125 2023-06-23 14:32:44,948 INFO [train.py:996] (2/4) Epoch 9, batch 7300, loss[loss=0.2086, simple_loss=0.2759, pruned_loss=0.07067, over 21669.00 frames. 
], tot_loss[loss=0.2255, simple_loss=0.2939, pruned_loss=0.0786, over 4285002.57 frames. ], batch size: 333, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:33:49,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1507722.0, ans=0.0 2023-06-23 14:34:21,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1507782.0, ans=0.125 2023-06-23 14:34:25,781 INFO [train.py:996] (2/4) Epoch 9, batch 7350, loss[loss=0.2732, simple_loss=0.3285, pruned_loss=0.1089, over 21455.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2938, pruned_loss=0.07974, over 4281929.56 frames. ], batch size: 194, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:36:02,845 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.713e+02 6.185e+02 8.335e+02 1.224e+03 2.285e+03, threshold=1.667e+03, percent-clipped=37.0 2023-06-23 14:36:06,143 INFO [train.py:996] (2/4) Epoch 9, batch 7400, loss[loss=0.2139, simple_loss=0.2933, pruned_loss=0.0672, over 21560.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.2999, pruned_loss=0.08131, over 4280484.94 frames. ], batch size: 230, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:36:10,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.51 vs. limit=10.0 2023-06-23 14:36:58,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1508262.0, ans=0.1 2023-06-23 14:37:31,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.22 vs. limit=15.0 2023-06-23 14:37:47,569 INFO [train.py:996] (2/4) Epoch 9, batch 7450, loss[loss=0.2188, simple_loss=0.2779, pruned_loss=0.07979, over 15529.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.2985, pruned_loss=0.08068, over 4259809.38 frames. ], batch size: 62, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:37:52,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1508442.0, ans=0.0 2023-06-23 14:38:03,425 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.08 vs. limit=15.0 2023-06-23 14:38:35,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-23 14:39:26,564 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.876e+02 5.448e+02 8.462e+02 1.438e+03 2.608e+03, threshold=1.692e+03, percent-clipped=12.0 2023-06-23 14:39:35,271 INFO [train.py:996] (2/4) Epoch 9, batch 7500, loss[loss=0.279, simple_loss=0.3797, pruned_loss=0.08916, over 21915.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3065, pruned_loss=0.08243, over 4266622.53 frames. ], batch size: 372, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:39:41,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.32 vs. 
limit=15.0 2023-06-23 14:40:15,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1508862.0, ans=0.0 2023-06-23 14:40:38,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=15.0 2023-06-23 14:40:49,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1508922.0, ans=0.025 2023-06-23 14:40:50,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1508922.0, ans=0.125 2023-06-23 14:40:56,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1508982.0, ans=0.1 2023-06-23 14:41:16,334 INFO [train.py:996] (2/4) Epoch 9, batch 7550, loss[loss=0.2447, simple_loss=0.3415, pruned_loss=0.07394, over 21656.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3124, pruned_loss=0.08075, over 4269258.87 frames. ], batch size: 414, lr: 3.34e-03, grad_scale: 16.0 2023-06-23 14:41:18,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1509042.0, ans=0.0 2023-06-23 14:41:48,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.78 vs. limit=6.0 2023-06-23 14:41:52,118 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-23 14:42:48,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1509282.0, ans=0.2 2023-06-23 14:42:52,862 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.575e+02 5.410e+02 7.103e+02 1.048e+03 2.085e+03, threshold=1.421e+03, percent-clipped=3.0 2023-06-23 14:42:53,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1509282.0, ans=10.0 2023-06-23 14:42:56,182 INFO [train.py:996] (2/4) Epoch 9, batch 7600, loss[loss=0.2239, simple_loss=0.2916, pruned_loss=0.0781, over 21327.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3115, pruned_loss=0.07925, over 4275122.05 frames. ], batch size: 159, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:43:19,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1509402.0, ans=0.2 2023-06-23 14:43:42,073 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 14:44:09,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1509522.0, ans=0.125 2023-06-23 14:44:29,682 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.76 vs. limit=15.0 2023-06-23 14:44:37,240 INFO [train.py:996] (2/4) Epoch 9, batch 7650, loss[loss=0.2444, simple_loss=0.3176, pruned_loss=0.08563, over 21878.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3098, pruned_loss=0.08071, over 4288938.48 frames. 
], batch size: 118, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:44:55,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1509642.0, ans=0.035 2023-06-23 14:44:58,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1509702.0, ans=0.2 2023-06-23 14:45:07,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.60 vs. limit=15.0 2023-06-23 14:45:16,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1509762.0, ans=0.2 2023-06-23 14:45:16,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-23 14:45:49,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-23 14:45:49,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.49 vs. limit=15.0 2023-06-23 14:46:11,179 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-23 14:46:14,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-23 14:46:15,407 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.771e+02 5.554e+02 6.899e+02 1.039e+03 2.407e+03, threshold=1.380e+03, percent-clipped=12.0 2023-06-23 14:46:18,613 INFO [train.py:996] (2/4) Epoch 9, batch 7700, loss[loss=0.2289, simple_loss=0.3043, pruned_loss=0.07677, over 21782.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3121, pruned_loss=0.08324, over 4293838.60 frames. ], batch size: 247, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:46:30,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1509942.0, ans=0.0 2023-06-23 14:47:51,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1510182.0, ans=0.2 2023-06-23 14:48:00,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1510182.0, ans=0.05 2023-06-23 14:48:05,168 INFO [train.py:996] (2/4) Epoch 9, batch 7750, loss[loss=0.331, simple_loss=0.4264, pruned_loss=0.1178, over 21510.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3172, pruned_loss=0.08321, over 4288178.95 frames. ], batch size: 471, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:48:05,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1510242.0, ans=0.0 2023-06-23 14:49:40,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1510482.0, ans=0.0 2023-06-23 14:49:44,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.753e+02 6.023e+02 8.792e+02 1.462e+03 2.647e+03, threshold=1.758e+03, percent-clipped=26.0 2023-06-23 14:49:46,172 INFO [train.py:996] (2/4) Epoch 9, batch 7800, loss[loss=0.1785, simple_loss=0.2078, pruned_loss=0.07462, over 16618.00 frames. 
], tot_loss[loss=0.2436, simple_loss=0.3193, pruned_loss=0.08399, over 4273187.87 frames. ], batch size: 61, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:50:34,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1510662.0, ans=0.125 2023-06-23 14:50:46,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1510662.0, ans=0.1 2023-06-23 14:50:51,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.17 vs. limit=10.0 2023-06-23 14:51:24,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-23 14:51:25,366 INFO [train.py:996] (2/4) Epoch 9, batch 7850, loss[loss=0.2324, simple_loss=0.2767, pruned_loss=0.09399, over 21333.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3134, pruned_loss=0.0832, over 4264496.67 frames. ], batch size: 177, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:51:38,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1510842.0, ans=0.1 2023-06-23 14:52:15,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1510962.0, ans=0.125 2023-06-23 14:53:05,532 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.699e+02 5.401e+02 8.491e+02 1.335e+03 3.211e+03, threshold=1.698e+03, percent-clipped=14.0 2023-06-23 14:53:07,030 INFO [train.py:996] (2/4) Epoch 9, batch 7900, loss[loss=0.253, simple_loss=0.3515, pruned_loss=0.07727, over 21777.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3097, pruned_loss=0.0826, over 4260294.04 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:54:48,981 INFO [train.py:996] (2/4) Epoch 9, batch 7950, loss[loss=0.2492, simple_loss=0.3359, pruned_loss=0.0812, over 21649.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3119, pruned_loss=0.08174, over 4253447.59 frames. ], batch size: 389, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 14:56:20,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.17 vs. limit=15.0 2023-06-23 14:56:40,207 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.215e+02 6.156e+02 9.056e+02 1.636e+03 2.892e+03, threshold=1.811e+03, percent-clipped=22.0 2023-06-23 14:56:41,919 INFO [train.py:996] (2/4) Epoch 9, batch 8000, loss[loss=0.3035, simple_loss=0.3904, pruned_loss=0.1083, over 21432.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3169, pruned_loss=0.08424, over 4257489.51 frames. 
], batch size: 507, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:56:49,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1511742.0, ans=0.125 2023-06-23 14:56:51,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1511742.0, ans=0.0 2023-06-23 14:57:46,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1511922.0, ans=0.0 2023-06-23 14:58:02,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1511922.0, ans=0.125 2023-06-23 14:58:32,840 INFO [train.py:996] (2/4) Epoch 9, batch 8050, loss[loss=0.3358, simple_loss=0.4125, pruned_loss=0.1296, over 21504.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3224, pruned_loss=0.08498, over 4257471.91 frames. ], batch size: 471, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 14:58:33,970 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.52 vs. limit=22.5 2023-06-23 14:59:19,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.min_positive, batch_count=1512162.0, ans=0.05 2023-06-23 14:59:22,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1512162.0, ans=0.0 2023-06-23 14:59:24,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=12.0 2023-06-23 14:59:47,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-23 15:00:10,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.139e+02 6.350e+02 8.777e+02 1.241e+03 2.449e+03, threshold=1.755e+03, percent-clipped=9.0 2023-06-23 15:00:12,597 INFO [train.py:996] (2/4) Epoch 9, batch 8100, loss[loss=0.2401, simple_loss=0.3245, pruned_loss=0.07787, over 21903.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3201, pruned_loss=0.08517, over 4264098.35 frames. ], batch size: 118, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:01:08,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1512462.0, ans=0.1 2023-06-23 15:01:39,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.70 vs. limit=6.0 2023-06-23 15:01:58,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.42 vs. limit=10.0 2023-06-23 15:02:01,801 INFO [train.py:996] (2/4) Epoch 9, batch 8150, loss[loss=0.2384, simple_loss=0.3289, pruned_loss=0.07397, over 21757.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3285, pruned_loss=0.08676, over 4260922.52 frames. 
], batch size: 332, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:02:11,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1512642.0, ans=0.0 2023-06-23 15:03:28,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1512882.0, ans=0.0 2023-06-23 15:03:35,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.66 vs. limit=6.0 2023-06-23 15:03:40,780 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.771e+02 6.551e+02 1.054e+03 1.725e+03 4.751e+03, threshold=2.109e+03, percent-clipped=24.0 2023-06-23 15:03:40,802 INFO [train.py:996] (2/4) Epoch 9, batch 8200, loss[loss=0.1953, simple_loss=0.2555, pruned_loss=0.06753, over 21194.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3201, pruned_loss=0.08376, over 4266041.67 frames. ], batch size: 159, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:04:12,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1513002.0, ans=0.1 2023-06-23 15:05:11,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1513182.0, ans=0.125 2023-06-23 15:05:22,544 INFO [train.py:996] (2/4) Epoch 9, batch 8250, loss[loss=0.2273, simple_loss=0.3189, pruned_loss=0.06779, over 21784.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3161, pruned_loss=0.08252, over 4261692.57 frames. ], batch size: 282, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:06:27,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1513422.0, ans=0.5 2023-06-23 15:06:43,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1513422.0, ans=0.2 2023-06-23 15:07:01,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1513482.0, ans=0.125 2023-06-23 15:07:04,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.470e+02 6.606e+02 8.935e+02 1.467e+03 2.616e+03, threshold=1.787e+03, percent-clipped=8.0 2023-06-23 15:07:04,507 INFO [train.py:996] (2/4) Epoch 9, batch 8300, loss[loss=0.2211, simple_loss=0.3073, pruned_loss=0.06745, over 21720.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3141, pruned_loss=0.07938, over 4260327.39 frames. ], batch size: 332, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:08:18,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1513722.0, ans=0.125 2023-06-23 15:08:35,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1513782.0, ans=0.1 2023-06-23 15:08:49,802 INFO [train.py:996] (2/4) Epoch 9, batch 8350, loss[loss=0.1982, simple_loss=0.2805, pruned_loss=0.05798, over 21369.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3123, pruned_loss=0.07764, over 4269629.02 frames. 
], batch size: 131, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:09:26,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1513962.0, ans=0.125 2023-06-23 15:09:46,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1513962.0, ans=0.125 2023-06-23 15:10:16,558 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:10:28,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.68 vs. limit=15.0 2023-06-23 15:10:30,567 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.376e+02 4.465e+02 5.586e+02 8.616e+02 2.675e+03, threshold=1.117e+03, percent-clipped=3.0 2023-06-23 15:10:30,588 INFO [train.py:996] (2/4) Epoch 9, batch 8400, loss[loss=0.1811, simple_loss=0.2785, pruned_loss=0.04187, over 21750.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3103, pruned_loss=0.07513, over 4274951.80 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:11:36,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1514322.0, ans=0.125 2023-06-23 15:11:47,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1514382.0, ans=0.2 2023-06-23 15:12:09,840 INFO [train.py:996] (2/4) Epoch 9, batch 8450, loss[loss=0.2524, simple_loss=0.3211, pruned_loss=0.09184, over 17099.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3085, pruned_loss=0.07495, over 4275604.08 frames. ], batch size: 60, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:12:49,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1514562.0, ans=0.125 2023-06-23 15:13:13,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1514622.0, ans=0.125 2023-06-23 15:13:28,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1514682.0, ans=0.125 2023-06-23 15:13:48,687 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-23 15:13:49,142 INFO [train.py:996] (2/4) Epoch 9, batch 8500, loss[loss=0.2321, simple_loss=0.3011, pruned_loss=0.08161, over 21633.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.305, pruned_loss=0.07557, over 4276336.29 frames. 
], batch size: 391, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:13:50,605 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.348e+02 5.860e+02 7.972e+02 1.284e+03 3.475e+03, threshold=1.594e+03, percent-clipped=30.0 2023-06-23 15:13:52,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1514742.0, ans=0.125 2023-06-23 15:14:17,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer_na.min_abs, batch_count=1514802.0, ans=0.02 2023-06-23 15:14:21,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1514802.0, ans=0.0 2023-06-23 15:14:25,173 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.81 vs. limit=15.0 2023-06-23 15:15:18,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1514982.0, ans=0.125 2023-06-23 15:15:29,029 INFO [train.py:996] (2/4) Epoch 9, batch 8550, loss[loss=0.2599, simple_loss=0.3474, pruned_loss=0.08624, over 21746.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3094, pruned_loss=0.0785, over 4273234.30 frames. ], batch size: 351, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:15:37,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1515042.0, ans=0.1 2023-06-23 15:16:01,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.38 vs. limit=15.0 2023-06-23 15:16:56,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1515282.0, ans=0.125 2023-06-23 15:16:57,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-23 15:17:02,931 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.51 vs. limit=15.0 2023-06-23 15:17:16,067 INFO [train.py:996] (2/4) Epoch 9, batch 8600, loss[loss=0.2418, simple_loss=0.3708, pruned_loss=0.05638, over 19834.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3177, pruned_loss=0.08072, over 4261696.53 frames. ], batch size: 702, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:17:17,699 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.589e+02 6.156e+02 8.850e+02 1.190e+03 2.823e+03, threshold=1.770e+03, percent-clipped=15.0 2023-06-23 15:17:58,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.24 vs. limit=15.0 2023-06-23 15:18:00,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.90 vs. 
limit=15.0 2023-06-23 15:18:32,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1515522.0, ans=0.1 2023-06-23 15:18:50,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1515582.0, ans=0.125 2023-06-23 15:18:58,338 INFO [train.py:996] (2/4) Epoch 9, batch 8650, loss[loss=0.2097, simple_loss=0.3112, pruned_loss=0.05412, over 21643.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3239, pruned_loss=0.0812, over 4269670.62 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:19:00,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.50 vs. limit=12.0 2023-06-23 15:19:50,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1515762.0, ans=0.2 2023-06-23 15:20:29,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1515882.0, ans=0.07 2023-06-23 15:20:37,553 INFO [train.py:996] (2/4) Epoch 9, batch 8700, loss[loss=0.2267, simple_loss=0.278, pruned_loss=0.08772, over 20267.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3152, pruned_loss=0.07838, over 4272365.03 frames. ], batch size: 703, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:20:39,032 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 5.219e+02 7.580e+02 1.289e+03 2.063e+03, threshold=1.516e+03, percent-clipped=5.0 2023-06-23 15:20:59,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.91 vs. limit=15.0 2023-06-23 15:21:21,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.86 vs. limit=15.0 2023-06-23 15:21:39,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.31 vs. limit=15.0 2023-06-23 15:21:56,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1516182.0, ans=0.125 2023-06-23 15:22:09,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1516182.0, ans=0.5 2023-06-23 15:22:16,386 INFO [train.py:996] (2/4) Epoch 9, batch 8750, loss[loss=0.2473, simple_loss=0.3288, pruned_loss=0.08289, over 21467.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3114, pruned_loss=0.0798, over 4280004.69 frames. ], batch size: 548, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:22:28,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1516242.0, ans=0.125 2023-06-23 15:22:56,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1516302.0, ans=0.2 2023-06-23 15:23:08,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.83 vs. 
limit=15.0 2023-06-23 15:23:39,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1516422.0, ans=0.0 2023-06-23 15:23:59,297 INFO [train.py:996] (2/4) Epoch 9, batch 8800, loss[loss=0.258, simple_loss=0.3774, pruned_loss=0.06929, over 19818.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3188, pruned_loss=0.08237, over 4277742.77 frames. ], batch size: 702, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:24:00,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.692e+02 5.630e+02 7.362e+02 1.054e+03 2.858e+03, threshold=1.472e+03, percent-clipped=8.0 2023-06-23 15:24:33,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1516602.0, ans=0.125 2023-06-23 15:24:43,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1516662.0, ans=0.125 2023-06-23 15:25:22,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.47 vs. limit=15.0 2023-06-23 15:25:43,310 INFO [train.py:996] (2/4) Epoch 9, batch 8850, loss[loss=0.2694, simple_loss=0.3444, pruned_loss=0.09716, over 21683.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.327, pruned_loss=0.08369, over 4274040.55 frames. ], batch size: 441, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:26:02,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1516902.0, ans=0.125 2023-06-23 15:26:32,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1516962.0, ans=0.125 2023-06-23 15:26:58,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.38 vs. limit=15.0 2023-06-23 15:27:23,366 INFO [train.py:996] (2/4) Epoch 9, batch 8900, loss[loss=0.2135, simple_loss=0.2897, pruned_loss=0.06869, over 21624.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.323, pruned_loss=0.08354, over 4270787.74 frames. ], batch size: 298, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:27:30,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.861e+02 8.789e+02 1.394e+03 2.613e+03, threshold=1.758e+03, percent-clipped=19.0 2023-06-23 15:27:33,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1517142.0, ans=0.2 2023-06-23 15:27:37,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.89 vs. limit=15.0 2023-06-23 15:28:25,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1517262.0, ans=0.1 2023-06-23 15:28:39,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1517322.0, ans=0.1 2023-06-23 15:29:08,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=12.13 vs. limit=22.5 2023-06-23 15:29:10,568 INFO [train.py:996] (2/4) Epoch 9, batch 8950, loss[loss=0.2597, simple_loss=0.3669, pruned_loss=0.07626, over 21192.00 frames. 
], tot_loss[loss=0.2451, simple_loss=0.3242, pruned_loss=0.08302, over 4273070.96 frames. ], batch size: 549, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:29:15,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1517442.0, ans=0.125 2023-06-23 15:29:18,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1517442.0, ans=0.125 2023-06-23 15:30:49,404 INFO [train.py:996] (2/4) Epoch 9, batch 9000, loss[loss=0.2114, simple_loss=0.2915, pruned_loss=0.06561, over 21670.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3172, pruned_loss=0.08189, over 4266310.03 frames. ], batch size: 282, lr: 3.33e-03, grad_scale: 32.0 2023-06-23 15:30:49,405 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 15:31:06,589 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.258, simple_loss=0.3541, pruned_loss=0.08091, over 1796401.00 frames. 2023-06-23 15:31:06,590 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 15:31:08,192 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.792e+02 6.929e+02 1.126e+03 1.882e+03 3.988e+03, threshold=2.252e+03, percent-clipped=24.0 2023-06-23 15:31:18,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1517742.0, ans=0.125 2023-06-23 15:31:49,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1517862.0, ans=0.0 2023-06-23 15:32:17,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1517922.0, ans=0.2 2023-06-23 15:32:22,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1517922.0, ans=0.125 2023-06-23 15:32:43,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1517982.0, ans=0.0 2023-06-23 15:32:53,763 INFO [train.py:996] (2/4) Epoch 9, batch 9050, loss[loss=0.2322, simple_loss=0.3089, pruned_loss=0.07776, over 21436.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3125, pruned_loss=0.07789, over 4268494.02 frames. ], batch size: 194, lr: 3.33e-03, grad_scale: 16.0 2023-06-23 15:32:54,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1518042.0, ans=0.05 2023-06-23 15:33:14,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1518102.0, ans=0.1 2023-06-23 15:33:19,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1518102.0, ans=0.0 2023-06-23 15:33:46,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1518162.0, ans=0.2 2023-06-23 15:34:18,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1518282.0, ans=0.0 2023-06-23 15:34:23,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1518282.0, ans=0.125 2023-06-23 15:34:39,758 INFO [train.py:996] (2/4) Epoch 9, batch 9100, loss[loss=0.2267, simple_loss=0.3203, pruned_loss=0.06656, over 21642.00 frames. 
], tot_loss[loss=0.238, simple_loss=0.3167, pruned_loss=0.07968, over 4267251.72 frames. ], batch size: 263, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:34:42,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.536e+02 5.248e+02 7.167e+02 1.150e+03 2.223e+03, threshold=1.433e+03, percent-clipped=0.0 2023-06-23 15:34:48,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1518342.0, ans=0.125 2023-06-23 15:36:21,761 INFO [train.py:996] (2/4) Epoch 9, batch 9150, loss[loss=0.2195, simple_loss=0.3088, pruned_loss=0.06514, over 21638.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3186, pruned_loss=0.07727, over 4267864.45 frames. ], batch size: 263, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:36:44,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1518702.0, ans=0.1 2023-06-23 15:37:48,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1518882.0, ans=0.125 2023-06-23 15:37:57,516 INFO [train.py:996] (2/4) Epoch 9, batch 9200, loss[loss=0.3012, simple_loss=0.3655, pruned_loss=0.1185, over 21823.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.32, pruned_loss=0.07713, over 4263980.75 frames. ], batch size: 124, lr: 3.32e-03, grad_scale: 32.0 2023-06-23 15:38:01,465 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.429e+02 6.542e+02 9.064e+02 1.359e+03 2.938e+03, threshold=1.813e+03, percent-clipped=21.0 2023-06-23 15:38:44,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.59 vs. limit=15.0 2023-06-23 15:38:52,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1519062.0, ans=0.0 2023-06-23 15:38:53,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1519062.0, ans=0.125 2023-06-23 15:39:24,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1519182.0, ans=0.125 2023-06-23 15:39:33,614 INFO [train.py:996] (2/4) Epoch 9, batch 9250, loss[loss=0.201, simple_loss=0.27, pruned_loss=0.06595, over 21641.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3236, pruned_loss=0.08109, over 4267313.47 frames. ], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:40:58,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1519482.0, ans=0.0 2023-06-23 15:41:09,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1519482.0, ans=0.125 2023-06-23 15:41:16,017 INFO [train.py:996] (2/4) Epoch 9, batch 9300, loss[loss=0.3199, simple_loss=0.408, pruned_loss=0.1158, over 21478.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3183, pruned_loss=0.08119, over 4270612.57 frames. 
], batch size: 471, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:41:18,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1519542.0, ans=0.95 2023-06-23 15:41:20,602 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.633e+02 6.631e+02 9.639e+02 1.652e+03 4.303e+03, threshold=1.928e+03, percent-clipped=19.0 2023-06-23 15:41:44,333 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-23 15:42:46,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1519782.0, ans=0.0 2023-06-23 15:43:03,514 INFO [train.py:996] (2/4) Epoch 9, batch 9350, loss[loss=0.2806, simple_loss=0.3543, pruned_loss=0.1034, over 21796.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3243, pruned_loss=0.08215, over 4278412.23 frames. ], batch size: 441, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:43:19,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1519842.0, ans=0.2 2023-06-23 15:44:20,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1520022.0, ans=0.0 2023-06-23 15:44:50,876 INFO [train.py:996] (2/4) Epoch 9, batch 9400, loss[loss=0.222, simple_loss=0.2826, pruned_loss=0.08071, over 21545.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3236, pruned_loss=0.08253, over 4279734.60 frames. ], batch size: 263, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:44:57,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.099e+02 5.147e+02 6.319e+02 1.049e+03 2.062e+03, threshold=1.264e+03, percent-clipped=1.0 2023-06-23 15:46:06,051 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.63 vs. limit=15.0 2023-06-23 15:46:11,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1520382.0, ans=0.125 2023-06-23 15:46:13,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-23 15:46:30,673 INFO [train.py:996] (2/4) Epoch 9, batch 9450, loss[loss=0.271, simple_loss=0.4144, pruned_loss=0.06378, over 19811.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3165, pruned_loss=0.08149, over 4277558.51 frames. ], batch size: 702, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:46:55,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1520502.0, ans=0.0 2023-06-23 15:48:00,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1520682.0, ans=0.125 2023-06-23 15:48:02,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1520682.0, ans=0.09899494936611666 2023-06-23 15:48:06,753 INFO [train.py:996] (2/4) Epoch 9, batch 9500, loss[loss=0.2086, simple_loss=0.2904, pruned_loss=0.06345, over 21711.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3102, pruned_loss=0.08033, over 4262601.69 frames. 
], batch size: 332, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:48:13,335 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.451e+02 6.669e+02 1.059e+03 1.542e+03 2.765e+03, threshold=2.119e+03, percent-clipped=38.0 2023-06-23 15:48:13,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1520742.0, ans=0.95 2023-06-23 15:49:02,003 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=22.5 2023-06-23 15:49:08,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1520922.0, ans=0.05 2023-06-23 15:49:13,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1520922.0, ans=0.2 2023-06-23 15:49:35,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1520982.0, ans=0.125 2023-06-23 15:49:48,347 INFO [train.py:996] (2/4) Epoch 9, batch 9550, loss[loss=0.2695, simple_loss=0.3346, pruned_loss=0.1022, over 21547.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3145, pruned_loss=0.08325, over 4271111.46 frames. ], batch size: 263, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 15:49:48,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1521042.0, ans=0.0 2023-06-23 15:49:48,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1521042.0, ans=0.125 2023-06-23 15:50:17,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1521102.0, ans=0.125 2023-06-23 15:50:36,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1521162.0, ans=0.2 2023-06-23 15:51:28,275 INFO [train.py:996] (2/4) Epoch 9, batch 9600, loss[loss=0.2165, simple_loss=0.2861, pruned_loss=0.07348, over 21369.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3161, pruned_loss=0.08477, over 4275467.64 frames. ], batch size: 176, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:51:35,129 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.691e+02 5.650e+02 7.031e+02 8.940e+02 1.543e+03, threshold=1.406e+03, percent-clipped=0.0 2023-06-23 15:51:38,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1521342.0, ans=0.1 2023-06-23 15:52:07,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-23 15:52:10,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1521462.0, ans=0.0 2023-06-23 15:52:19,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1521462.0, ans=0.0 2023-06-23 15:52:33,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=12.0 2023-06-23 15:52:58,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.26 vs. 
limit=15.0 2023-06-23 15:53:10,504 INFO [train.py:996] (2/4) Epoch 9, batch 9650, loss[loss=0.2563, simple_loss=0.3334, pruned_loss=0.08961, over 21450.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3176, pruned_loss=0.08497, over 4280907.28 frames. ], batch size: 211, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:53:17,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1521642.0, ans=0.125 2023-06-23 15:53:27,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.66 vs. limit=15.0 2023-06-23 15:53:39,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.55 vs. limit=15.0 2023-06-23 15:54:21,535 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 15:54:23,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-23 15:54:33,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1521882.0, ans=0.5 2023-06-23 15:54:47,464 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.20 vs. limit=12.0 2023-06-23 15:54:51,596 INFO [train.py:996] (2/4) Epoch 9, batch 9700, loss[loss=0.241, simple_loss=0.3376, pruned_loss=0.07214, over 21748.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3187, pruned_loss=0.08476, over 4281309.78 frames. ], batch size: 351, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:55:02,828 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.621e+02 5.546e+02 7.387e+02 1.131e+03 2.841e+03, threshold=1.477e+03, percent-clipped=15.0 2023-06-23 15:55:13,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1521942.0, ans=0.0 2023-06-23 15:56:12,639 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=15.0 2023-06-23 15:56:32,928 INFO [train.py:996] (2/4) Epoch 9, batch 9750, loss[loss=0.2051, simple_loss=0.269, pruned_loss=0.07053, over 21551.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3114, pruned_loss=0.08296, over 4268082.50 frames. ], batch size: 391, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:57:18,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1522362.0, ans=0.0 2023-06-23 15:58:11,378 INFO [train.py:996] (2/4) Epoch 9, batch 9800, loss[loss=0.2231, simple_loss=0.2931, pruned_loss=0.0766, over 21822.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3113, pruned_loss=0.08316, over 4252506.98 frames. 
], batch size: 298, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 15:58:18,327 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.522e+02 5.907e+02 7.792e+02 1.093e+03 2.144e+03, threshold=1.558e+03, percent-clipped=9.0 2023-06-23 15:58:44,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1522602.0, ans=0.0 2023-06-23 15:58:56,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1522662.0, ans=0.1 2023-06-23 15:58:59,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1522662.0, ans=0.125 2023-06-23 15:59:40,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1522782.0, ans=0.125 2023-06-23 15:59:52,733 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=15.0 2023-06-23 15:59:53,350 INFO [train.py:996] (2/4) Epoch 9, batch 9850, loss[loss=0.2338, simple_loss=0.2967, pruned_loss=0.08549, over 21348.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3098, pruned_loss=0.08278, over 4252384.17 frames. ], batch size: 177, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:00:31,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=22.5 2023-06-23 16:01:27,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.47 vs. limit=12.0 2023-06-23 16:01:34,835 INFO [train.py:996] (2/4) Epoch 9, batch 9900, loss[loss=0.2368, simple_loss=0.3108, pruned_loss=0.0814, over 21750.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3065, pruned_loss=0.08253, over 4256904.54 frames. ], batch size: 333, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:01:45,572 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.718e+02 5.699e+02 7.870e+02 1.232e+03 3.104e+03, threshold=1.574e+03, percent-clipped=11.0 2023-06-23 16:02:20,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1523262.0, ans=10.0 2023-06-23 16:02:46,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1523322.0, ans=0.2 2023-06-23 16:03:15,833 INFO [train.py:996] (2/4) Epoch 9, batch 9950, loss[loss=0.2516, simple_loss=0.3023, pruned_loss=0.1004, over 21568.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3078, pruned_loss=0.08378, over 4238753.28 frames. ], batch size: 415, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:03:40,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1523502.0, ans=0.125 2023-06-23 16:04:58,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1523682.0, ans=0.1 2023-06-23 16:05:02,509 INFO [train.py:996] (2/4) Epoch 9, batch 10000, loss[loss=0.1846, simple_loss=0.2352, pruned_loss=0.06698, over 20861.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3024, pruned_loss=0.082, over 4247206.98 frames. 
], batch size: 613, lr: 3.32e-03, grad_scale: 32.0 2023-06-23 16:05:14,523 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.772e+02 5.460e+02 7.211e+02 1.053e+03 2.107e+03, threshold=1.442e+03, percent-clipped=5.0 2023-06-23 16:06:19,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1523922.0, ans=0.0 2023-06-23 16:06:50,041 INFO [train.py:996] (2/4) Epoch 9, batch 10050, loss[loss=0.1968, simple_loss=0.2744, pruned_loss=0.05964, over 21288.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3046, pruned_loss=0.08249, over 4252397.41 frames. ], batch size: 549, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:06:52,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1524042.0, ans=0.1 2023-06-23 16:07:00,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1524042.0, ans=0.035 2023-06-23 16:07:07,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1524102.0, ans=0.0 2023-06-23 16:07:57,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1524222.0, ans=0.05 2023-06-23 16:08:12,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1524282.0, ans=0.125 2023-06-23 16:08:20,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524282.0, ans=0.1 2023-06-23 16:08:23,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1524282.0, ans=0.2 2023-06-23 16:08:32,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1524342.0, ans=0.2 2023-06-23 16:08:33,194 INFO [train.py:996] (2/4) Epoch 9, batch 10100, loss[loss=0.2533, simple_loss=0.3265, pruned_loss=0.09001, over 21890.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.304, pruned_loss=0.08072, over 4259982.37 frames. ], batch size: 316, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:08:38,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1524342.0, ans=0.2 2023-06-23 16:08:41,438 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.724e+02 5.845e+02 8.901e+02 1.389e+03 2.930e+03, threshold=1.780e+03, percent-clipped=23.0 2023-06-23 16:10:04,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1524582.0, ans=0.125 2023-06-23 16:10:08,500 INFO [train.py:996] (2/4) Epoch 9, batch 10150, loss[loss=0.2188, simple_loss=0.2913, pruned_loss=0.07312, over 21252.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3108, pruned_loss=0.08365, over 4260146.64 frames. ], batch size: 159, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:10:09,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1524642.0, ans=0.2 2023-06-23 16:10:35,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1524702.0, ans=0.125 2023-06-23 16:11:04,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. 
limit=22.5 2023-06-23 16:11:37,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1524882.0, ans=0.125 2023-06-23 16:11:48,561 INFO [train.py:996] (2/4) Epoch 9, batch 10200, loss[loss=0.2349, simple_loss=0.3745, pruned_loss=0.04766, over 19771.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3088, pruned_loss=0.08098, over 4265680.97 frames. ], batch size: 702, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:12:03,216 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.213e+02 5.208e+02 7.016e+02 1.136e+03 3.363e+03, threshold=1.403e+03, percent-clipped=6.0 2023-06-23 16:12:06,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1524942.0, ans=0.1 2023-06-23 16:12:15,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1525002.0, ans=0.1 2023-06-23 16:12:34,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1525062.0, ans=0.125 2023-06-23 16:13:24,891 INFO [train.py:996] (2/4) Epoch 9, batch 10250, loss[loss=0.2388, simple_loss=0.3207, pruned_loss=0.07838, over 21580.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3057, pruned_loss=0.07657, over 4258037.60 frames. ], batch size: 414, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:13:37,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1525242.0, ans=0.125 2023-06-23 16:13:40,385 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:14:06,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1525302.0, ans=0.1 2023-06-23 16:14:15,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1525362.0, ans=0.0 2023-06-23 16:14:28,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1525362.0, ans=0.125 2023-06-23 16:14:58,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1525482.0, ans=0.125 2023-06-23 16:14:58,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.99 vs. limit=15.0 2023-06-23 16:15:14,132 INFO [train.py:996] (2/4) Epoch 9, batch 10300, loss[loss=0.2572, simple_loss=0.3314, pruned_loss=0.09147, over 21250.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3097, pruned_loss=0.07726, over 4260793.23 frames. ], batch size: 159, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:15:24,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.918e+02 5.852e+02 8.943e+02 1.203e+03 2.933e+03, threshold=1.789e+03, percent-clipped=17.0 2023-06-23 16:15:46,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-23 16:15:47,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1525602.0, ans=0.0 2023-06-23 16:16:56,863 INFO [train.py:996] (2/4) Epoch 9, batch 10350, loss[loss=0.2115, simple_loss=0.2892, pruned_loss=0.06686, over 21829.00 frames. 
], tot_loss[loss=0.2316, simple_loss=0.3091, pruned_loss=0.07708, over 4257815.03 frames. ], batch size: 317, lr: 3.32e-03, grad_scale: 8.0 2023-06-23 16:17:45,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.29 vs. limit=15.0 2023-06-23 16:17:51,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-23 16:17:54,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1525962.0, ans=0.125 2023-06-23 16:18:45,381 INFO [train.py:996] (2/4) Epoch 9, batch 10400, loss[loss=0.2008, simple_loss=0.2695, pruned_loss=0.06607, over 21672.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3043, pruned_loss=0.07642, over 4261625.57 frames. ], batch size: 263, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:18:55,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.672e+02 5.585e+02 9.781e+02 1.543e+03 3.065e+03, threshold=1.956e+03, percent-clipped=20.0 2023-06-23 16:19:16,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.24 vs. limit=12.0 2023-06-23 16:19:52,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1526322.0, ans=0.125 2023-06-23 16:20:19,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.93 vs. limit=22.5 2023-06-23 16:20:28,478 INFO [train.py:996] (2/4) Epoch 9, batch 10450, loss[loss=0.2941, simple_loss=0.3711, pruned_loss=0.1086, over 21663.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3064, pruned_loss=0.0782, over 4251707.23 frames. ], batch size: 414, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:20:58,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1526502.0, ans=0.0 2023-06-23 16:20:59,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1526502.0, ans=0.1 2023-06-23 16:21:39,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1526622.0, ans=0.0 2023-06-23 16:22:09,062 INFO [train.py:996] (2/4) Epoch 9, batch 10500, loss[loss=0.2389, simple_loss=0.3005, pruned_loss=0.0887, over 21439.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3071, pruned_loss=0.07712, over 4254243.54 frames. ], batch size: 389, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:22:23,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.525e+02 6.343e+02 8.149e+02 1.174e+03 2.736e+03, threshold=1.630e+03, percent-clipped=6.0 2023-06-23 16:23:53,899 INFO [train.py:996] (2/4) Epoch 9, batch 10550, loss[loss=0.2111, simple_loss=0.2795, pruned_loss=0.07131, over 21854.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3016, pruned_loss=0.07676, over 4250464.36 frames. ], batch size: 107, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:24:13,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.30 vs. 
limit=22.5 2023-06-23 16:24:46,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1527162.0, ans=0.0 2023-06-23 16:25:01,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1527222.0, ans=0.0 2023-06-23 16:25:35,487 INFO [train.py:996] (2/4) Epoch 9, batch 10600, loss[loss=0.2378, simple_loss=0.3004, pruned_loss=0.08758, over 20168.00 frames. ], tot_loss[loss=0.224, simple_loss=0.297, pruned_loss=0.07552, over 4253994.85 frames. ], batch size: 707, lr: 3.32e-03, grad_scale: 16.0 2023-06-23 16:25:36,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.72 vs. limit=5.0 2023-06-23 16:25:41,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1527342.0, ans=0.2 2023-06-23 16:25:50,373 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.855e+02 5.123e+02 6.754e+02 9.468e+02 2.113e+03, threshold=1.351e+03, percent-clipped=4.0 2023-06-23 16:25:57,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1527402.0, ans=0.2 2023-06-23 16:26:30,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1527462.0, ans=0.125 2023-06-23 16:26:37,523 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-23 16:26:40,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1527522.0, ans=0.0 2023-06-23 16:27:22,978 INFO [train.py:996] (2/4) Epoch 9, batch 10650, loss[loss=0.3081, simple_loss=0.4259, pruned_loss=0.09511, over 19798.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2986, pruned_loss=0.07488, over 4239132.62 frames. ], batch size: 702, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:27:26,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1527642.0, ans=0.125 2023-06-23 16:28:17,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1527822.0, ans=0.125 2023-06-23 16:28:19,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1527822.0, ans=0.1 2023-06-23 16:28:41,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1527882.0, ans=0.125 2023-06-23 16:28:46,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.06 vs. limit=15.0 2023-06-23 16:28:59,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-23 16:29:03,467 INFO [train.py:996] (2/4) Epoch 9, batch 10700, loss[loss=0.2332, simple_loss=0.3047, pruned_loss=0.08088, over 21317.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2964, pruned_loss=0.07392, over 4240219.96 frames. 
], batch size: 176, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:29:12,814 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.925e+02 6.514e+02 1.117e+03 1.445e+03 3.043e+03, threshold=2.235e+03, percent-clipped=29.0 2023-06-23 16:30:47,163 INFO [train.py:996] (2/4) Epoch 9, batch 10750, loss[loss=0.303, simple_loss=0.3894, pruned_loss=0.1083, over 21722.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3081, pruned_loss=0.07863, over 4247119.50 frames. ], batch size: 441, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:31:44,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1528362.0, ans=0.125 2023-06-23 16:32:33,879 INFO [train.py:996] (2/4) Epoch 9, batch 10800, loss[loss=0.2506, simple_loss=0.3237, pruned_loss=0.08877, over 21493.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3128, pruned_loss=0.07943, over 4251256.80 frames. ], batch size: 211, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 16:32:43,299 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.788e+02 5.066e+02 7.349e+02 1.067e+03 2.269e+03, threshold=1.470e+03, percent-clipped=1.0 2023-06-23 16:33:53,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1528782.0, ans=0.125 2023-06-23 16:34:14,698 INFO [train.py:996] (2/4) Epoch 9, batch 10850, loss[loss=0.2281, simple_loss=0.2893, pruned_loss=0.08347, over 21199.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3129, pruned_loss=0.07987, over 4256005.06 frames. ], batch size: 159, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:34:44,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1528902.0, ans=0.5 2023-06-23 16:35:23,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1529022.0, ans=0.125 2023-06-23 16:35:56,323 INFO [train.py:996] (2/4) Epoch 9, batch 10900, loss[loss=0.2034, simple_loss=0.289, pruned_loss=0.05892, over 21381.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.306, pruned_loss=0.07749, over 4252400.71 frames. ], batch size: 194, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:36:12,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.519e+02 5.159e+02 7.524e+02 1.150e+03 2.135e+03, threshold=1.505e+03, percent-clipped=11.0 2023-06-23 16:36:21,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1529202.0, ans=0.0 2023-06-23 16:36:22,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1529202.0, ans=0.125 2023-06-23 16:36:22,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1529202.0, ans=0.05 2023-06-23 16:36:58,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1529322.0, ans=0.09899494936611666 2023-06-23 16:37:05,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1529322.0, ans=0.015 2023-06-23 16:37:22,300 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:37:36,168 INFO [train.py:996] (2/4) Epoch 9, batch 10950, loss[loss=0.2378, simple_loss=0.3146, pruned_loss=0.08053, over 20682.00 frames. 
], tot_loss[loss=0.2275, simple_loss=0.3027, pruned_loss=0.07613, over 4261864.78 frames. ], batch size: 607, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:37:49,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1529442.0, ans=0.125 2023-06-23 16:38:21,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1529562.0, ans=0.0 2023-06-23 16:38:35,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1529562.0, ans=0.0 2023-06-23 16:38:35,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1529562.0, ans=0.0 2023-06-23 16:38:55,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1529622.0, ans=0.5 2023-06-23 16:39:12,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1529682.0, ans=0.125 2023-06-23 16:39:13,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1529682.0, ans=0.0 2023-06-23 16:39:16,266 INFO [train.py:996] (2/4) Epoch 9, batch 11000, loss[loss=0.2364, simple_loss=0.3036, pruned_loss=0.08457, over 21499.00 frames. ], tot_loss[loss=0.228, simple_loss=0.302, pruned_loss=0.07693, over 4268559.14 frames. ], batch size: 212, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:39:16,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1529742.0, ans=0.5 2023-06-23 16:39:19,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1529742.0, ans=0.0 2023-06-23 16:39:32,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.640e+02 5.350e+02 8.050e+02 1.212e+03 3.028e+03, threshold=1.610e+03, percent-clipped=11.0 2023-06-23 16:40:05,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1529862.0, ans=0.125 2023-06-23 16:40:54,200 INFO [train.py:996] (2/4) Epoch 9, batch 11050, loss[loss=0.2276, simple_loss=0.2792, pruned_loss=0.08802, over 21576.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3001, pruned_loss=0.07901, over 4273559.04 frames. ], batch size: 414, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:41:30,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1530102.0, ans=0.125 2023-06-23 16:41:38,903 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-23 16:42:38,852 INFO [train.py:996] (2/4) Epoch 9, batch 11100, loss[loss=0.2324, simple_loss=0.291, pruned_loss=0.08689, over 21551.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.2974, pruned_loss=0.07841, over 4266529.42 frames. 
], batch size: 441, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:42:45,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1530342.0, ans=10.0 2023-06-23 16:42:50,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1530342.0, ans=10.0 2023-06-23 16:42:51,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.448e+02 5.087e+02 6.616e+02 8.877e+02 2.244e+03, threshold=1.323e+03, percent-clipped=5.0 2023-06-23 16:43:26,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1530462.0, ans=0.1 2023-06-23 16:44:18,614 INFO [train.py:996] (2/4) Epoch 9, batch 11150, loss[loss=0.2122, simple_loss=0.2829, pruned_loss=0.07078, over 21846.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2958, pruned_loss=0.0779, over 4259054.83 frames. ], batch size: 107, lr: 3.31e-03, grad_scale: 8.0 2023-06-23 16:45:02,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1530762.0, ans=0.125 2023-06-23 16:45:03,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1530762.0, ans=0.0 2023-06-23 16:45:23,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1530822.0, ans=0.0 2023-06-23 16:45:40,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.78 vs. limit=10.0 2023-06-23 16:45:49,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1530882.0, ans=0.2 2023-06-23 16:45:58,170 INFO [train.py:996] (2/4) Epoch 9, batch 11200, loss[loss=0.2129, simple_loss=0.2832, pruned_loss=0.07129, over 21699.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.295, pruned_loss=0.07736, over 4262953.88 frames. ], batch size: 282, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:46:06,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1530942.0, ans=0.5 2023-06-23 16:46:10,776 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.780e+02 5.536e+02 7.546e+02 1.213e+03 2.221e+03, threshold=1.509e+03, percent-clipped=19.0 2023-06-23 16:46:37,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1531062.0, ans=0.1 2023-06-23 16:46:50,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1531122.0, ans=0.035 2023-06-23 16:47:32,747 INFO [train.py:996] (2/4) Epoch 9, batch 11250, loss[loss=0.2756, simple_loss=0.339, pruned_loss=0.1061, over 21907.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2947, pruned_loss=0.07767, over 4263272.78 frames. 
], batch size: 107, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:47:48,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1531242.0, ans=0.1 2023-06-23 16:48:44,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1531422.0, ans=0.0 2023-06-23 16:49:11,436 INFO [train.py:996] (2/4) Epoch 9, batch 11300, loss[loss=0.2375, simple_loss=0.309, pruned_loss=0.08297, over 21759.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.297, pruned_loss=0.07818, over 4275425.44 frames. ], batch size: 389, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:49:28,319 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 5.299e+02 7.050e+02 1.034e+03 1.810e+03, threshold=1.410e+03, percent-clipped=1.0 2023-06-23 16:49:49,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1531602.0, ans=0.125 2023-06-23 16:50:22,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.26 vs. limit=15.0 2023-06-23 16:50:32,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1531722.0, ans=0.2 2023-06-23 16:50:36,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.23 vs. limit=10.0 2023-06-23 16:50:39,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1531782.0, ans=0.1 2023-06-23 16:50:56,817 INFO [train.py:996] (2/4) Epoch 9, batch 11350, loss[loss=0.2565, simple_loss=0.3356, pruned_loss=0.08874, over 21845.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3001, pruned_loss=0.07763, over 4279165.31 frames. ], batch size: 118, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:51:25,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1531902.0, ans=0.2 2023-06-23 16:52:02,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5 2023-06-23 16:52:03,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1532022.0, ans=0.02 2023-06-23 16:52:04,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=1532022.0, ans=0.05 2023-06-23 16:52:39,048 INFO [train.py:996] (2/4) Epoch 9, batch 11400, loss[loss=0.2165, simple_loss=0.3015, pruned_loss=0.06574, over 20659.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3063, pruned_loss=0.08021, over 4271769.03 frames. 
], batch size: 607, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:52:40,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1532142.0, ans=0.125 2023-06-23 16:52:52,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1532142.0, ans=0.035 2023-06-23 16:52:56,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.822e+02 6.591e+02 8.859e+02 1.390e+03 3.018e+03, threshold=1.772e+03, percent-clipped=23.0 2023-06-23 16:53:07,230 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 16:53:16,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1532262.0, ans=0.0 2023-06-23 16:54:09,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1532382.0, ans=0.0 2023-06-23 16:54:20,200 INFO [train.py:996] (2/4) Epoch 9, batch 11450, loss[loss=0.2401, simple_loss=0.3059, pruned_loss=0.0871, over 21443.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3065, pruned_loss=0.07904, over 4264941.22 frames. ], batch size: 131, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:54:20,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1532442.0, ans=0.2 2023-06-23 16:54:33,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1532442.0, ans=0.125 2023-06-23 16:54:45,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1532502.0, ans=0.0 2023-06-23 16:54:47,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1532502.0, ans=0.2 2023-06-23 16:55:45,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1532682.0, ans=0.0 2023-06-23 16:56:02,576 INFO [train.py:996] (2/4) Epoch 9, batch 11500, loss[loss=0.2083, simple_loss=0.3037, pruned_loss=0.05651, over 21737.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3096, pruned_loss=0.07968, over 4270032.88 frames. ], batch size: 298, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:56:19,883 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.487e+02 5.481e+02 7.380e+02 1.202e+03 2.850e+03, threshold=1.476e+03, percent-clipped=9.0 2023-06-23 16:56:23,714 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1532802.0, ans=0.0 2023-06-23 16:56:24,230 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.53 vs. limit=15.0 2023-06-23 16:56:24,442 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.83 vs. 
limit=10.0 2023-06-23 16:56:28,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1532802.0, ans=0.1 2023-06-23 16:57:00,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1532862.0, ans=0.0 2023-06-23 16:57:26,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1532982.0, ans=0.0 2023-06-23 16:57:33,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1532982.0, ans=0.2 2023-06-23 16:57:49,152 INFO [train.py:996] (2/4) Epoch 9, batch 11550, loss[loss=0.1776, simple_loss=0.235, pruned_loss=0.06008, over 16607.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3167, pruned_loss=0.08033, over 4263233.68 frames. ], batch size: 61, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 16:58:01,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1533042.0, ans=0.125 2023-06-23 16:58:02,481 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-23 16:58:11,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-23 16:58:25,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1533102.0, ans=0.025 2023-06-23 16:58:56,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1533222.0, ans=0.125 2023-06-23 16:59:10,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1533222.0, ans=0.0 2023-06-23 16:59:31,703 INFO [train.py:996] (2/4) Epoch 9, batch 11600, loss[loss=0.242, simple_loss=0.337, pruned_loss=0.07346, over 21393.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3359, pruned_loss=0.08408, over 4269895.53 frames. ], batch size: 194, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 16:59:50,456 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.192e+02 7.053e+02 9.279e+02 1.499e+03 3.190e+03, threshold=1.856e+03, percent-clipped=25.0 2023-06-23 16:59:55,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1533402.0, ans=0.1 2023-06-23 16:59:55,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1533402.0, ans=0.1 2023-06-23 16:59:59,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1533402.0, ans=0.125 2023-06-23 17:00:02,967 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.52 vs. limit=6.0 2023-06-23 17:00:18,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1533462.0, ans=0.0 2023-06-23 17:01:12,928 INFO [train.py:996] (2/4) Epoch 9, batch 11650, loss[loss=0.2454, simple_loss=0.3365, pruned_loss=0.07717, over 21760.00 frames. 
], tot_loss[loss=0.2536, simple_loss=0.3396, pruned_loss=0.08377, over 4275556.02 frames. ], batch size: 351, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:01:44,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1533702.0, ans=0.1 2023-06-23 17:02:20,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1533822.0, ans=0.125 2023-06-23 17:02:41,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1533882.0, ans=0.1 2023-06-23 17:02:52,842 INFO [train.py:996] (2/4) Epoch 9, batch 11700, loss[loss=0.2229, simple_loss=0.2844, pruned_loss=0.08067, over 21383.00 frames. ], tot_loss[loss=0.2479, simple_loss=0.3292, pruned_loss=0.08334, over 4265034.51 frames. ], batch size: 389, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:02:53,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1533942.0, ans=0.05 2023-06-23 17:03:07,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1533942.0, ans=0.125 2023-06-23 17:03:10,680 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.713e+02 7.514e+02 1.058e+03 1.633e+03 4.255e+03, threshold=2.116e+03, percent-clipped=16.0 2023-06-23 17:03:19,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1534002.0, ans=0.07 2023-06-23 17:03:22,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.85 vs. limit=22.5 2023-06-23 17:04:07,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1534122.0, ans=0.125 2023-06-23 17:04:09,911 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-23 17:04:14,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1534182.0, ans=0.1 2023-06-23 17:04:31,756 INFO [train.py:996] (2/4) Epoch 9, batch 11750, loss[loss=0.2243, simple_loss=0.2773, pruned_loss=0.08563, over 21392.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3211, pruned_loss=0.08313, over 4264747.92 frames. ], batch size: 144, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:05:19,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1534362.0, ans=0.05 2023-06-23 17:05:50,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1534482.0, ans=0.0 2023-06-23 17:05:55,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1534482.0, ans=0.95 2023-06-23 17:06:17,854 INFO [train.py:996] (2/4) Epoch 9, batch 11800, loss[loss=0.2314, simple_loss=0.3374, pruned_loss=0.06269, over 19867.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3223, pruned_loss=0.08437, over 4265311.71 frames. 
], batch size: 704, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:06:32,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.688e+02 5.572e+02 8.368e+02 1.434e+03 3.192e+03, threshold=1.674e+03, percent-clipped=11.0 2023-06-23 17:06:55,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1534662.0, ans=0.125 2023-06-23 17:07:45,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1534782.0, ans=0.1 2023-06-23 17:07:58,027 INFO [train.py:996] (2/4) Epoch 9, batch 11850, loss[loss=0.2538, simple_loss=0.3395, pruned_loss=0.08407, over 21888.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.323, pruned_loss=0.08301, over 4272485.47 frames. ], batch size: 316, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:09:39,290 INFO [train.py:996] (2/4) Epoch 9, batch 11900, loss[loss=0.2562, simple_loss=0.3209, pruned_loss=0.09576, over 21811.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3232, pruned_loss=0.08122, over 4271967.04 frames. ], batch size: 102, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:09:59,003 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.827e+02 5.472e+02 7.234e+02 9.480e+02 2.463e+03, threshold=1.447e+03, percent-clipped=1.0 2023-06-23 17:09:59,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1535202.0, ans=0.0 2023-06-23 17:10:21,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1535262.0, ans=0.125 2023-06-23 17:10:31,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-23 17:11:02,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1535322.0, ans=0.1 2023-06-23 17:11:26,117 INFO [train.py:996] (2/4) Epoch 9, batch 11950, loss[loss=0.2084, simple_loss=0.306, pruned_loss=0.05544, over 21653.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3222, pruned_loss=0.07809, over 4269021.83 frames. ], batch size: 389, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:12:43,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.60 vs. limit=5.0 2023-06-23 17:12:47,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1535682.0, ans=0.125 2023-06-23 17:12:50,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1535682.0, ans=0.125 2023-06-23 17:13:03,643 INFO [train.py:996] (2/4) Epoch 9, batch 12000, loss[loss=0.2016, simple_loss=0.2687, pruned_loss=0.06724, over 21609.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3161, pruned_loss=0.07583, over 4273421.81 frames. ], batch size: 247, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:13:03,643 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 17:13:24,470 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2567, simple_loss=0.3528, pruned_loss=0.08029, over 1796401.00 frames. 
2023-06-23 17:13:24,470 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 17:13:38,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.383e+02 5.788e+02 7.844e+02 1.305e+03 3.845e+03, threshold=1.569e+03, percent-clipped=19.0 2023-06-23 17:13:42,125 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:13:44,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.82 vs. limit=15.0 2023-06-23 17:14:14,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1535862.0, ans=0.0 2023-06-23 17:14:38,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-23 17:15:03,715 INFO [train.py:996] (2/4) Epoch 9, batch 12050, loss[loss=0.2789, simple_loss=0.3286, pruned_loss=0.1145, over 21337.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.313, pruned_loss=0.07796, over 4275330.37 frames. ], batch size: 143, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:15:07,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1536042.0, ans=0.2 2023-06-23 17:15:24,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1536102.0, ans=0.0 2023-06-23 17:15:40,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1536162.0, ans=0.0 2023-06-23 17:16:01,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1536222.0, ans=0.125 2023-06-23 17:16:45,350 INFO [train.py:996] (2/4) Epoch 9, batch 12100, loss[loss=0.2924, simple_loss=0.4241, pruned_loss=0.08039, over 20803.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3189, pruned_loss=0.08229, over 4283663.45 frames. ], batch size: 607, lr: 3.31e-03, grad_scale: 32.0 2023-06-23 17:16:48,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.18 vs. limit=15.0 2023-06-23 17:17:01,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 6.749e+02 9.796e+02 1.461e+03 3.096e+03, threshold=1.959e+03, percent-clipped=20.0 2023-06-23 17:17:01,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1536402.0, ans=0.125 2023-06-23 17:17:13,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1536402.0, ans=0.0 2023-06-23 17:17:37,716 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:18:06,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-23 17:18:31,913 INFO [train.py:996] (2/4) Epoch 9, batch 12150, loss[loss=0.1838, simple_loss=0.2368, pruned_loss=0.06544, over 20711.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3232, pruned_loss=0.08113, over 4273157.67 frames. 
], batch size: 609, lr: 3.31e-03, grad_scale: 16.0 2023-06-23 17:18:40,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1536642.0, ans=0.0 2023-06-23 17:18:41,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1536642.0, ans=0.125 2023-06-23 17:19:11,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1536702.0, ans=0.125 2023-06-23 17:19:35,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1536762.0, ans=0.95 2023-06-23 17:19:54,336 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:20:11,278 INFO [train.py:996] (2/4) Epoch 9, batch 12200, loss[loss=0.2081, simple_loss=0.2625, pruned_loss=0.07682, over 21330.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3174, pruned_loss=0.08029, over 4271113.78 frames. ], batch size: 160, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:20:32,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.485e+02 6.911e+02 1.120e+03 1.509e+03 3.105e+03, threshold=2.240e+03, percent-clipped=12.0 2023-06-23 17:21:29,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1537182.0, ans=0.125 2023-06-23 17:21:45,636 INFO [train.py:996] (2/4) Epoch 9, batch 12250, loss[loss=0.1834, simple_loss=0.2685, pruned_loss=0.04909, over 21648.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3091, pruned_loss=0.07734, over 4265146.84 frames. ], batch size: 414, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:21:46,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1537242.0, ans=0.125 2023-06-23 17:23:10,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.39 vs. limit=15.0 2023-06-23 17:23:24,449 INFO [train.py:996] (2/4) Epoch 9, batch 12300, loss[loss=0.1902, simple_loss=0.2702, pruned_loss=0.05509, over 21305.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3, pruned_loss=0.07124, over 4253210.81 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:23:45,601 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.509e+02 5.150e+02 7.519e+02 1.212e+03 3.138e+03, threshold=1.504e+03, percent-clipped=3.0 2023-06-23 17:24:19,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.71 vs. limit=15.0 2023-06-23 17:24:24,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1537662.0, ans=0.95 2023-06-23 17:24:26,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1537722.0, ans=0.09899494936611666 2023-06-23 17:24:56,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1537782.0, ans=0.125 2023-06-23 17:25:02,687 INFO [train.py:996] (2/4) Epoch 9, batch 12350, loss[loss=0.2157, simple_loss=0.2966, pruned_loss=0.06737, over 21653.00 frames. 
], tot_loss[loss=0.2242, simple_loss=0.3049, pruned_loss=0.07174, over 4252810.89 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:25:41,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1537902.0, ans=0.125 2023-06-23 17:26:12,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1538022.0, ans=0.0 2023-06-23 17:26:40,774 INFO [train.py:996] (2/4) Epoch 9, batch 12400, loss[loss=0.262, simple_loss=0.313, pruned_loss=0.1055, over 21547.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3073, pruned_loss=0.07572, over 4264254.49 frames. ], batch size: 194, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:27:01,902 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.603e+02 7.484e+02 1.004e+03 2.626e+03, threshold=1.497e+03, percent-clipped=10.0 2023-06-23 17:27:13,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1538202.0, ans=0.0 2023-06-23 17:27:56,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.71 vs. limit=15.0 2023-06-23 17:28:25,782 INFO [train.py:996] (2/4) Epoch 9, batch 12450, loss[loss=0.2784, simple_loss=0.3383, pruned_loss=0.1093, over 21394.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3121, pruned_loss=0.07974, over 4273803.11 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:29:11,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1538562.0, ans=0.125 2023-06-23 17:30:11,447 INFO [train.py:996] (2/4) Epoch 9, batch 12500, loss[loss=0.2445, simple_loss=0.3378, pruned_loss=0.0756, over 21603.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3211, pruned_loss=0.08277, over 4266310.70 frames. ], batch size: 230, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:30:33,994 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.008e+02 5.952e+02 7.744e+02 1.112e+03 2.842e+03, threshold=1.549e+03, percent-clipped=7.0 2023-06-23 17:30:54,172 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=12.0 2023-06-23 17:31:22,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1538922.0, ans=0.125 2023-06-23 17:31:58,470 INFO [train.py:996] (2/4) Epoch 9, batch 12550, loss[loss=0.2838, simple_loss=0.3482, pruned_loss=0.1097, over 21678.00 frames. ], tot_loss[loss=0.2483, simple_loss=0.327, pruned_loss=0.0848, over 4265826.72 frames. ], batch size: 351, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:31:59,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-23 17:32:09,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1539042.0, ans=0.5 2023-06-23 17:32:11,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1539042.0, ans=0.0 2023-06-23 17:32:27,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.52 vs. 
limit=10.0 2023-06-23 17:32:55,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.69 vs. limit=10.0 2023-06-23 17:33:45,028 INFO [train.py:996] (2/4) Epoch 9, batch 12600, loss[loss=0.206, simple_loss=0.2823, pruned_loss=0.06478, over 21380.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3246, pruned_loss=0.08292, over 4258108.23 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:34:03,202 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.580e+02 5.911e+02 8.328e+02 1.277e+03 2.400e+03, threshold=1.666e+03, percent-clipped=14.0 2023-06-23 17:34:05,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-23 17:34:29,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1539462.0, ans=0.2 2023-06-23 17:35:23,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1539642.0, ans=0.125 2023-06-23 17:35:25,015 INFO [train.py:996] (2/4) Epoch 9, batch 12650, loss[loss=0.2243, simple_loss=0.2869, pruned_loss=0.08088, over 21494.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3165, pruned_loss=0.07907, over 4260698.73 frames. ], batch size: 194, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:35:48,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1539702.0, ans=0.1 2023-06-23 17:35:50,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=15.0 2023-06-23 17:37:05,721 INFO [train.py:996] (2/4) Epoch 9, batch 12700, loss[loss=0.3264, simple_loss=0.3831, pruned_loss=0.1348, over 21354.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3181, pruned_loss=0.08227, over 4265074.25 frames. ], batch size: 507, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:37:22,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1540002.0, ans=0.1 2023-06-23 17:37:23,855 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.729e+02 5.436e+02 7.219e+02 1.107e+03 2.161e+03, threshold=1.444e+03, percent-clipped=5.0 2023-06-23 17:37:42,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1540062.0, ans=0.125 2023-06-23 17:38:11,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1540122.0, ans=0.1 2023-06-23 17:38:46,087 INFO [train.py:996] (2/4) Epoch 9, batch 12750, loss[loss=0.241, simple_loss=0.3122, pruned_loss=0.0849, over 21424.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3181, pruned_loss=0.08211, over 4271147.25 frames. ], batch size: 131, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:39:04,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1540302.0, ans=0.04949747468305833 2023-06-23 17:39:04,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.50 vs. 
limit=22.5 2023-06-23 17:39:49,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1540422.0, ans=0.125 2023-06-23 17:40:09,201 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:40:26,701 INFO [train.py:996] (2/4) Epoch 9, batch 12800, loss[loss=0.2612, simple_loss=0.3323, pruned_loss=0.09505, over 21763.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3178, pruned_loss=0.0827, over 4273282.87 frames. ], batch size: 414, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:40:30,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1540542.0, ans=0.125 2023-06-23 17:40:51,641 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.701e+02 5.271e+02 6.331e+02 9.056e+02 1.664e+03, threshold=1.266e+03, percent-clipped=3.0 2023-06-23 17:41:05,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1540602.0, ans=0.04949747468305833 2023-06-23 17:42:00,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1540782.0, ans=0.125 2023-06-23 17:42:08,021 INFO [train.py:996] (2/4) Epoch 9, batch 12850, loss[loss=0.2075, simple_loss=0.2902, pruned_loss=0.06242, over 21286.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3206, pruned_loss=0.08428, over 4272901.12 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:42:30,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1540902.0, ans=0.0 2023-06-23 17:43:41,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-23 17:43:54,698 INFO [train.py:996] (2/4) Epoch 9, batch 12900, loss[loss=0.2048, simple_loss=0.2761, pruned_loss=0.06673, over 21320.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3188, pruned_loss=0.08123, over 4271200.23 frames. ], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:44:00,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-23 17:44:24,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.381e+02 5.349e+02 7.787e+02 1.135e+03 3.186e+03, threshold=1.557e+03, percent-clipped=18.0 2023-06-23 17:44:42,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-23 17:44:53,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.81 vs. limit=22.5 2023-06-23 17:44:58,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.36 vs. 
limit=12.0 2023-06-23 17:44:59,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1541322.0, ans=0.125 2023-06-23 17:45:01,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten.whitening_limit, batch_count=1541322.0, ans=15.0 2023-06-23 17:45:41,756 INFO [train.py:996] (2/4) Epoch 9, batch 12950, loss[loss=0.2645, simple_loss=0.3811, pruned_loss=0.07396, over 19767.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3175, pruned_loss=0.07953, over 4271173.44 frames. ], batch size: 703, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:46:03,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1541502.0, ans=0.125 2023-06-23 17:46:29,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1541562.0, ans=0.125 2023-06-23 17:46:29,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1541562.0, ans=0.5 2023-06-23 17:46:31,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1541562.0, ans=0.1 2023-06-23 17:46:59,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1541622.0, ans=0.1 2023-06-23 17:47:27,859 INFO [train.py:996] (2/4) Epoch 9, batch 13000, loss[loss=0.1878, simple_loss=0.2659, pruned_loss=0.05487, over 21716.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3181, pruned_loss=0.0804, over 4274044.48 frames. ], batch size: 124, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:47:46,459 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.593e+02 5.869e+02 8.632e+02 1.298e+03 2.714e+03, threshold=1.726e+03, percent-clipped=15.0 2023-06-23 17:48:01,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1541862.0, ans=0.0 2023-06-23 17:48:10,577 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 17:48:20,616 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-23 17:48:29,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1541922.0, ans=0.125 2023-06-23 17:48:46,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1541982.0, ans=0.1 2023-06-23 17:49:01,566 INFO [train.py:996] (2/4) Epoch 9, batch 13050, loss[loss=0.2205, simple_loss=0.2903, pruned_loss=0.07535, over 21584.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.314, pruned_loss=0.07881, over 4278667.78 frames. 
], batch size: 195, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:49:16,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1542042.0, ans=0.1 2023-06-23 17:49:48,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1542162.0, ans=0.125 2023-06-23 17:49:54,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1542162.0, ans=0.125 2023-06-23 17:50:08,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1542222.0, ans=0.125 2023-06-23 17:50:46,359 INFO [train.py:996] (2/4) Epoch 9, batch 13100, loss[loss=0.2979, simple_loss=0.3617, pruned_loss=0.117, over 21246.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3148, pruned_loss=0.07915, over 4286231.76 frames. ], batch size: 143, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:50:50,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1542342.0, ans=0.125 2023-06-23 17:51:06,460 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 5.747e+02 7.827e+02 1.039e+03 1.771e+03, threshold=1.565e+03, percent-clipped=1.0 2023-06-23 17:51:28,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1542462.0, ans=0.125 2023-06-23 17:51:38,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1542462.0, ans=0.2 2023-06-23 17:51:51,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1542522.0, ans=0.0 2023-06-23 17:52:08,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1542582.0, ans=0.125 2023-06-23 17:52:28,863 INFO [train.py:996] (2/4) Epoch 9, batch 13150, loss[loss=0.2069, simple_loss=0.2787, pruned_loss=0.06755, over 21376.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3181, pruned_loss=0.08211, over 4286287.16 frames. ], batch size: 211, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:52:29,891 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-23 17:53:53,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1542882.0, ans=0.0 2023-06-23 17:54:04,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1542882.0, ans=0.2 2023-06-23 17:54:10,297 INFO [train.py:996] (2/4) Epoch 9, batch 13200, loss[loss=0.237, simple_loss=0.3113, pruned_loss=0.08135, over 21508.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3151, pruned_loss=0.08126, over 4279042.63 frames. ], batch size: 194, lr: 3.30e-03, grad_scale: 32.0 2023-06-23 17:54:33,688 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.479e+02 5.951e+02 7.570e+02 1.042e+03 3.191e+03, threshold=1.514e+03, percent-clipped=13.0 2023-06-23 17:55:49,999 INFO [train.py:996] (2/4) Epoch 9, batch 13250, loss[loss=0.2348, simple_loss=0.3016, pruned_loss=0.08403, over 21294.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3149, pruned_loss=0.08295, over 4290473.45 frames. 
], batch size: 176, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:56:16,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1543302.0, ans=0.0 2023-06-23 17:57:09,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1543422.0, ans=0.0 2023-06-23 17:57:22,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1543482.0, ans=0.125 2023-06-23 17:57:36,339 INFO [train.py:996] (2/4) Epoch 9, batch 13300, loss[loss=0.219, simple_loss=0.338, pruned_loss=0.04998, over 21255.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3173, pruned_loss=0.08225, over 4287304.25 frames. ], batch size: 548, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:58:08,189 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.720e+02 5.402e+02 7.318e+02 1.029e+03 1.964e+03, threshold=1.464e+03, percent-clipped=5.0 2023-06-23 17:58:10,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1543602.0, ans=0.0 2023-06-23 17:58:24,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1543662.0, ans=0.125 2023-06-23 17:59:18,316 INFO [train.py:996] (2/4) Epoch 9, batch 13350, loss[loss=0.252, simple_loss=0.3291, pruned_loss=0.08741, over 21807.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3229, pruned_loss=0.08492, over 4286896.56 frames. ], batch size: 118, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 17:59:21,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1543842.0, ans=0.1 2023-06-23 17:59:50,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1543902.0, ans=0.0 2023-06-23 17:59:59,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1543902.0, ans=0.0 2023-06-23 18:00:01,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.whiten.whitening_limit, batch_count=1543902.0, ans=15.0 2023-06-23 18:00:04,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=15.0 2023-06-23 18:00:57,191 INFO [train.py:996] (2/4) Epoch 9, batch 13400, loss[loss=0.3001, simple_loss=0.3574, pruned_loss=0.1215, over 21535.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3232, pruned_loss=0.08639, over 4282685.80 frames. ], batch size: 471, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:01:30,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1544202.0, ans=0.1 2023-06-23 18:01:32,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1544202.0, ans=0.2 2023-06-23 18:01:33,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.18 vs. 
limit=15.0 2023-06-23 18:01:35,260 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.876e+02 6.088e+02 8.910e+02 1.105e+03 2.382e+03, threshold=1.782e+03, percent-clipped=11.0 2023-06-23 18:01:40,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1544202.0, ans=0.125 2023-06-23 18:02:17,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1544322.0, ans=0.0 2023-06-23 18:02:50,788 INFO [train.py:996] (2/4) Epoch 9, batch 13450, loss[loss=0.2261, simple_loss=0.3032, pruned_loss=0.07451, over 21637.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3237, pruned_loss=0.08861, over 4277535.33 frames. ], batch size: 415, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:03:22,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1544502.0, ans=0.0 2023-06-23 18:03:25,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1544502.0, ans=0.125 2023-06-23 18:03:41,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1544562.0, ans=0.125 2023-06-23 18:04:30,758 INFO [train.py:996] (2/4) Epoch 9, batch 13500, loss[loss=0.2356, simple_loss=0.3066, pruned_loss=0.08227, over 21732.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.318, pruned_loss=0.08638, over 4270490.63 frames. ], batch size: 298, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:04:33,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1544742.0, ans=0.1 2023-06-23 18:04:54,273 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.791e+02 5.275e+02 7.519e+02 1.324e+03 2.778e+03, threshold=1.504e+03, percent-clipped=14.0 2023-06-23 18:05:19,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1544862.0, ans=0.125 2023-06-23 18:05:29,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1544922.0, ans=0.125 2023-06-23 18:05:57,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1544982.0, ans=0.1 2023-06-23 18:06:13,518 INFO [train.py:996] (2/4) Epoch 9, batch 13550, loss[loss=0.2429, simple_loss=0.3296, pruned_loss=0.0781, over 20968.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3211, pruned_loss=0.08527, over 4264989.10 frames. ], batch size: 607, lr: 3.30e-03, grad_scale: 8.0 2023-06-23 18:06:54,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1545162.0, ans=0.0 2023-06-23 18:07:24,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1545222.0, ans=0.1 2023-06-23 18:07:55,219 INFO [train.py:996] (2/4) Epoch 9, batch 13600, loss[loss=0.2535, simple_loss=0.3288, pruned_loss=0.08906, over 21756.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3225, pruned_loss=0.08531, over 4270994.13 frames. 
], batch size: 112, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:07:55,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1545342.0, ans=0.125 2023-06-23 18:08:18,101 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.790e+02 6.429e+02 9.165e+02 1.553e+03 3.162e+03, threshold=1.833e+03, percent-clipped=25.0 2023-06-23 18:08:31,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-23 18:08:59,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.13 vs. limit=22.5 2023-06-23 18:09:00,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1545522.0, ans=0.125 2023-06-23 18:09:05,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1545522.0, ans=0.125 2023-06-23 18:09:30,311 INFO [train.py:996] (2/4) Epoch 9, batch 13650, loss[loss=0.2133, simple_loss=0.2791, pruned_loss=0.07377, over 16221.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3169, pruned_loss=0.08272, over 4267295.47 frames. ], batch size: 66, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:09:30,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1545642.0, ans=0.1 2023-06-23 18:09:30,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1545642.0, ans=0.0 2023-06-23 18:10:02,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1545702.0, ans=0.125 2023-06-23 18:10:04,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1545702.0, ans=0.125 2023-06-23 18:11:13,769 INFO [train.py:996] (2/4) Epoch 9, batch 13700, loss[loss=0.1764, simple_loss=0.2464, pruned_loss=0.05322, over 21729.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3123, pruned_loss=0.08137, over 4263410.16 frames. ], batch size: 112, lr: 3.30e-03, grad_scale: 16.0 2023-06-23 18:11:24,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1545942.0, ans=0.0 2023-06-23 18:11:41,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.658e+02 5.677e+02 7.972e+02 1.070e+03 2.613e+03, threshold=1.594e+03, percent-clipped=4.0 2023-06-23 18:12:51,915 INFO [train.py:996] (2/4) Epoch 9, batch 13750, loss[loss=0.1984, simple_loss=0.2625, pruned_loss=0.06709, over 21391.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3091, pruned_loss=0.08077, over 4269734.51 frames. 
], batch size: 194, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:13:16,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1546302.0, ans=0.1 2023-06-23 18:13:29,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1546302.0, ans=0.1 2023-06-23 18:14:30,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1546482.0, ans=0.1 2023-06-23 18:14:34,986 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-23 18:14:35,702 INFO [train.py:996] (2/4) Epoch 9, batch 13800, loss[loss=0.223, simple_loss=0.3227, pruned_loss=0.0616, over 21696.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3084, pruned_loss=0.07847, over 4271424.66 frames. ], batch size: 247, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:14:52,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1546542.0, ans=0.1 2023-06-23 18:15:16,350 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.059e+02 5.824e+02 9.603e+02 1.417e+03 3.093e+03, threshold=1.921e+03, percent-clipped=19.0 2023-06-23 18:15:16,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1546602.0, ans=0.1 2023-06-23 18:15:32,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1546662.0, ans=0.125 2023-06-23 18:16:23,572 INFO [train.py:996] (2/4) Epoch 9, batch 13850, loss[loss=0.2902, simple_loss=0.3627, pruned_loss=0.1089, over 21872.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3161, pruned_loss=0.08008, over 4275442.88 frames. ], batch size: 371, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:17:23,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1546962.0, ans=0.125 2023-06-23 18:17:42,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1547082.0, ans=10.0 2023-06-23 18:18:14,858 INFO [train.py:996] (2/4) Epoch 9, batch 13900, loss[loss=0.1957, simple_loss=0.3041, pruned_loss=0.04359, over 20859.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3202, pruned_loss=0.08244, over 4281568.42 frames. 
], batch size: 608, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:18:18,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1547142.0, ans=0.125 2023-06-23 18:18:18,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1547142.0, ans=0.09899494936611666 2023-06-23 18:18:41,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.964e+02 6.028e+02 8.450e+02 1.187e+03 2.483e+03, threshold=1.690e+03, percent-clipped=4.0 2023-06-23 18:18:49,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1547262.0, ans=0.2 2023-06-23 18:18:58,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1547262.0, ans=0.125 2023-06-23 18:19:04,603 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:19:09,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1547322.0, ans=0.125 2023-06-23 18:19:12,582 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:19:14,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1547322.0, ans=0.0 2023-06-23 18:19:36,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1547382.0, ans=0.0 2023-06-23 18:19:49,970 INFO [train.py:996] (2/4) Epoch 9, batch 13950, loss[loss=0.2276, simple_loss=0.3012, pruned_loss=0.07704, over 21650.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3192, pruned_loss=0.08404, over 4284946.13 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:19:50,437 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:20:21,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1547502.0, ans=0.1 2023-06-23 18:20:29,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-23 18:20:43,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1547622.0, ans=0.1 2023-06-23 18:21:29,625 INFO [train.py:996] (2/4) Epoch 9, batch 14000, loss[loss=0.2351, simple_loss=0.332, pruned_loss=0.06913, over 21799.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3185, pruned_loss=0.08232, over 4275679.43 frames. 
], batch size: 332, lr: 3.29e-03, grad_scale: 32.0 2023-06-23 18:21:39,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1547742.0, ans=0.2 2023-06-23 18:21:53,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1547802.0, ans=0.125 2023-06-23 18:21:56,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.630e+02 5.065e+02 8.726e+02 1.240e+03 2.803e+03, threshold=1.745e+03, percent-clipped=8.0 2023-06-23 18:23:03,102 INFO [train.py:996] (2/4) Epoch 9, batch 14050, loss[loss=0.2297, simple_loss=0.2899, pruned_loss=0.08473, over 21292.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3138, pruned_loss=0.07889, over 4281156.19 frames. ], batch size: 471, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:23:50,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1548162.0, ans=10.0 2023-06-23 18:24:08,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-23 18:24:20,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1548282.0, ans=0.0 2023-06-23 18:24:22,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1548282.0, ans=0.125 2023-06-23 18:24:42,556 INFO [train.py:996] (2/4) Epoch 9, batch 14100, loss[loss=0.2728, simple_loss=0.335, pruned_loss=0.1053, over 21675.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3078, pruned_loss=0.07848, over 4286652.92 frames. ], batch size: 441, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:25:08,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=15.0 2023-06-23 18:25:10,278 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.343e+02 6.298e+02 9.143e+02 1.408e+03 2.663e+03, threshold=1.829e+03, percent-clipped=10.0 2023-06-23 18:25:15,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1548402.0, ans=0.0 2023-06-23 18:25:17,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1548462.0, ans=6.0 2023-06-23 18:26:10,598 INFO [train.py:996] (2/4) Epoch 9, batch 14150, loss[loss=0.2597, simple_loss=0.3338, pruned_loss=0.09277, over 21497.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.311, pruned_loss=0.07916, over 4282946.77 frames. ], batch size: 160, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:27:48,777 INFO [train.py:996] (2/4) Epoch 9, batch 14200, loss[loss=0.2317, simple_loss=0.2946, pruned_loss=0.08445, over 21770.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3112, pruned_loss=0.07921, over 4280457.52 frames. 
], batch size: 316, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:27:53,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1548942.0, ans=0.05 2023-06-23 18:28:22,647 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.722e+02 5.420e+02 7.650e+02 1.190e+03 2.098e+03, threshold=1.530e+03, percent-clipped=4.0 2023-06-23 18:28:27,616 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:28:44,505 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:28:48,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-23 18:29:25,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1549182.0, ans=0.125 2023-06-23 18:29:27,836 INFO [train.py:996] (2/4) Epoch 9, batch 14250, loss[loss=0.2285, simple_loss=0.3086, pruned_loss=0.07416, over 21681.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3056, pruned_loss=0.07884, over 4279159.15 frames. ], batch size: 415, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:29:53,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1549302.0, ans=0.125 2023-06-23 18:30:15,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.18 vs. limit=12.0 2023-06-23 18:31:09,739 INFO [train.py:996] (2/4) Epoch 9, batch 14300, loss[loss=0.4275, simple_loss=0.4902, pruned_loss=0.1824, over 21521.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.308, pruned_loss=0.07956, over 4272087.43 frames. ], batch size: 471, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:31:29,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1549542.0, ans=0.0 2023-06-23 18:31:49,037 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.696e+02 4.681e+02 6.476e+02 1.240e+03 3.295e+03, threshold=1.295e+03, percent-clipped=18.0 2023-06-23 18:32:28,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1549722.0, ans=0.0 2023-06-23 18:32:28,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.40 vs. limit=12.0 2023-06-23 18:32:39,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1549782.0, ans=0.125 2023-06-23 18:32:49,926 INFO [train.py:996] (2/4) Epoch 9, batch 14350, loss[loss=0.2386, simple_loss=0.3158, pruned_loss=0.08066, over 21871.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3116, pruned_loss=0.07901, over 4273473.21 frames. ], batch size: 371, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:33:37,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=15.0 2023-06-23 18:33:47,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.98 vs. 
limit=22.5 2023-06-23 18:33:54,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1550022.0, ans=0.125 2023-06-23 18:34:34,656 INFO [train.py:996] (2/4) Epoch 9, batch 14400, loss[loss=0.2328, simple_loss=0.2972, pruned_loss=0.08421, over 21643.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3095, pruned_loss=0.07985, over 4272526.56 frames. ], batch size: 332, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:34:52,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1550142.0, ans=0.125 2023-06-23 18:35:08,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1550202.0, ans=0.1 2023-06-23 18:35:09,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.631e+02 4.844e+02 6.439e+02 1.111e+03 2.671e+03, threshold=1.288e+03, percent-clipped=19.0 2023-06-23 18:35:13,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1550202.0, ans=0.125 2023-06-23 18:35:16,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1550262.0, ans=0.2 2023-06-23 18:35:17,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.49 vs. limit=5.0 2023-06-23 18:35:33,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1550322.0, ans=0.125 2023-06-23 18:35:33,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1550322.0, ans=0.05 2023-06-23 18:36:03,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.14 vs. limit=15.0 2023-06-23 18:36:07,880 INFO [train.py:996] (2/4) Epoch 9, batch 14450, loss[loss=0.1966, simple_loss=0.2655, pruned_loss=0.06381, over 21637.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3056, pruned_loss=0.08023, over 4266943.33 frames. ], batch size: 247, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:36:24,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1550442.0, ans=0.0 2023-06-23 18:37:25,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1550622.0, ans=0.125 2023-06-23 18:37:32,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1550682.0, ans=0.125 2023-06-23 18:37:43,323 INFO [train.py:996] (2/4) Epoch 9, batch 14500, loss[loss=0.2419, simple_loss=0.3208, pruned_loss=0.08152, over 21402.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3021, pruned_loss=0.07912, over 4265853.73 frames. ], batch size: 194, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:38:23,476 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.792e+02 5.208e+02 6.817e+02 8.713e+02 1.535e+03, threshold=1.363e+03, percent-clipped=1.0 2023-06-23 18:38:35,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.10 vs. 
limit=15.0 2023-06-23 18:38:40,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2023-06-23 18:38:41,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1550862.0, ans=0.1 2023-06-23 18:39:29,509 INFO [train.py:996] (2/4) Epoch 9, batch 14550, loss[loss=0.2843, simple_loss=0.3553, pruned_loss=0.1067, over 21903.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.307, pruned_loss=0.08083, over 4254529.88 frames. ], batch size: 316, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:39:31,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1551042.0, ans=0.0 2023-06-23 18:40:10,571 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-23 18:40:28,188 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.66 vs. limit=22.5 2023-06-23 18:40:37,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.36 vs. limit=22.5 2023-06-23 18:41:10,880 INFO [train.py:996] (2/4) Epoch 9, batch 14600, loss[loss=0.3069, simple_loss=0.3652, pruned_loss=0.1243, over 21334.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3152, pruned_loss=0.08433, over 4262053.77 frames. ], batch size: 507, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:41:37,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1551402.0, ans=0.0 2023-06-23 18:41:42,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.404e+02 6.083e+02 8.730e+02 1.243e+03 2.471e+03, threshold=1.746e+03, percent-clipped=17.0 2023-06-23 18:42:02,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1551522.0, ans=0.1 2023-06-23 18:42:13,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1551522.0, ans=0.125 2023-06-23 18:42:45,937 INFO [train.py:996] (2/4) Epoch 9, batch 14650, loss[loss=0.2093, simple_loss=0.3039, pruned_loss=0.05731, over 21616.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3188, pruned_loss=0.08337, over 4258978.25 frames. ], batch size: 389, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:43:02,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1551642.0, ans=0.125 2023-06-23 18:43:14,477 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.48 vs. 
limit=12.0 2023-06-23 18:43:15,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1551702.0, ans=0.0 2023-06-23 18:43:37,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1551822.0, ans=0.0 2023-06-23 18:43:37,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1551822.0, ans=0.125 2023-06-23 18:43:47,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1551822.0, ans=0.125 2023-06-23 18:43:48,953 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:43:50,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1551822.0, ans=0.125 2023-06-23 18:43:56,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.34 vs. limit=22.5 2023-06-23 18:44:21,194 INFO [train.py:996] (2/4) Epoch 9, batch 14700, loss[loss=0.2232, simple_loss=0.3332, pruned_loss=0.05662, over 21238.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3111, pruned_loss=0.07755, over 4258443.53 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:44:59,144 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 2.773e+02 5.196e+02 7.542e+02 1.109e+03 2.941e+03, threshold=1.508e+03, percent-clipped=7.0 2023-06-23 18:45:04,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1552062.0, ans=0.0 2023-06-23 18:45:04,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1552062.0, ans=0.1 2023-06-23 18:45:06,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-23 18:45:30,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1552122.0, ans=0.125 2023-06-23 18:46:08,853 INFO [train.py:996] (2/4) Epoch 9, batch 14750, loss[loss=0.3159, simple_loss=0.4008, pruned_loss=0.1155, over 21298.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3164, pruned_loss=0.08024, over 4263593.51 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:46:16,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.01 vs. 
limit=15.0 2023-06-23 18:46:19,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1552242.0, ans=0.1 2023-06-23 18:46:43,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1552302.0, ans=0.2 2023-06-23 18:46:55,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1552362.0, ans=0.1 2023-06-23 18:47:22,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1552422.0, ans=0.0 2023-06-23 18:47:37,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1552482.0, ans=0.125 2023-06-23 18:47:42,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.95 vs. limit=12.0 2023-06-23 18:47:45,548 INFO [train.py:996] (2/4) Epoch 9, batch 14800, loss[loss=0.3535, simple_loss=0.3945, pruned_loss=0.1563, over 21383.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.329, pruned_loss=0.08727, over 4262099.99 frames. ], batch size: 507, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:47:52,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-23 18:48:02,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-23 18:48:02,953 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 18:48:16,734 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.500e+02 8.733e+02 1.311e+03 2.731e+03, threshold=1.747e+03, percent-clipped=18.0 2023-06-23 18:48:50,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1552722.0, ans=0.1 2023-06-23 18:49:32,208 INFO [train.py:996] (2/4) Epoch 9, batch 14850, loss[loss=0.2247, simple_loss=0.3558, pruned_loss=0.04679, over 19853.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3232, pruned_loss=0.08702, over 4264615.53 frames. ], batch size: 702, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 18:49:35,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1552842.0, ans=0.125 2023-06-23 18:49:39,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.76 vs. limit=15.0 2023-06-23 18:49:45,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.87 vs. 
limit=15.0 2023-06-23 18:49:51,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1552902.0, ans=0.125 2023-06-23 18:50:48,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1553022.0, ans=0.125 2023-06-23 18:50:48,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1553022.0, ans=0.0 2023-06-23 18:51:09,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-23 18:51:14,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1553142.0, ans=0.07 2023-06-23 18:51:15,491 INFO [train.py:996] (2/4) Epoch 9, batch 14900, loss[loss=0.2738, simple_loss=0.3378, pruned_loss=0.1049, over 21804.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3255, pruned_loss=0.08788, over 4264651.13 frames. ], batch size: 282, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:51:23,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=6.45 vs. limit=6.0 2023-06-23 18:51:54,858 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.598e+02 5.568e+02 9.380e+02 1.428e+03 3.360e+03, threshold=1.876e+03, percent-clipped=13.0 2023-06-23 18:52:55,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-23 18:52:55,919 INFO [train.py:996] (2/4) Epoch 9, batch 14950, loss[loss=0.3062, simple_loss=0.3687, pruned_loss=0.1219, over 21453.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3268, pruned_loss=0.08747, over 4267214.60 frames. ], batch size: 509, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:53:24,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=22.5 2023-06-23 18:54:29,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1553682.0, ans=0.0 2023-06-23 18:54:37,608 INFO [train.py:996] (2/4) Epoch 9, batch 15000, loss[loss=0.229, simple_loss=0.3029, pruned_loss=0.07758, over 21385.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3293, pruned_loss=0.08878, over 4273383.06 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:54:37,608 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 18:54:48,467 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([3.9091, 3.3560, 1.5855, 1.6465], device='cuda:2') 2023-06-23 18:54:58,183 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2574, simple_loss=0.352, pruned_loss=0.08137, over 1796401.00 frames. 
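The loss entries above report a per-batch value plus a running figure of the form tot_loss[loss=..., simple_loss=..., pruned_loss=..., over N frames]. As a minimal sketch only (not the actual train.py implementation; the class and field names below are hypothetical), such frame-weighted running averages could be accumulated like this:

    # Illustrative sketch: frame-weighted running averages of the logged
    # loss fields (loss, simple_loss, pruned_loss). Hypothetical names;
    # not taken from train.py.
    from collections import defaultdict

    class RunningLoss:
        def __init__(self):
            self.sums = defaultdict(float)   # per-key frame-weighted sums
            self.frames = 0.0                # total frames seen so far

        def update(self, batch_losses, num_frames):
            # batch_losses: e.g. {"loss": 0.25, "simple_loss": 0.33, "pruned_loss": 0.09}
            for key, value in batch_losses.items():
                self.sums[key] += value * num_frames
            self.frames += num_frames

        def averages(self):
            # Frame-weighted means, analogous to "tot_loss[... over N frames]"
            return {key: total / self.frames for key, total in self.sums.items()}

    if __name__ == "__main__":
        tot = RunningLoss()
        tot.update({"loss": 0.25, "simple_loss": 0.33, "pruned_loss": 0.09}, num_frames=21000)
        tot.update({"loss": 0.23, "simple_loss": 0.31, "pruned_loss": 0.08}, num_frames=19500)
        print(tot.averages(), f"over {tot.frames} frames")
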
2023-06-23 18:54:58,184 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 18:55:18,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1553802.0, ans=0.125 2023-06-23 18:55:30,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1553802.0, ans=0.125 2023-06-23 18:55:32,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.760e+02 5.829e+02 9.207e+02 1.364e+03 3.991e+03, threshold=1.841e+03, percent-clipped=17.0 2023-06-23 18:56:38,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1554042.0, ans=0.0 2023-06-23 18:56:39,782 INFO [train.py:996] (2/4) Epoch 9, batch 15050, loss[loss=0.213, simple_loss=0.2686, pruned_loss=0.07869, over 21847.00 frames. ], tot_loss[loss=0.2549, simple_loss=0.3297, pruned_loss=0.09005, over 4279779.81 frames. ], batch size: 107, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:56:46,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1554042.0, ans=0.0 2023-06-23 18:56:48,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1554042.0, ans=0.125 2023-06-23 18:57:18,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1554102.0, ans=0.125 2023-06-23 18:58:19,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1554342.0, ans=0.125 2023-06-23 18:58:20,503 INFO [train.py:996] (2/4) Epoch 9, batch 15100, loss[loss=0.2517, simple_loss=0.3263, pruned_loss=0.08854, over 21342.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.3311, pruned_loss=0.08955, over 4270178.02 frames. ], batch size: 548, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 18:58:59,269 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 5.498e+02 7.540e+02 1.313e+03 2.793e+03, threshold=1.508e+03, percent-clipped=8.0 2023-06-23 18:59:20,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1554522.0, ans=0.04949747468305833 2023-06-23 18:59:25,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1554522.0, ans=0.2 2023-06-23 19:00:04,649 INFO [train.py:996] (2/4) Epoch 9, batch 15150, loss[loss=0.2045, simple_loss=0.273, pruned_loss=0.06802, over 21631.00 frames. ], tot_loss[loss=0.2536, simple_loss=0.3274, pruned_loss=0.08996, over 4272768.71 frames. ], batch size: 282, lr: 3.29e-03, grad_scale: 8.0 2023-06-23 19:00:14,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1554642.0, ans=0.125 2023-06-23 19:00:23,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1554642.0, ans=0.2 2023-06-23 19:00:46,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-23 19:01:13,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.98 vs. 
limit=22.5 2023-06-23 19:01:20,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1554822.0, ans=0.125 2023-06-23 19:01:45,735 INFO [train.py:996] (2/4) Epoch 9, batch 15200, loss[loss=0.1657, simple_loss=0.2495, pruned_loss=0.04092, over 21575.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3205, pruned_loss=0.08637, over 4261958.46 frames. ], batch size: 263, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:02:03,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1554942.0, ans=0.0 2023-06-23 19:02:05,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.86 vs. limit=15.0 2023-06-23 19:02:06,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1555002.0, ans=0.0 2023-06-23 19:02:19,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.584e+02 6.669e+02 9.281e+02 1.408e+03 4.015e+03, threshold=1.856e+03, percent-clipped=19.0 2023-06-23 19:03:27,221 INFO [train.py:996] (2/4) Epoch 9, batch 15250, loss[loss=0.2158, simple_loss=0.2872, pruned_loss=0.07218, over 21450.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3142, pruned_loss=0.08459, over 4263389.12 frames. ], batch size: 389, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:04:21,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1555362.0, ans=0.0 2023-06-23 19:04:57,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1555482.0, ans=10.0 2023-06-23 19:05:06,543 INFO [train.py:996] (2/4) Epoch 9, batch 15300, loss[loss=0.2589, simple_loss=0.3455, pruned_loss=0.08614, over 17950.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3168, pruned_loss=0.0869, over 4267937.09 frames. ], batch size: 61, lr: 3.29e-03, grad_scale: 16.0 2023-06-23 19:05:27,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.14 vs. limit=6.0 2023-06-23 19:05:41,056 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.812e+02 5.913e+02 8.241e+02 1.222e+03 2.288e+03, threshold=1.648e+03, percent-clipped=6.0 2023-06-23 19:06:14,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1555722.0, ans=0.125 2023-06-23 19:06:17,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1555722.0, ans=0.125 2023-06-23 19:06:30,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1555782.0, ans=0.2 2023-06-23 19:06:37,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=22.5 2023-06-23 19:06:52,450 INFO [train.py:996] (2/4) Epoch 9, batch 15350, loss[loss=0.2953, simple_loss=0.3705, pruned_loss=0.1101, over 21417.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3216, pruned_loss=0.08919, over 4270640.27 frames. 
], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:07:13,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1555902.0, ans=0.125 2023-06-23 19:07:16,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=12.0 2023-06-23 19:07:30,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.23 vs. limit=15.0 2023-06-23 19:08:03,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1556022.0, ans=0.0 2023-06-23 19:08:25,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1556142.0, ans=0.0 2023-06-23 19:08:26,418 INFO [train.py:996] (2/4) Epoch 9, batch 15400, loss[loss=0.2475, simple_loss=0.328, pruned_loss=0.08346, over 21475.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3225, pruned_loss=0.08793, over 4273694.14 frames. ], batch size: 131, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:08:47,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1556202.0, ans=0.2 2023-06-23 19:08:53,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.34 vs. limit=15.0 2023-06-23 19:08:58,589 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.809e+02 6.075e+02 7.840e+02 1.049e+03 1.941e+03, threshold=1.568e+03, percent-clipped=4.0 2023-06-23 19:09:00,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1556262.0, ans=0.125 2023-06-23 19:09:18,599 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.55 vs. limit=12.0 2023-06-23 19:09:46,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.25 vs. limit=15.0 2023-06-23 19:10:04,868 INFO [train.py:996] (2/4) Epoch 9, batch 15450, loss[loss=0.2225, simple_loss=0.3165, pruned_loss=0.06427, over 21797.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3193, pruned_loss=0.08631, over 4273736.50 frames. ], batch size: 351, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:10:29,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1556502.0, ans=0.125 2023-06-23 19:10:41,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1556562.0, ans=0.125 2023-06-23 19:11:04,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1556622.0, ans=0.125 2023-06-23 19:11:09,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1556622.0, ans=0.2 2023-06-23 19:11:46,667 INFO [train.py:996] (2/4) Epoch 9, batch 15500, loss[loss=0.2698, simple_loss=0.3438, pruned_loss=0.09786, over 21330.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3235, pruned_loss=0.08759, over 4254653.80 frames. 
], batch size: 143, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:12:05,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1556742.0, ans=0.125 2023-06-23 19:12:09,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1556802.0, ans=0.0 2023-06-23 19:12:26,941 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 5.144e+02 7.122e+02 1.016e+03 2.468e+03, threshold=1.424e+03, percent-clipped=4.0 2023-06-23 19:12:27,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1556802.0, ans=0.125 2023-06-23 19:13:09,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=22.5 2023-06-23 19:13:12,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1556982.0, ans=0.125 2023-06-23 19:13:33,993 INFO [train.py:996] (2/4) Epoch 9, batch 15550, loss[loss=0.2372, simple_loss=0.2972, pruned_loss=0.08862, over 21344.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.322, pruned_loss=0.08424, over 4255366.79 frames. ], batch size: 160, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:14:05,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557102.0, ans=0.1 2023-06-23 19:14:47,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1557222.0, ans=0.0 2023-06-23 19:14:47,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-23 19:15:14,757 INFO [train.py:996] (2/4) Epoch 9, batch 15600, loss[loss=0.2654, simple_loss=0.3298, pruned_loss=0.1005, over 21384.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3169, pruned_loss=0.08269, over 4248091.76 frames. ], batch size: 508, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:15:20,161 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:15:26,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1557342.0, ans=0.1 2023-06-23 19:15:49,478 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.952e+02 5.285e+02 6.916e+02 1.084e+03 2.169e+03, threshold=1.383e+03, percent-clipped=9.0 2023-06-23 19:16:36,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1557582.0, ans=0.07 2023-06-23 19:16:53,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1557582.0, ans=0.2 2023-06-23 19:16:55,701 INFO [train.py:996] (2/4) Epoch 9, batch 15650, loss[loss=0.2549, simple_loss=0.3114, pruned_loss=0.09924, over 21464.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3155, pruned_loss=0.08226, over 4253712.22 frames. ], batch size: 441, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:16:58,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. 
limit=10.0 2023-06-23 19:16:59,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1557642.0, ans=0.0 2023-06-23 19:17:01,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.25 vs. limit=15.0 2023-06-23 19:17:25,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1557702.0, ans=0.125 2023-06-23 19:17:45,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1557762.0, ans=0.125 2023-06-23 19:18:06,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-23 19:18:31,506 INFO [train.py:996] (2/4) Epoch 9, batch 15700, loss[loss=0.2229, simple_loss=0.287, pruned_loss=0.07936, over 21857.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3113, pruned_loss=0.0816, over 4248036.51 frames. ], batch size: 107, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:18:31,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1557942.0, ans=0.0 2023-06-23 19:18:39,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1557942.0, ans=0.125 2023-06-23 19:18:54,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1558002.0, ans=0.0 2023-06-23 19:19:07,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.566e+02 5.534e+02 7.672e+02 1.120e+03 2.103e+03, threshold=1.534e+03, percent-clipped=13.0 2023-06-23 19:19:52,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.49 vs. limit=22.5 2023-06-23 19:19:56,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1558182.0, ans=0.125 2023-06-23 19:20:00,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1558182.0, ans=0.0 2023-06-23 19:20:11,353 INFO [train.py:996] (2/4) Epoch 9, batch 15750, loss[loss=0.2174, simple_loss=0.2776, pruned_loss=0.07858, over 21957.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3068, pruned_loss=0.0815, over 4258389.07 frames. ], batch size: 119, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:20:35,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1558302.0, ans=0.125 2023-06-23 19:21:44,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1558482.0, ans=0.125 2023-06-23 19:21:50,623 INFO [train.py:996] (2/4) Epoch 9, batch 15800, loss[loss=0.2152, simple_loss=0.2774, pruned_loss=0.07649, over 21589.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.302, pruned_loss=0.08044, over 4265817.74 frames. 
], batch size: 442, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:22:12,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1558602.0, ans=0.125 2023-06-23 19:22:26,965 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.689e+02 5.286e+02 6.867e+02 8.896e+02 1.872e+03, threshold=1.373e+03, percent-clipped=1.0 2023-06-23 19:22:53,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.56 vs. limit=15.0 2023-06-23 19:23:21,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1558782.0, ans=0.2 2023-06-23 19:23:30,821 INFO [train.py:996] (2/4) Epoch 9, batch 15850, loss[loss=0.2447, simple_loss=0.319, pruned_loss=0.08517, over 21387.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3034, pruned_loss=0.08247, over 4270456.35 frames. ], batch size: 549, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:23:31,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1558842.0, ans=0.125 2023-06-23 19:23:51,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1558902.0, ans=0.125 2023-06-23 19:23:53,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1558902.0, ans=0.125 2023-06-23 19:24:05,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.16 vs. limit=22.5 2023-06-23 19:24:22,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1558962.0, ans=0.0 2023-06-23 19:24:31,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.28 vs. limit=15.0 2023-06-23 19:24:32,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=12.0 2023-06-23 19:24:38,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1559022.0, ans=0.5 2023-06-23 19:24:43,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. limit=15.0 2023-06-23 19:24:53,978 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.67 vs. limit=5.0 2023-06-23 19:24:55,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1559082.0, ans=0.125 2023-06-23 19:24:56,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1559082.0, ans=0.0 2023-06-23 19:24:59,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1559082.0, ans=0.0 2023-06-23 19:25:10,598 INFO [train.py:996] (2/4) Epoch 9, batch 15900, loss[loss=0.2405, simple_loss=0.3091, pruned_loss=0.08591, over 21833.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3005, pruned_loss=0.08259, over 4278539.89 frames. 
], batch size: 124, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:25:17,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1559142.0, ans=0.07 2023-06-23 19:25:46,392 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.755e+02 5.068e+02 6.374e+02 9.133e+02 1.940e+03, threshold=1.275e+03, percent-clipped=6.0 2023-06-23 19:26:22,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2023-06-23 19:26:28,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1559382.0, ans=0.125 2023-06-23 19:26:51,963 INFO [train.py:996] (2/4) Epoch 9, batch 15950, loss[loss=0.2313, simple_loss=0.2988, pruned_loss=0.08196, over 15503.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3007, pruned_loss=0.07985, over 4261330.13 frames. ], batch size: 60, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:28:32,476 INFO [train.py:996] (2/4) Epoch 9, batch 16000, loss[loss=0.1849, simple_loss=0.2604, pruned_loss=0.05472, over 21853.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3015, pruned_loss=0.07745, over 4269165.06 frames. ], batch size: 98, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:28:40,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=15.0 2023-06-23 19:29:07,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.766e+02 5.425e+02 7.645e+02 1.261e+03 2.910e+03, threshold=1.529e+03, percent-clipped=25.0 2023-06-23 19:29:31,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1559922.0, ans=0.125 2023-06-23 19:30:09,390 INFO [train.py:996] (2/4) Epoch 9, batch 16050, loss[loss=0.3168, simple_loss=0.4012, pruned_loss=0.1162, over 21656.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3047, pruned_loss=0.07621, over 4262971.78 frames. ], batch size: 441, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:30:09,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1560042.0, ans=0.0 2023-06-23 19:30:45,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-23 19:30:49,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=15.0 2023-06-23 19:31:00,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1560222.0, ans=0.0 2023-06-23 19:31:21,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1560222.0, ans=0.125 2023-06-23 19:31:47,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1560342.0, ans=0.125 2023-06-23 19:31:48,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.43 vs. limit=15.0 2023-06-23 19:31:48,833 INFO [train.py:996] (2/4) Epoch 9, batch 16100, loss[loss=0.2124, simple_loss=0.299, pruned_loss=0.06293, over 21295.00 frames. 
], tot_loss[loss=0.2304, simple_loss=0.3082, pruned_loss=0.07633, over 4268251.18 frames. ], batch size: 176, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:31:53,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1560342.0, ans=0.125 2023-06-23 19:32:00,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1560342.0, ans=0.0 2023-06-23 19:32:24,535 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.839e+02 1.039e+03 1.501e+03 2.959e+03, threshold=2.078e+03, percent-clipped=23.0 2023-06-23 19:32:37,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1560462.0, ans=0.125 2023-06-23 19:33:00,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1560522.0, ans=0.0 2023-06-23 19:33:03,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1560522.0, ans=0.035 2023-06-23 19:33:29,428 INFO [train.py:996] (2/4) Epoch 9, batch 16150, loss[loss=0.3057, simple_loss=0.366, pruned_loss=0.1227, over 21637.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3093, pruned_loss=0.07856, over 4278360.03 frames. ], batch size: 471, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:33:34,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1560642.0, ans=0.1 2023-06-23 19:34:06,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1560762.0, ans=0.0 2023-06-23 19:34:21,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1560762.0, ans=0.04949747468305833 2023-06-23 19:34:24,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1560762.0, ans=0.0 2023-06-23 19:35:00,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1560882.0, ans=0.1 2023-06-23 19:35:07,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2023-06-23 19:35:08,320 INFO [train.py:996] (2/4) Epoch 9, batch 16200, loss[loss=0.303, simple_loss=0.3608, pruned_loss=0.1226, over 21827.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3116, pruned_loss=0.07999, over 4283805.24 frames. 
], batch size: 118, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:35:12,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1560942.0, ans=0.125 2023-06-23 19:35:30,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1561002.0, ans=0.0 2023-06-23 19:35:45,983 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.932e+02 6.489e+02 9.592e+02 1.250e+03 2.736e+03, threshold=1.918e+03, percent-clipped=7.0 2023-06-23 19:36:32,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1561182.0, ans=0.1 2023-06-23 19:36:47,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1561182.0, ans=0.2 2023-06-23 19:36:56,501 INFO [train.py:996] (2/4) Epoch 9, batch 16250, loss[loss=0.2304, simple_loss=0.3023, pruned_loss=0.07923, over 21808.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.313, pruned_loss=0.08159, over 4284296.15 frames. ], batch size: 372, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:37:16,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1561302.0, ans=0.025 2023-06-23 19:37:24,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1561302.0, ans=0.125 2023-06-23 19:37:25,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1561302.0, ans=10.0 2023-06-23 19:37:43,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1561362.0, ans=0.1 2023-06-23 19:38:03,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1561422.0, ans=0.04949747468305833 2023-06-23 19:38:13,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.whiten.whitening_limit, batch_count=1561482.0, ans=12.0 2023-06-23 19:38:36,171 INFO [train.py:996] (2/4) Epoch 9, batch 16300, loss[loss=0.246, simple_loss=0.3175, pruned_loss=0.0872, over 21745.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3071, pruned_loss=0.07843, over 4264182.55 frames. ], batch size: 371, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:39:18,110 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.285e+02 4.893e+02 6.873e+02 9.809e+02 2.054e+03, threshold=1.375e+03, percent-clipped=1.0 2023-06-23 19:39:53,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.49 vs. limit=15.0 2023-06-23 19:40:09,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1561782.0, ans=0.125 2023-06-23 19:40:15,708 INFO [train.py:996] (2/4) Epoch 9, batch 16350, loss[loss=0.1778, simple_loss=0.2485, pruned_loss=0.05359, over 21174.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3075, pruned_loss=0.07876, over 4259019.19 frames. 
], batch size: 176, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:40:17,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1561842.0, ans=0.125 2023-06-23 19:40:21,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1561842.0, ans=0.125 2023-06-23 19:40:21,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1561842.0, ans=0.125 2023-06-23 19:40:29,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1561842.0, ans=0.125 2023-06-23 19:40:38,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1561902.0, ans=0.125 2023-06-23 19:41:08,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1561962.0, ans=0.125 2023-06-23 19:41:09,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1561962.0, ans=0.1 2023-06-23 19:41:28,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1562022.0, ans=0.1 2023-06-23 19:41:49,460 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=12.0 2023-06-23 19:41:54,440 INFO [train.py:996] (2/4) Epoch 9, batch 16400, loss[loss=0.2285, simple_loss=0.3057, pruned_loss=0.07565, over 21695.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3115, pruned_loss=0.08068, over 4267587.43 frames. ], batch size: 389, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:42:21,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1562202.0, ans=0.125 2023-06-23 19:42:27,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.45 vs. limit=15.0 2023-06-23 19:42:37,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.770e+02 4.704e+02 7.326e+02 1.027e+03 2.811e+03, threshold=1.465e+03, percent-clipped=10.0 2023-06-23 19:43:29,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1562382.0, ans=0.0 2023-06-23 19:43:34,518 INFO [train.py:996] (2/4) Epoch 9, batch 16450, loss[loss=0.2618, simple_loss=0.3185, pruned_loss=0.1025, over 21565.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3118, pruned_loss=0.08185, over 4278199.99 frames. ], batch size: 548, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:43:34,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1562442.0, ans=0.125 2023-06-23 19:44:05,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1562502.0, ans=0.0 2023-06-23 19:44:33,467 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.26 vs. 
limit=15.0 2023-06-23 19:44:48,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1562622.0, ans=0.125 2023-06-23 19:44:55,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1562682.0, ans=0.1 2023-06-23 19:44:59,118 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.074e-03 2023-06-23 19:45:02,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-23 19:45:15,097 INFO [train.py:996] (2/4) Epoch 9, batch 16500, loss[loss=0.2038, simple_loss=0.2715, pruned_loss=0.0681, over 21668.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3095, pruned_loss=0.08193, over 4280775.72 frames. ], batch size: 263, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:46:03,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 6.770e+02 9.881e+02 1.345e+03 3.319e+03, threshold=1.976e+03, percent-clipped=17.0 2023-06-23 19:46:07,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1562862.0, ans=0.0 2023-06-23 19:46:18,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1562922.0, ans=0.125 2023-06-23 19:46:44,834 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:46:56,209 INFO [train.py:996] (2/4) Epoch 9, batch 16550, loss[loss=0.2184, simple_loss=0.2798, pruned_loss=0.07849, over 21359.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3076, pruned_loss=0.07937, over 4278517.02 frames. ], batch size: 159, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:48:47,688 INFO [train.py:996] (2/4) Epoch 9, batch 16600, loss[loss=0.2952, simple_loss=0.395, pruned_loss=0.09764, over 21773.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3146, pruned_loss=0.08247, over 4275461.61 frames. ], batch size: 332, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:49:26,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.870e+02 6.782e+02 8.603e+02 1.169e+03 2.865e+03, threshold=1.721e+03, percent-clipped=6.0 2023-06-23 19:49:38,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-23 19:50:17,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1563582.0, ans=0.125 2023-06-23 19:50:28,625 INFO [train.py:996] (2/4) Epoch 9, batch 16650, loss[loss=0.3079, simple_loss=0.3798, pruned_loss=0.118, over 21448.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3246, pruned_loss=0.08475, over 4275402.87 frames. 
], batch size: 131, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:50:34,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1563642.0, ans=0.125 2023-06-23 19:50:46,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1563702.0, ans=0.1 2023-06-23 19:51:08,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1563762.0, ans=0.0 2023-06-23 19:51:40,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.69 vs. limit=15.0 2023-06-23 19:51:50,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1563882.0, ans=0.125 2023-06-23 19:51:53,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1563882.0, ans=0.2 2023-06-23 19:51:58,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1563882.0, ans=0.125 2023-06-23 19:52:02,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1563882.0, ans=0.0 2023-06-23 19:52:08,214 INFO [train.py:996] (2/4) Epoch 9, batch 16700, loss[loss=0.2255, simple_loss=0.304, pruned_loss=0.07346, over 20668.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3238, pruned_loss=0.08481, over 4274386.45 frames. ], batch size: 607, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:52:57,947 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.483e+02 5.925e+02 7.840e+02 1.052e+03 2.518e+03, threshold=1.568e+03, percent-clipped=7.0 2023-06-23 19:53:00,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.01 vs. limit=15.0 2023-06-23 19:53:03,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1564062.0, ans=0.0 2023-06-23 19:53:58,344 INFO [train.py:996] (2/4) Epoch 9, batch 16750, loss[loss=0.3569, simple_loss=0.4322, pruned_loss=0.1409, over 21441.00 frames. ], tot_loss[loss=0.2519, simple_loss=0.3275, pruned_loss=0.08817, over 4271319.98 frames. ], batch size: 507, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:54:28,667 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:55:36,237 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:55:43,494 INFO [train.py:996] (2/4) Epoch 9, batch 16800, loss[loss=0.2255, simple_loss=0.3022, pruned_loss=0.07441, over 21803.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3324, pruned_loss=0.08765, over 4266703.84 frames. ], batch size: 112, lr: 3.28e-03, grad_scale: 32.0 2023-06-23 19:56:21,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. 
limit=15.0 2023-06-23 19:56:22,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1564602.0, ans=0.125 2023-06-23 19:56:26,912 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 6.649e+02 8.449e+02 1.121e+03 2.457e+03, threshold=1.690e+03, percent-clipped=14.0 2023-06-23 19:57:23,021 INFO [train.py:996] (2/4) Epoch 9, batch 16850, loss[loss=0.2795, simple_loss=0.4106, pruned_loss=0.07416, over 20799.00 frames. ], tot_loss[loss=0.2526, simple_loss=0.3295, pruned_loss=0.08785, over 4276524.42 frames. ], batch size: 607, lr: 3.28e-03, grad_scale: 16.0 2023-06-23 19:58:02,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1564902.0, ans=0.125 2023-06-23 19:58:56,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1565082.0, ans=0.125 2023-06-23 19:58:56,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1565082.0, ans=10.0 2023-06-23 19:59:07,893 INFO [train.py:996] (2/4) Epoch 9, batch 16900, loss[loss=0.198, simple_loss=0.2757, pruned_loss=0.06014, over 21609.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.3237, pruned_loss=0.08562, over 4275825.12 frames. ], batch size: 230, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 19:59:26,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1565202.0, ans=0.125 2023-06-23 19:59:38,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1565202.0, ans=0.04949747468305833 2023-06-23 19:59:40,401 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 19:59:41,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1565262.0, ans=0.125 2023-06-23 19:59:44,350 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.40 vs. limit=22.5 2023-06-23 19:59:46,585 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.622e+02 4.767e+02 6.726e+02 1.260e+03 2.714e+03, threshold=1.345e+03, percent-clipped=10.0 2023-06-23 19:59:48,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1565262.0, ans=0.1 2023-06-23 20:00:15,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1565322.0, ans=0.125 2023-06-23 20:00:45,063 INFO [train.py:996] (2/4) Epoch 9, batch 16950, loss[loss=0.2062, simple_loss=0.2763, pruned_loss=0.06803, over 21687.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3157, pruned_loss=0.08336, over 4279795.31 frames. 
], batch size: 230, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:00:59,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1565502.0, ans=0.0 2023-06-23 20:01:32,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1565562.0, ans=0.0 2023-06-23 20:02:23,828 INFO [train.py:996] (2/4) Epoch 9, batch 17000, loss[loss=0.242, simple_loss=0.3105, pruned_loss=0.08671, over 21847.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3126, pruned_loss=0.08399, over 4288696.29 frames. ], batch size: 124, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:03:02,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1565862.0, ans=0.2 2023-06-23 20:03:04,622 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.656e+02 4.823e+02 5.834e+02 7.335e+02 1.533e+03, threshold=1.167e+03, percent-clipped=2.0 2023-06-23 20:03:10,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1565862.0, ans=0.125 2023-06-23 20:03:35,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. limit=15.0 2023-06-23 20:03:57,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1565982.0, ans=0.04949747468305833 2023-06-23 20:04:04,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1566042.0, ans=0.1 2023-06-23 20:04:05,384 INFO [train.py:996] (2/4) Epoch 9, batch 17050, loss[loss=0.2048, simple_loss=0.2692, pruned_loss=0.07018, over 20274.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.32, pruned_loss=0.08631, over 4284521.71 frames. ], batch size: 703, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:04:32,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1566102.0, ans=0.025 2023-06-23 20:04:35,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1566102.0, ans=0.0 2023-06-23 20:05:28,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1566282.0, ans=0.125 2023-06-23 20:05:44,745 INFO [train.py:996] (2/4) Epoch 9, batch 17100, loss[loss=0.2525, simple_loss=0.3173, pruned_loss=0.09389, over 21947.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3177, pruned_loss=0.08632, over 4288872.77 frames. 
], batch size: 351, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:05:52,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1566342.0, ans=0.125 2023-06-23 20:05:56,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1566342.0, ans=0.0 2023-06-23 20:06:12,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1566402.0, ans=0.2 2023-06-23 20:06:24,472 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.937e+02 5.247e+02 7.699e+02 1.064e+03 2.324e+03, threshold=1.540e+03, percent-clipped=17.0 2023-06-23 20:06:30,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1566462.0, ans=0.125 2023-06-23 20:07:16,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1566582.0, ans=0.07 2023-06-23 20:07:24,714 INFO [train.py:996] (2/4) Epoch 9, batch 17150, loss[loss=0.2364, simple_loss=0.3017, pruned_loss=0.0856, over 21573.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3151, pruned_loss=0.08635, over 4291534.93 frames. ], batch size: 212, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:07:49,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1566702.0, ans=0.1 2023-06-23 20:07:55,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1566702.0, ans=0.125 2023-06-23 20:08:19,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1566762.0, ans=0.0 2023-06-23 20:08:56,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1566882.0, ans=0.0 2023-06-23 20:08:57,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1566882.0, ans=0.04949747468305833 2023-06-23 20:09:05,729 INFO [train.py:996] (2/4) Epoch 9, batch 17200, loss[loss=0.2511, simple_loss=0.3184, pruned_loss=0.09193, over 21720.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3161, pruned_loss=0.08626, over 4292544.07 frames. ], batch size: 332, lr: 3.27e-03, grad_scale: 32.0 2023-06-23 20:09:17,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1566942.0, ans=0.125 2023-06-23 20:09:40,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1567002.0, ans=0.125 2023-06-23 20:09:53,655 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.320e+02 8.105e+02 1.085e+03 1.487e+03 3.292e+03, threshold=2.169e+03, percent-clipped=22.0 2023-06-23 20:10:52,864 INFO [train.py:996] (2/4) Epoch 9, batch 17250, loss[loss=0.2739, simple_loss=0.3527, pruned_loss=0.09752, over 21320.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3191, pruned_loss=0.0881, over 4293883.16 frames. 
], batch size: 159, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:10:54,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1567242.0, ans=0.2 2023-06-23 20:11:18,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2023-06-23 20:12:17,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1567422.0, ans=0.0 2023-06-23 20:12:34,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1567542.0, ans=0.1 2023-06-23 20:12:35,683 INFO [train.py:996] (2/4) Epoch 9, batch 17300, loss[loss=0.2883, simple_loss=0.3562, pruned_loss=0.1102, over 21645.00 frames. ], tot_loss[loss=0.2532, simple_loss=0.3257, pruned_loss=0.09031, over 4287142.66 frames. ], batch size: 389, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:12:42,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1567542.0, ans=0.125 2023-06-23 20:13:16,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.65 vs. limit=6.0 2023-06-23 20:13:28,000 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 5.597e+02 7.560e+02 1.039e+03 2.489e+03, threshold=1.512e+03, percent-clipped=1.0 2023-06-23 20:13:36,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1567662.0, ans=0.1 2023-06-23 20:14:01,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1567782.0, ans=0.125 2023-06-23 20:14:02,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1567782.0, ans=0.0 2023-06-23 20:14:22,078 INFO [train.py:996] (2/4) Epoch 9, batch 17350, loss[loss=0.2052, simple_loss=0.2961, pruned_loss=0.05719, over 21788.00 frames. ], tot_loss[loss=0.2535, simple_loss=0.3271, pruned_loss=0.08993, over 4287109.87 frames. ], batch size: 316, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:14:35,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1567842.0, ans=0.125 2023-06-23 20:15:35,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1568022.0, ans=0.2 2023-06-23 20:16:07,991 INFO [train.py:996] (2/4) Epoch 9, batch 17400, loss[loss=0.2641, simple_loss=0.3384, pruned_loss=0.09493, over 20801.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3246, pruned_loss=0.08686, over 4288218.27 frames. 
], batch size: 611, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:16:17,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1568142.0, ans=0.125 2023-06-23 20:16:41,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1568202.0, ans=0.0 2023-06-23 20:16:55,290 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.466e+02 5.667e+02 9.192e+02 1.513e+03 3.310e+03, threshold=1.838e+03, percent-clipped=24.0 2023-06-23 20:17:43,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1568382.0, ans=0.125 2023-06-23 20:17:49,316 INFO [train.py:996] (2/4) Epoch 9, batch 17450, loss[loss=0.1783, simple_loss=0.2581, pruned_loss=0.04921, over 21210.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3209, pruned_loss=0.08361, over 4282896.72 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:18:13,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1568502.0, ans=0.0 2023-06-23 20:18:25,098 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-23 20:18:32,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1568562.0, ans=0.0 2023-06-23 20:18:45,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1568562.0, ans=0.125 2023-06-23 20:19:17,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1568682.0, ans=0.07 2023-06-23 20:19:18,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1568682.0, ans=0.1 2023-06-23 20:19:19,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1568682.0, ans=0.0 2023-06-23 20:19:27,729 INFO [train.py:996] (2/4) Epoch 9, batch 17500, loss[loss=0.2271, simple_loss=0.3034, pruned_loss=0.07536, over 21670.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3161, pruned_loss=0.08115, over 4291338.65 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:19:31,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.96 vs. limit=15.0 2023-06-23 20:20:05,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1568802.0, ans=0.125 2023-06-23 20:20:19,170 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.530e+02 5.781e+02 8.305e+02 1.162e+03 2.249e+03, threshold=1.661e+03, percent-clipped=4.0 2023-06-23 20:20:28,121 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-23 20:21:04,895 INFO [train.py:996] (2/4) Epoch 9, batch 17550, loss[loss=0.2082, simple_loss=0.3005, pruned_loss=0.05796, over 21634.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3166, pruned_loss=0.08084, over 4282468.14 frames. 
], batch size: 230, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:21:11,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1569042.0, ans=0.125 2023-06-23 20:21:44,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1569162.0, ans=0.0 2023-06-23 20:22:05,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1569222.0, ans=0.2 2023-06-23 20:22:23,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1569222.0, ans=0.2 2023-06-23 20:22:43,271 INFO [train.py:996] (2/4) Epoch 9, batch 17600, loss[loss=0.2473, simple_loss=0.3277, pruned_loss=0.08349, over 21919.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3191, pruned_loss=0.08185, over 4280123.54 frames. ], batch size: 372, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:23:35,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1569462.0, ans=0.0 2023-06-23 20:23:38,188 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.747e+02 5.641e+02 7.220e+02 1.088e+03 2.051e+03, threshold=1.444e+03, percent-clipped=1.0 2023-06-23 20:24:25,257 INFO [train.py:996] (2/4) Epoch 9, batch 17650, loss[loss=0.2516, simple_loss=0.3287, pruned_loss=0.08726, over 21463.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3166, pruned_loss=0.0819, over 4283373.88 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:24:26,135 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-23 20:24:27,163 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:24:52,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1569702.0, ans=0.125 2023-06-23 20:25:17,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1569762.0, ans=0.125 2023-06-23 20:25:31,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1569822.0, ans=0.125 2023-06-23 20:25:39,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1569822.0, ans=0.1 2023-06-23 20:25:40,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.65 vs. limit=15.0 2023-06-23 20:25:42,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1569822.0, ans=0.04949747468305833 2023-06-23 20:26:14,536 INFO [train.py:996] (2/4) Epoch 9, batch 17700, loss[loss=0.2546, simple_loss=0.3406, pruned_loss=0.08432, over 21344.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3119, pruned_loss=0.07894, over 4277479.60 frames. ], batch size: 549, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:26:24,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=15.43 vs. 
limit=15.0 2023-06-23 20:26:28,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.11 vs. limit=15.0 2023-06-23 20:27:02,714 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.206e+02 6.405e+02 1.162e+03 1.769e+03 3.070e+03, threshold=2.325e+03, percent-clipped=36.0 2023-06-23 20:27:07,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1570062.0, ans=0.125 2023-06-23 20:27:54,402 INFO [train.py:996] (2/4) Epoch 9, batch 17750, loss[loss=0.2322, simple_loss=0.3076, pruned_loss=0.0784, over 21823.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3187, pruned_loss=0.08217, over 4279353.69 frames. ], batch size: 282, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:27:56,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.13 vs. limit=12.0 2023-06-23 20:28:19,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1570302.0, ans=0.0 2023-06-23 20:28:43,310 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 20:28:50,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1570362.0, ans=0.125 2023-06-23 20:29:18,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1570482.0, ans=0.2 2023-06-23 20:29:27,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.41 vs. limit=12.0 2023-06-23 20:29:39,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1570542.0, ans=0.0 2023-06-23 20:29:40,663 INFO [train.py:996] (2/4) Epoch 9, batch 17800, loss[loss=0.2758, simple_loss=0.355, pruned_loss=0.09828, over 21467.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3184, pruned_loss=0.08087, over 4271861.06 frames. ], batch size: 471, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:30:24,623 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.850e+02 7.073e+02 9.847e+02 1.535e+03 2.589e+03, threshold=1.969e+03, percent-clipped=1.0 2023-06-23 20:31:08,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1570782.0, ans=0.0 2023-06-23 20:31:17,169 INFO [train.py:996] (2/4) Epoch 9, batch 17850, loss[loss=0.2962, simple_loss=0.3684, pruned_loss=0.112, over 21594.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3206, pruned_loss=0.08156, over 4269730.49 frames. 
], batch size: 414, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:31:34,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1570842.0, ans=0.125 2023-06-23 20:31:51,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1570902.0, ans=0.125 2023-06-23 20:32:29,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1571022.0, ans=0.125 2023-06-23 20:32:36,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1571082.0, ans=0.125 2023-06-23 20:32:52,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1571082.0, ans=0.09899494936611666 2023-06-23 20:32:56,150 INFO [train.py:996] (2/4) Epoch 9, batch 17900, loss[loss=0.2419, simple_loss=0.3256, pruned_loss=0.07904, over 21305.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3248, pruned_loss=0.08335, over 4267866.60 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:33:04,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1571142.0, ans=0.125 2023-06-23 20:33:24,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-23 20:33:49,665 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.056e+02 5.757e+02 7.357e+02 9.993e+02 2.226e+03, threshold=1.471e+03, percent-clipped=2.0 2023-06-23 20:33:57,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.62 vs. limit=15.0 2023-06-23 20:34:41,339 INFO [train.py:996] (2/4) Epoch 9, batch 17950, loss[loss=0.1884, simple_loss=0.2847, pruned_loss=0.04605, over 21679.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3246, pruned_loss=0.08096, over 4269726.53 frames. ], batch size: 247, lr: 3.27e-03, grad_scale: 8.0 2023-06-23 20:35:04,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-23 20:35:07,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.70 vs. limit=5.0 2023-06-23 20:36:19,116 INFO [train.py:996] (2/4) Epoch 9, batch 18000, loss[loss=0.2204, simple_loss=0.2936, pruned_loss=0.07363, over 21331.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3172, pruned_loss=0.07881, over 4263471.73 frames. ], batch size: 131, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:36:19,117 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 20:36:30,952 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([1.4204, 3.3312, 1.9203, 1.4521], device='cuda:2') 2023-06-23 20:36:36,004 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2626, simple_loss=0.3575, pruned_loss=0.08385, over 1796401.00 frames. 
2023-06-23 20:36:36,004 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 20:36:52,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1571742.0, ans=0.1 2023-06-23 20:37:09,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1571802.0, ans=0.125 2023-06-23 20:37:29,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.324e+02 4.978e+02 7.262e+02 1.032e+03 1.973e+03, threshold=1.452e+03, percent-clipped=7.0 2023-06-23 20:38:09,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1571982.0, ans=0.2 2023-06-23 20:38:17,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1571982.0, ans=0.125 2023-06-23 20:38:20,214 INFO [train.py:996] (2/4) Epoch 9, batch 18050, loss[loss=0.2135, simple_loss=0.2851, pruned_loss=0.071, over 21839.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3088, pruned_loss=0.07732, over 4260443.22 frames. ], batch size: 372, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:38:47,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1572102.0, ans=0.125 2023-06-23 20:38:49,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1572102.0, ans=0.125 2023-06-23 20:39:28,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1572222.0, ans=0.0 2023-06-23 20:39:53,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1572282.0, ans=0.125 2023-06-23 20:39:57,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1572282.0, ans=0.2 2023-06-23 20:40:00,641 INFO [train.py:996] (2/4) Epoch 9, batch 18100, loss[loss=0.2192, simple_loss=0.3233, pruned_loss=0.05757, over 20696.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3148, pruned_loss=0.08102, over 4262352.03 frames. ], batch size: 607, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:40:19,579 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=22.5 2023-06-23 20:40:27,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=22.5 2023-06-23 20:40:55,082 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 5.547e+02 7.703e+02 1.178e+03 2.128e+03, threshold=1.541e+03, percent-clipped=13.0 2023-06-23 20:41:00,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1572462.0, ans=0.0 2023-06-23 20:41:23,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1572582.0, ans=0.0 2023-06-23 20:41:38,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.93 vs. 
limit=15.0 2023-06-23 20:41:38,946 INFO [train.py:996] (2/4) Epoch 9, batch 18150, loss[loss=0.2186, simple_loss=0.2822, pruned_loss=0.0775, over 21311.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3172, pruned_loss=0.08129, over 4253030.02 frames. ], batch size: 144, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:42:17,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1572702.0, ans=0.04949747468305833 2023-06-23 20:42:58,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1572822.0, ans=0.125 2023-06-23 20:43:05,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1572882.0, ans=0.125 2023-06-23 20:43:09,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1572882.0, ans=0.125 2023-06-23 20:43:15,633 INFO [train.py:996] (2/4) Epoch 9, batch 18200, loss[loss=0.2212, simple_loss=0.2834, pruned_loss=0.07953, over 21728.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3116, pruned_loss=0.08113, over 4248725.22 frames. ], batch size: 112, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:43:20,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1572942.0, ans=0.0 2023-06-23 20:43:20,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1572942.0, ans=0.1 2023-06-23 20:43:32,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1573002.0, ans=0.2 2023-06-23 20:43:41,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1573002.0, ans=0.0 2023-06-23 20:44:03,318 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.079e+02 5.748e+02 7.422e+02 1.064e+03 2.381e+03, threshold=1.484e+03, percent-clipped=9.0 2023-06-23 20:44:08,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1573062.0, ans=10.0 2023-06-23 20:44:51,682 INFO [train.py:996] (2/4) Epoch 9, batch 18250, loss[loss=0.2423, simple_loss=0.3126, pruned_loss=0.08602, over 21764.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3036, pruned_loss=0.07849, over 4255035.73 frames. ], batch size: 441, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:45:01,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1573242.0, ans=0.125 2023-06-23 20:45:15,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1573302.0, ans=0.125 2023-06-23 20:45:50,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1573422.0, ans=0.1 2023-06-23 20:46:30,592 INFO [train.py:996] (2/4) Epoch 9, batch 18300, loss[loss=0.246, simple_loss=0.3486, pruned_loss=0.07174, over 21754.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.305, pruned_loss=0.07893, over 4259126.60 frames. 
], batch size: 298, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:46:57,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1573602.0, ans=0.125 2023-06-23 20:47:10,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1573662.0, ans=10.0 2023-06-23 20:47:14,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.309e+02 5.362e+02 7.234e+02 9.653e+02 2.224e+03, threshold=1.447e+03, percent-clipped=7.0 2023-06-23 20:47:16,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1573662.0, ans=0.125 2023-06-23 20:48:06,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1573782.0, ans=0.1 2023-06-23 20:48:09,451 INFO [train.py:996] (2/4) Epoch 9, batch 18350, loss[loss=0.2037, simple_loss=0.2735, pruned_loss=0.0669, over 21585.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3088, pruned_loss=0.07854, over 4251365.93 frames. ], batch size: 263, lr: 3.27e-03, grad_scale: 16.0 2023-06-23 20:48:37,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1573902.0, ans=0.125 2023-06-23 20:49:23,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1574022.0, ans=0.125 2023-06-23 20:49:48,127 INFO [train.py:996] (2/4) Epoch 9, batch 18400, loss[loss=0.2086, simple_loss=0.2866, pruned_loss=0.06534, over 21211.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3051, pruned_loss=0.07727, over 4255799.95 frames. ], batch size: 176, lr: 3.27e-03, grad_scale: 32.0 2023-06-23 20:50:22,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1574202.0, ans=0.125 2023-06-23 20:50:38,668 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.618e+02 5.691e+02 8.490e+02 1.304e+03 3.377e+03, threshold=1.698e+03, percent-clipped=15.0 2023-06-23 20:51:24,263 INFO [train.py:996] (2/4) Epoch 9, batch 18450, loss[loss=0.1759, simple_loss=0.247, pruned_loss=0.05238, over 21182.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3022, pruned_loss=0.07361, over 4250863.73 frames. ], batch size: 143, lr: 3.27e-03, grad_scale: 32.0 2023-06-23 20:51:25,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-23 20:52:18,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1574562.0, ans=0.0 2023-06-23 20:52:48,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.29 vs. limit=22.5 2023-06-23 20:52:59,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574682.0, ans=0.1 2023-06-23 20:53:00,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1574682.0, ans=0.125 2023-06-23 20:53:03,426 INFO [train.py:996] (2/4) Epoch 9, batch 18500, loss[loss=0.1846, simple_loss=0.2705, pruned_loss=0.04935, over 21555.00 frames. 
], tot_loss[loss=0.2205, simple_loss=0.2966, pruned_loss=0.07222, over 4259687.59 frames. ], batch size: 230, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 20:53:05,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. limit=22.5 2023-06-23 20:53:57,745 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.581e+02 5.113e+02 7.633e+02 1.101e+03 4.944e+03, threshold=1.527e+03, percent-clipped=5.0 2023-06-23 20:54:05,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1574922.0, ans=0.125 2023-06-23 20:54:12,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1574922.0, ans=0.04949747468305833 2023-06-23 20:54:26,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1574982.0, ans=0.1 2023-06-23 20:54:42,148 INFO [train.py:996] (2/4) Epoch 9, batch 18550, loss[loss=0.2141, simple_loss=0.3035, pruned_loss=0.06233, over 21786.00 frames. ], tot_loss[loss=0.2183, simple_loss=0.2937, pruned_loss=0.07146, over 4251361.41 frames. ], batch size: 282, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 20:55:03,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1575102.0, ans=0.125 2023-06-23 20:55:12,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.42 vs. limit=12.0 2023-06-23 20:55:15,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-23 20:56:16,017 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.01 vs. limit=12.0 2023-06-23 20:56:21,022 INFO [train.py:996] (2/4) Epoch 9, batch 18600, loss[loss=0.2005, simple_loss=0.2812, pruned_loss=0.05992, over 21629.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2929, pruned_loss=0.07295, over 4257362.95 frames. ], batch size: 263, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 20:56:53,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1575402.0, ans=10.0 2023-06-23 20:56:53,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1575402.0, ans=0.125 2023-06-23 20:56:57,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1575462.0, ans=0.125 2023-06-23 20:57:16,745 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.060e+02 5.217e+02 6.797e+02 9.080e+02 2.355e+03, threshold=1.359e+03, percent-clipped=3.0 2023-06-23 20:57:18,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1575462.0, ans=0.125 2023-06-23 20:57:59,909 INFO [train.py:996] (2/4) Epoch 9, batch 18650, loss[loss=0.1989, simple_loss=0.2643, pruned_loss=0.06671, over 21584.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.292, pruned_loss=0.07241, over 4257743.28 frames. 
], batch size: 230, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 20:58:34,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1575702.0, ans=0.2 2023-06-23 20:59:28,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1575882.0, ans=0.0 2023-06-23 20:59:33,507 INFO [train.py:996] (2/4) Epoch 9, batch 18700, loss[loss=0.2437, simple_loss=0.3016, pruned_loss=0.09291, over 21190.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2911, pruned_loss=0.07408, over 4254593.39 frames. ], batch size: 143, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 20:59:52,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=22.5 2023-06-23 21:00:03,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1576002.0, ans=10.0 2023-06-23 21:00:03,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1576002.0, ans=0.1 2023-06-23 21:00:27,896 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.989e+02 5.925e+02 8.021e+02 1.131e+03 2.066e+03, threshold=1.604e+03, percent-clipped=15.0 2023-06-23 21:00:44,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-23 21:01:06,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1576182.0, ans=0.125 2023-06-23 21:01:10,839 INFO [train.py:996] (2/4) Epoch 9, batch 18750, loss[loss=0.2222, simple_loss=0.2691, pruned_loss=0.08764, over 20306.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2927, pruned_loss=0.07649, over 4240313.48 frames. ], batch size: 703, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:01:30,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.48 vs. limit=15.0 2023-06-23 21:01:39,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1576302.0, ans=0.125 2023-06-23 21:01:39,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1576302.0, ans=0.1 2023-06-23 21:02:06,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1576362.0, ans=0.125 2023-06-23 21:02:23,615 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.19 vs. limit=12.0 2023-06-23 21:02:50,409 INFO [train.py:996] (2/4) Epoch 9, batch 18800, loss[loss=0.1968, simple_loss=0.257, pruned_loss=0.06831, over 20266.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2987, pruned_loss=0.07789, over 4239957.89 frames. 
], batch size: 703, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:03:12,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1576602.0, ans=0.015 2023-06-23 21:03:29,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1576602.0, ans=0.125 2023-06-23 21:03:48,140 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.728e+02 5.591e+02 7.847e+02 1.340e+03 4.014e+03, threshold=1.569e+03, percent-clipped=18.0 2023-06-23 21:04:24,331 INFO [train.py:996] (2/4) Epoch 9, batch 18850, loss[loss=0.1662, simple_loss=0.247, pruned_loss=0.04274, over 21242.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.295, pruned_loss=0.0733, over 4241512.10 frames. ], batch size: 176, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:05:00,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1576902.0, ans=0.0 2023-06-23 21:05:46,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1577082.0, ans=0.0 2023-06-23 21:05:46,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1577082.0, ans=0.2 2023-06-23 21:06:01,645 INFO [train.py:996] (2/4) Epoch 9, batch 18900, loss[loss=0.2602, simple_loss=0.3046, pruned_loss=0.1078, over 21722.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2921, pruned_loss=0.0737, over 4249670.27 frames. ], batch size: 511, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:06:21,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1577202.0, ans=0.95 2023-06-23 21:06:23,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1577202.0, ans=0.2 2023-06-23 21:07:02,784 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.478e+02 5.570e+02 8.138e+02 1.105e+03 2.529e+03, threshold=1.628e+03, percent-clipped=6.0 2023-06-23 21:07:24,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1577382.0, ans=10.0 2023-06-23 21:07:32,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1577382.0, ans=0.04949747468305833 2023-06-23 21:07:40,569 INFO [train.py:996] (2/4) Epoch 9, batch 18950, loss[loss=0.3433, simple_loss=0.4055, pruned_loss=0.1405, over 21727.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2942, pruned_loss=0.07645, over 4267602.22 frames. 
], batch size: 511, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:07:41,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1577442.0, ans=0.125 2023-06-23 21:08:24,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1577502.0, ans=0.07 2023-06-23 21:08:25,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1577562.0, ans=0.0 2023-06-23 21:08:48,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1577622.0, ans=0.125 2023-06-23 21:09:13,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1577682.0, ans=0.125 2023-06-23 21:09:25,300 INFO [train.py:996] (2/4) Epoch 9, batch 19000, loss[loss=0.2578, simple_loss=0.332, pruned_loss=0.09177, over 21295.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3031, pruned_loss=0.07764, over 4270895.54 frames. ], batch size: 143, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:09:28,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1577742.0, ans=0.125 2023-06-23 21:09:46,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1577802.0, ans=0.2 2023-06-23 21:09:51,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1577802.0, ans=0.125 2023-06-23 21:10:22,694 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.986e+02 5.803e+02 7.303e+02 9.676e+02 2.097e+03, threshold=1.461e+03, percent-clipped=8.0 2023-06-23 21:11:04,884 INFO [train.py:996] (2/4) Epoch 9, batch 19050, loss[loss=0.2529, simple_loss=0.322, pruned_loss=0.09189, over 21657.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3087, pruned_loss=0.08249, over 4279047.91 frames. ], batch size: 389, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:11:06,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1578042.0, ans=0.125 2023-06-23 21:11:30,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1578102.0, ans=0.125 2023-06-23 21:12:44,208 INFO [train.py:996] (2/4) Epoch 9, batch 19100, loss[loss=0.2318, simple_loss=0.2901, pruned_loss=0.08677, over 21435.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3093, pruned_loss=0.08449, over 4285882.26 frames. ], batch size: 195, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:13:27,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1578462.0, ans=0.0 2023-06-23 21:13:27,845 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.01 vs. limit=15.0 2023-06-23 21:13:38,213 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.144e+02 5.907e+02 9.492e+02 1.356e+03 2.303e+03, threshold=1.898e+03, percent-clipped=18.0 2023-06-23 21:13:42,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1578522.0, ans=0.125 2023-06-23 21:14:26,125 INFO [train.py:996] (2/4) Epoch 9, batch 19150, loss[loss=0.2514, simple_loss=0.3176, pruned_loss=0.09262, over 19887.00 frames. 
], tot_loss[loss=0.2397, simple_loss=0.3109, pruned_loss=0.08428, over 4285991.62 frames. ], batch size: 702, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:14:55,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-23 21:15:06,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1578702.0, ans=0.2 2023-06-23 21:15:24,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1578822.0, ans=0.125 2023-06-23 21:15:43,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1578822.0, ans=0.125 2023-06-23 21:16:07,489 INFO [train.py:996] (2/4) Epoch 9, batch 19200, loss[loss=0.2087, simple_loss=0.2924, pruned_loss=0.06252, over 21268.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3179, pruned_loss=0.08453, over 4278368.21 frames. ], batch size: 159, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:16:39,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1579002.0, ans=0.125 2023-06-23 21:16:52,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1579062.0, ans=0.125 2023-06-23 21:17:00,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1579062.0, ans=0.125 2023-06-23 21:17:01,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.099e+02 6.099e+02 9.205e+02 1.363e+03 2.424e+03, threshold=1.841e+03, percent-clipped=8.0 2023-06-23 21:17:30,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=1579182.0, ans=0.2 2023-06-23 21:17:46,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1579242.0, ans=0.125 2023-06-23 21:17:47,695 INFO [train.py:996] (2/4) Epoch 9, batch 19250, loss[loss=0.2519, simple_loss=0.3182, pruned_loss=0.09283, over 21855.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3174, pruned_loss=0.07965, over 4279980.31 frames. ], batch size: 107, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:18:09,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1579242.0, ans=0.2 2023-06-23 21:18:25,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.61 vs. limit=15.0 2023-06-23 21:18:37,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1579362.0, ans=0.0 2023-06-23 21:19:11,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1579482.0, ans=0.04949747468305833 2023-06-23 21:19:22,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1579482.0, ans=0.125 2023-06-23 21:19:27,367 INFO [train.py:996] (2/4) Epoch 9, batch 19300, loss[loss=0.2091, simple_loss=0.2866, pruned_loss=0.06573, over 21878.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3149, pruned_loss=0.07967, over 4277406.90 frames. 
], batch size: 351, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:20:23,010 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.305e+02 4.858e+02 7.673e+02 1.130e+03 2.664e+03, threshold=1.535e+03, percent-clipped=6.0 2023-06-23 21:20:44,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1579722.0, ans=0.2 2023-06-23 21:21:18,257 INFO [train.py:996] (2/4) Epoch 9, batch 19350, loss[loss=0.2001, simple_loss=0.2877, pruned_loss=0.05628, over 21711.00 frames. ], tot_loss[loss=0.231, simple_loss=0.311, pruned_loss=0.0755, over 4277937.96 frames. ], batch size: 351, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:21:20,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1579842.0, ans=0.125 2023-06-23 21:21:32,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1579902.0, ans=0.2 2023-06-23 21:22:07,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1580022.0, ans=0.04949747468305833 2023-06-23 21:22:20,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1580022.0, ans=0.125 2023-06-23 21:22:46,728 INFO [train.py:996] (2/4) Epoch 9, batch 19400, loss[loss=0.2704, simple_loss=0.3394, pruned_loss=0.1006, over 21817.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3078, pruned_loss=0.07459, over 4280025.74 frames. ], batch size: 391, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:23:24,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1580202.0, ans=0.125 2023-06-23 21:23:41,603 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 5.598e+02 7.737e+02 1.076e+03 2.272e+03, threshold=1.547e+03, percent-clipped=6.0 2023-06-23 21:23:51,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1580322.0, ans=0.125 2023-06-23 21:24:03,048 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-23 21:24:30,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1580442.0, ans=0.125 2023-06-23 21:24:30,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=7.10 vs. limit=6.0 2023-06-23 21:24:36,318 INFO [train.py:996] (2/4) Epoch 9, batch 19450, loss[loss=0.2923, simple_loss=0.3553, pruned_loss=0.1146, over 14853.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3052, pruned_loss=0.07587, over 4273586.39 frames. ], batch size: 60, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:25:15,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1580562.0, ans=15.0 2023-06-23 21:25:38,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=15.0 2023-06-23 21:25:49,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1580682.0, ans=0.125 2023-06-23 21:26:17,319 INFO [train.py:996] (2/4) Epoch 9, batch 19500, loss[loss=0.22, simple_loss=0.2682, pruned_loss=0.08586, over 21108.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3018, pruned_loss=0.07698, over 4263435.40 frames. ], batch size: 143, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:26:24,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1580742.0, ans=0.125 2023-06-23 21:26:25,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1580742.0, ans=0.0 2023-06-23 21:27:07,901 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.161e+02 5.798e+02 8.156e+02 1.303e+03 2.400e+03, threshold=1.631e+03, percent-clipped=12.0 2023-06-23 21:27:30,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1580982.0, ans=0.0 2023-06-23 21:27:34,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.96 vs. limit=15.0 2023-06-23 21:27:37,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1580982.0, ans=0.0 2023-06-23 21:27:45,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1580982.0, ans=0.125 2023-06-23 21:27:57,874 INFO [train.py:996] (2/4) Epoch 9, batch 19550, loss[loss=0.186, simple_loss=0.2616, pruned_loss=0.0552, over 21199.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2967, pruned_loss=0.07542, over 4250055.56 frames. ], batch size: 159, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:28:22,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1581102.0, ans=0.1 2023-06-23 21:28:30,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1581162.0, ans=0.0 2023-06-23 21:28:42,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1581162.0, ans=0.125 2023-06-23 21:28:53,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1581222.0, ans=0.125 2023-06-23 21:28:54,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1581222.0, ans=0.0 2023-06-23 21:29:35,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1581342.0, ans=0.125 2023-06-23 21:29:37,013 INFO [train.py:996] (2/4) Epoch 9, batch 19600, loss[loss=0.2275, simple_loss=0.3008, pruned_loss=0.07712, over 21481.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2999, pruned_loss=0.07611, over 4256325.18 frames. ], batch size: 194, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:29:47,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.80 vs. 
limit=15.0 2023-06-23 21:30:25,819 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.955e+02 6.042e+02 7.862e+02 1.200e+03 2.695e+03, threshold=1.572e+03, percent-clipped=11.0 2023-06-23 21:30:29,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1581522.0, ans=0.07 2023-06-23 21:30:41,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1581522.0, ans=0.0 2023-06-23 21:30:59,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1581582.0, ans=0.125 2023-06-23 21:31:06,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1581582.0, ans=0.1 2023-06-23 21:31:11,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1581582.0, ans=0.125 2023-06-23 21:31:15,786 INFO [train.py:996] (2/4) Epoch 9, batch 19650, loss[loss=0.2333, simple_loss=0.3054, pruned_loss=0.08059, over 21950.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3052, pruned_loss=0.08034, over 4265987.93 frames. ], batch size: 316, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:31:22,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1581642.0, ans=0.1 2023-06-23 21:31:36,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.68 vs. limit=12.0 2023-06-23 21:31:47,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1581702.0, ans=0.125 2023-06-23 21:31:53,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1581762.0, ans=0.05 2023-06-23 21:32:32,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1581822.0, ans=0.125 2023-06-23 21:32:47,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1581882.0, ans=0.125 2023-06-23 21:32:59,044 INFO [train.py:996] (2/4) Epoch 9, batch 19700, loss[loss=0.2048, simple_loss=0.2842, pruned_loss=0.06269, over 21359.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.309, pruned_loss=0.08135, over 4268283.51 frames. 
], batch size: 194, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:33:02,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1581942.0, ans=0.2 2023-06-23 21:33:14,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1581942.0, ans=0.95 2023-06-23 21:33:17,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1581942.0, ans=0.125 2023-06-23 21:33:36,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1582002.0, ans=0.125 2023-06-23 21:33:59,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1582062.0, ans=0.125 2023-06-23 21:34:05,973 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 5.626e+02 7.962e+02 1.128e+03 2.480e+03, threshold=1.592e+03, percent-clipped=10.0 2023-06-23 21:34:15,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1582122.0, ans=0.2 2023-06-23 21:34:43,680 INFO [train.py:996] (2/4) Epoch 9, batch 19750, loss[loss=0.2687, simple_loss=0.3558, pruned_loss=0.09076, over 21285.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3197, pruned_loss=0.08336, over 4269732.26 frames. ], batch size: 176, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:35:29,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1582362.0, ans=0.1 2023-06-23 21:36:01,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1582482.0, ans=0.5 2023-06-23 21:36:05,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.83 vs. limit=15.0 2023-06-23 21:36:19,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1582482.0, ans=0.125 2023-06-23 21:36:23,333 INFO [train.py:996] (2/4) Epoch 9, batch 19800, loss[loss=0.2486, simple_loss=0.3111, pruned_loss=0.09305, over 19975.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3188, pruned_loss=0.08361, over 4267685.91 frames. ], batch size: 702, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:37:08,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1582662.0, ans=0.125 2023-06-23 21:37:26,054 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.106e+02 6.339e+02 1.006e+03 1.413e+03 2.674e+03, threshold=2.011e+03, percent-clipped=18.0 2023-06-23 21:37:42,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1582782.0, ans=0.125 2023-06-23 21:37:58,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1582782.0, ans=0.2 2023-06-23 21:38:05,339 INFO [train.py:996] (2/4) Epoch 9, batch 19850, loss[loss=0.1804, simple_loss=0.2665, pruned_loss=0.04712, over 21600.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3118, pruned_loss=0.07891, over 4270002.09 frames. 
], batch size: 263, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:38:52,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1582962.0, ans=0.125 2023-06-23 21:38:57,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1582962.0, ans=0.125 2023-06-23 21:39:02,338 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-06-23 21:39:03,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1582962.0, ans=0.0 2023-06-23 21:39:07,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-23 21:39:08,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1583022.0, ans=0.0 2023-06-23 21:39:42,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1583142.0, ans=0.1 2023-06-23 21:39:43,943 INFO [train.py:996] (2/4) Epoch 9, batch 19900, loss[loss=0.1821, simple_loss=0.265, pruned_loss=0.04957, over 21575.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3111, pruned_loss=0.07584, over 4267391.92 frames. ], batch size: 247, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:40:26,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1583202.0, ans=0.125 2023-06-23 21:40:31,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1583262.0, ans=0.125 2023-06-23 21:40:40,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1583262.0, ans=0.1 2023-06-23 21:40:45,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.607e+02 5.026e+02 6.181e+02 9.309e+02 2.570e+03, threshold=1.236e+03, percent-clipped=2.0 2023-06-23 21:40:50,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-23 21:40:52,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1583322.0, ans=0.0 2023-06-23 21:40:55,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1583322.0, ans=0.1 2023-06-23 21:41:06,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1583382.0, ans=15.0 2023-06-23 21:41:27,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1583442.0, ans=0.09899494936611666 2023-06-23 21:41:28,255 INFO [train.py:996] (2/4) Epoch 9, batch 19950, loss[loss=0.2589, simple_loss=0.3286, pruned_loss=0.09466, over 21614.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3055, pruned_loss=0.0753, over 4263795.04 frames. 
], batch size: 414, lr: 3.26e-03, grad_scale: 16.0 2023-06-23 21:41:58,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1583502.0, ans=0.125 2023-06-23 21:42:32,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1583622.0, ans=15.0 2023-06-23 21:42:38,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1583622.0, ans=0.2 2023-06-23 21:42:42,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1583622.0, ans=0.0 2023-06-23 21:43:01,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1583682.0, ans=0.05 2023-06-23 21:43:02,100 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.10 vs. limit=10.0 2023-06-23 21:43:07,068 INFO [train.py:996] (2/4) Epoch 9, batch 20000, loss[loss=0.2262, simple_loss=0.2915, pruned_loss=0.08047, over 20073.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3049, pruned_loss=0.07549, over 4272796.62 frames. ], batch size: 707, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:43:49,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1583862.0, ans=0.125 2023-06-23 21:43:52,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1583862.0, ans=0.0 2023-06-23 21:44:00,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1583862.0, ans=0.0 2023-06-23 21:44:03,260 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.657e+02 5.210e+02 7.571e+02 1.068e+03 2.474e+03, threshold=1.514e+03, percent-clipped=20.0 2023-06-23 21:44:33,944 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.67 vs. limit=15.0 2023-06-23 21:44:46,975 INFO [train.py:996] (2/4) Epoch 9, batch 20050, loss[loss=0.2798, simple_loss=0.3378, pruned_loss=0.1109, over 21567.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3069, pruned_loss=0.07772, over 4277936.83 frames. ], batch size: 548, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:44:47,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1584042.0, ans=0.2 2023-06-23 21:45:47,422 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.73 vs. limit=6.0 2023-06-23 21:46:28,200 INFO [train.py:996] (2/4) Epoch 9, batch 20100, loss[loss=0.2487, simple_loss=0.34, pruned_loss=0.07871, over 21812.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3103, pruned_loss=0.08018, over 4292063.85 frames. 
], batch size: 332, lr: 3.26e-03, grad_scale: 32.0 2023-06-23 21:47:30,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1584522.0, ans=0.04949747468305833 2023-06-23 21:47:32,233 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.946e+02 5.376e+02 7.013e+02 1.127e+03 1.999e+03, threshold=1.403e+03, percent-clipped=12.0 2023-06-23 21:47:42,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1584522.0, ans=0.125 2023-06-23 21:48:06,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1584582.0, ans=0.125 2023-06-23 21:48:18,544 INFO [train.py:996] (2/4) Epoch 9, batch 20150, loss[loss=0.2881, simple_loss=0.3562, pruned_loss=0.1099, over 21454.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3214, pruned_loss=0.08424, over 4290969.72 frames. ], batch size: 471, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:48:29,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1584642.0, ans=0.125 2023-06-23 21:48:52,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1584702.0, ans=0.5 2023-06-23 21:48:55,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1584762.0, ans=0.125 2023-06-23 21:49:37,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1584822.0, ans=0.125 2023-06-23 21:50:01,803 INFO [train.py:996] (2/4) Epoch 9, batch 20200, loss[loss=0.2959, simple_loss=0.3929, pruned_loss=0.09939, over 21701.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3268, pruned_loss=0.08706, over 4291544.77 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:50:21,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1585002.0, ans=0.125 2023-06-23 21:50:44,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1585062.0, ans=0.125 2023-06-23 21:50:59,407 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.884e+02 7.487e+02 1.027e+03 1.466e+03 2.661e+03, threshold=2.055e+03, percent-clipped=25.0 2023-06-23 21:51:34,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1585182.0, ans=0.125 2023-06-23 21:51:37,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1585182.0, ans=0.2 2023-06-23 21:51:42,587 INFO [train.py:996] (2/4) Epoch 9, batch 20250, loss[loss=0.2131, simple_loss=0.3031, pruned_loss=0.0615, over 20853.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3275, pruned_loss=0.08612, over 4284076.64 frames. ], batch size: 607, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:51:43,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-23 21:52:02,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. 
limit=22.5 2023-06-23 21:52:11,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. limit=6.0 2023-06-23 21:52:17,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1585302.0, ans=0.0 2023-06-23 21:52:55,021 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:53:09,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1585482.0, ans=0.125 2023-06-23 21:53:13,649 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.16 vs. limit=22.5 2023-06-23 21:53:16,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1585482.0, ans=0.125 2023-06-23 21:53:22,022 INFO [train.py:996] (2/4) Epoch 9, batch 20300, loss[loss=0.2002, simple_loss=0.2906, pruned_loss=0.05487, over 21769.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3235, pruned_loss=0.08298, over 4290129.22 frames. ], batch size: 282, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:53:35,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1585542.0, ans=0.1 2023-06-23 21:53:35,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1585542.0, ans=0.0 2023-06-23 21:54:28,240 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.674e+02 5.917e+02 9.257e+02 1.332e+03 3.110e+03, threshold=1.851e+03, percent-clipped=6.0 2023-06-23 21:54:43,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1585782.0, ans=0.125 2023-06-23 21:54:44,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1585782.0, ans=0.2 2023-06-23 21:55:00,616 INFO [train.py:996] (2/4) Epoch 9, batch 20350, loss[loss=0.2628, simple_loss=0.3274, pruned_loss=0.09911, over 21880.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3254, pruned_loss=0.08447, over 4283306.33 frames. ], batch size: 118, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 21:55:08,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1585842.0, ans=0.125 2023-06-23 21:56:04,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-23 21:56:22,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-23 21:56:39,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1586142.0, ans=0.125 2023-06-23 21:56:40,518 INFO [train.py:996] (2/4) Epoch 9, batch 20400, loss[loss=0.2795, simple_loss=0.345, pruned_loss=0.107, over 21430.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3282, pruned_loss=0.08748, over 4281694.46 frames. 
], batch size: 211, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 21:56:50,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1586142.0, ans=0.125 2023-06-23 21:56:51,497 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.11 vs. limit=6.0 2023-06-23 21:57:03,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1586202.0, ans=0.09899494936611666 2023-06-23 21:57:15,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-23 21:57:24,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1586262.0, ans=0.5 2023-06-23 21:57:37,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1586262.0, ans=0.125 2023-06-23 21:57:42,768 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.128e+02 5.797e+02 8.307e+02 1.162e+03 2.401e+03, threshold=1.661e+03, percent-clipped=4.0 2023-06-23 21:57:49,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1586322.0, ans=0.125 2023-06-23 21:57:56,215 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 21:58:00,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1586382.0, ans=0.125 2023-06-23 21:58:04,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1586382.0, ans=0.1 2023-06-23 21:58:15,248 INFO [train.py:996] (2/4) Epoch 9, batch 20450, loss[loss=0.2361, simple_loss=0.3049, pruned_loss=0.0836, over 21944.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3276, pruned_loss=0.0891, over 4272110.88 frames. ], batch size: 316, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 21:59:25,939 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.35 vs. limit=22.5 2023-06-23 21:59:46,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1586682.0, ans=0.1 2023-06-23 21:59:49,246 INFO [train.py:996] (2/4) Epoch 9, batch 20500, loss[loss=0.2106, simple_loss=0.2741, pruned_loss=0.07354, over 21617.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3229, pruned_loss=0.0892, over 4269025.07 frames. 
], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:00:08,284 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:00:37,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1586862.0, ans=0.1 2023-06-23 22:00:58,940 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 5.832e+02 8.906e+02 1.259e+03 2.508e+03, threshold=1.781e+03, percent-clipped=16.0 2023-06-23 22:01:06,995 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1586922.0, ans=0.2 2023-06-23 22:01:18,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1586982.0, ans=0.0 2023-06-23 22:01:29,767 INFO [train.py:996] (2/4) Epoch 9, batch 20550, loss[loss=0.2432, simple_loss=0.312, pruned_loss=0.08721, over 21570.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3144, pruned_loss=0.08648, over 4273241.81 frames. ], batch size: 414, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:01:58,347 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.95 vs. limit=22.5 2023-06-23 22:01:59,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1587102.0, ans=0.125 2023-06-23 22:02:47,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587222.0, ans=0.1 2023-06-23 22:02:58,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1587282.0, ans=0.125 2023-06-23 22:03:10,656 INFO [train.py:996] (2/4) Epoch 9, batch 20600, loss[loss=0.2895, simple_loss=0.3493, pruned_loss=0.1148, over 21786.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3173, pruned_loss=0.08513, over 4267211.83 frames. ], batch size: 441, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:03:25,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1587342.0, ans=0.0 2023-06-23 22:03:31,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1587402.0, ans=0.125 2023-06-23 22:04:20,269 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.756e+02 4.749e+02 5.700e+02 8.605e+02 1.495e+03, threshold=1.140e+03, percent-clipped=0.0 2023-06-23 22:04:51,865 INFO [train.py:996] (2/4) Epoch 9, batch 20650, loss[loss=0.1882, simple_loss=0.26, pruned_loss=0.05822, over 21688.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3137, pruned_loss=0.08527, over 4274937.83 frames. 
], batch size: 298, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:04:56,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1587642.0, ans=0.1 2023-06-23 22:05:29,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1587702.0, ans=0.1 2023-06-23 22:06:13,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1587822.0, ans=0.125 2023-06-23 22:06:18,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer_na.min_abs, batch_count=1587882.0, ans=0.02 2023-06-23 22:06:32,155 INFO [train.py:996] (2/4) Epoch 9, batch 20700, loss[loss=0.1931, simple_loss=0.2609, pruned_loss=0.06261, over 21768.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3042, pruned_loss=0.08089, over 4265773.14 frames. ], batch size: 124, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:06:58,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1588002.0, ans=0.0 2023-06-23 22:07:00,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1588002.0, ans=0.0 2023-06-23 22:07:22,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1588062.0, ans=0.0 2023-06-23 22:07:23,002 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-23 22:07:44,044 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.364e+02 5.397e+02 7.207e+02 1.144e+03 2.870e+03, threshold=1.441e+03, percent-clipped=25.0 2023-06-23 22:08:20,554 INFO [train.py:996] (2/4) Epoch 9, batch 20750, loss[loss=0.2499, simple_loss=0.3452, pruned_loss=0.07727, over 21632.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3071, pruned_loss=0.07997, over 4263178.90 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:08:26,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1588242.0, ans=0.1 2023-06-23 22:08:50,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.10 vs. limit=22.5 2023-06-23 22:09:18,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1588362.0, ans=0.125 2023-06-23 22:09:53,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1588482.0, ans=0.125 2023-06-23 22:10:00,929 INFO [train.py:996] (2/4) Epoch 9, batch 20800, loss[loss=0.2255, simple_loss=0.292, pruned_loss=0.0795, over 21468.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3127, pruned_loss=0.08215, over 4261291.52 frames. 
], batch size: 441, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 22:10:51,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1588662.0, ans=0.0 2023-06-23 22:11:06,864 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.367e+02 5.156e+02 7.961e+02 1.123e+03 3.663e+03, threshold=1.592e+03, percent-clipped=17.0 2023-06-23 22:11:40,266 INFO [train.py:996] (2/4) Epoch 9, batch 20850, loss[loss=0.1942, simple_loss=0.264, pruned_loss=0.0622, over 21781.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3046, pruned_loss=0.07967, over 4262217.62 frames. ], batch size: 247, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:11:41,112 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=22.5 2023-06-23 22:12:07,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1588902.0, ans=0.0 2023-06-23 22:12:53,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1589022.0, ans=0.0 2023-06-23 22:13:11,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1589082.0, ans=0.1 2023-06-23 22:13:19,717 INFO [train.py:996] (2/4) Epoch 9, batch 20900, loss[loss=0.2075, simple_loss=0.288, pruned_loss=0.06346, over 21585.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3049, pruned_loss=0.08117, over 4269492.59 frames. ], batch size: 263, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:13:24,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1589142.0, ans=0.125 2023-06-23 22:13:55,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-06-23 22:14:04,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1589262.0, ans=0.2 2023-06-23 22:14:23,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.693e+02 5.793e+02 9.224e+02 1.777e+03 3.715e+03, threshold=1.845e+03, percent-clipped=30.0 2023-06-23 22:14:51,902 INFO [train.py:996] (2/4) Epoch 9, batch 20950, loss[loss=0.1749, simple_loss=0.2472, pruned_loss=0.05124, over 21441.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3005, pruned_loss=0.07735, over 4269631.71 frames. ], batch size: 131, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:15:03,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1589442.0, ans=0.1 2023-06-23 22:16:01,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1589622.0, ans=0.1 2023-06-23 22:16:29,628 INFO [train.py:996] (2/4) Epoch 9, batch 21000, loss[loss=0.2356, simple_loss=0.3075, pruned_loss=0.0819, over 21913.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2988, pruned_loss=0.0775, over 4266654.89 frames. 
], batch size: 316, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:16:29,628 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 22:16:43,222 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.4821, 2.2867, 3.4829, 3.8093], device='cuda:2') 2023-06-23 22:16:50,154 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2633, simple_loss=0.3613, pruned_loss=0.0826, over 1796401.00 frames. 2023-06-23 22:16:50,155 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 22:17:26,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1589862.0, ans=15.0 2023-06-23 22:17:43,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1589862.0, ans=0.0 2023-06-23 22:17:49,008 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.69 vs. limit=15.0 2023-06-23 22:17:51,342 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 5.898e+02 8.113e+02 1.195e+03 2.501e+03, threshold=1.623e+03, percent-clipped=8.0 2023-06-23 22:18:30,029 INFO [train.py:996] (2/4) Epoch 9, batch 21050, loss[loss=0.2118, simple_loss=0.2787, pruned_loss=0.07244, over 21278.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2969, pruned_loss=0.07754, over 4273899.65 frames. ], batch size: 177, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:18:30,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1590042.0, ans=0.125 2023-06-23 22:18:51,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1590102.0, ans=0.125 2023-06-23 22:19:04,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1590102.0, ans=0.125 2023-06-23 22:19:22,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-23 22:20:08,617 INFO [train.py:996] (2/4) Epoch 9, batch 21100, loss[loss=0.1817, simple_loss=0.2303, pruned_loss=0.06654, over 20755.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2947, pruned_loss=0.0775, over 4275366.59 frames. ], batch size: 608, lr: 3.25e-03, grad_scale: 8.0 2023-06-23 22:20:59,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1590462.0, ans=0.125 2023-06-23 22:21:11,569 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.438e+02 5.136e+02 6.651e+02 8.328e+02 1.901e+03, threshold=1.330e+03, percent-clipped=2.0 2023-06-23 22:21:15,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1590522.0, ans=0.125 2023-06-23 22:21:23,212 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:21:30,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1590582.0, ans=0.125 2023-06-23 22:21:48,071 INFO [train.py:996] (2/4) Epoch 9, batch 21150, loss[loss=0.2144, simple_loss=0.2774, pruned_loss=0.0757, over 21826.00 frames. 
], tot_loss[loss=0.2236, simple_loss=0.2919, pruned_loss=0.0777, over 4268752.42 frames. ], batch size: 107, lr: 3.25e-03, grad_scale: 8.0 2023-06-23 22:22:50,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten.whitening_limit, batch_count=1590822.0, ans=15.0 2023-06-23 22:22:52,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1590822.0, ans=0.2 2023-06-23 22:22:54,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1590822.0, ans=0.125 2023-06-23 22:23:26,831 INFO [train.py:996] (2/4) Epoch 9, batch 21200, loss[loss=0.1917, simple_loss=0.2555, pruned_loss=0.06389, over 21224.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2877, pruned_loss=0.07656, over 4253409.85 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:23:38,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1590942.0, ans=0.125 2023-06-23 22:23:52,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1591002.0, ans=0.125 2023-06-23 22:23:57,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1591002.0, ans=0.125 2023-06-23 22:24:26,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1591122.0, ans=0.125 2023-06-23 22:24:28,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1591122.0, ans=0.2 2023-06-23 22:24:29,352 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.749e+02 4.861e+02 6.796e+02 9.543e+02 2.010e+03, threshold=1.359e+03, percent-clipped=3.0 2023-06-23 22:25:05,762 INFO [train.py:996] (2/4) Epoch 9, batch 21250, loss[loss=0.2902, simple_loss=0.3629, pruned_loss=0.1088, over 21606.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2849, pruned_loss=0.07566, over 4257276.05 frames. ], batch size: 389, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:25:19,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1591242.0, ans=0.0 2023-06-23 22:26:12,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-23 22:26:41,372 INFO [train.py:996] (2/4) Epoch 9, batch 21300, loss[loss=0.2221, simple_loss=0.3062, pruned_loss=0.06896, over 21737.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2918, pruned_loss=0.07796, over 4271532.70 frames. ], batch size: 298, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:27:00,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1591542.0, ans=0.125 2023-06-23 22:27:02,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. 
limit=15.0 2023-06-23 22:27:08,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1591602.0, ans=0.125 2023-06-23 22:27:10,482 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:27:11,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1591602.0, ans=0.125 2023-06-23 22:27:18,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1591602.0, ans=0.125 2023-06-23 22:27:41,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1591662.0, ans=0.0 2023-06-23 22:27:49,240 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.208e+02 6.806e+02 9.811e+02 1.401e+03 3.569e+03, threshold=1.962e+03, percent-clipped=29.0 2023-06-23 22:27:53,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1591722.0, ans=0.125 2023-06-23 22:27:53,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1591722.0, ans=0.1 2023-06-23 22:28:12,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1591782.0, ans=0.0 2023-06-23 22:28:25,293 INFO [train.py:996] (2/4) Epoch 9, batch 21350, loss[loss=0.2411, simple_loss=0.3297, pruned_loss=0.07618, over 21355.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.2981, pruned_loss=0.07921, over 4276172.36 frames. ], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:28:34,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1591842.0, ans=0.125 2023-06-23 22:28:36,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.97 vs. limit=22.5 2023-06-23 22:29:07,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1591962.0, ans=0.0 2023-06-23 22:29:22,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-23 22:30:10,696 INFO [train.py:996] (2/4) Epoch 9, batch 21400, loss[loss=0.2846, simple_loss=0.3569, pruned_loss=0.1061, over 21466.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3004, pruned_loss=0.07874, over 4276234.23 frames. ], batch size: 131, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:30:44,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1592202.0, ans=0.0 2023-06-23 22:30:54,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1592262.0, ans=0.0 2023-06-23 22:31:04,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.46 vs. 
limit=10.0 2023-06-23 22:31:08,187 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.672e+02 5.270e+02 6.886e+02 1.009e+03 2.109e+03, threshold=1.377e+03, percent-clipped=2.0 2023-06-23 22:31:21,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1592322.0, ans=0.1 2023-06-23 22:31:41,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1592382.0, ans=0.015 2023-06-23 22:31:50,214 INFO [train.py:996] (2/4) Epoch 9, batch 21450, loss[loss=0.2403, simple_loss=0.3059, pruned_loss=0.08731, over 21303.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3053, pruned_loss=0.0814, over 4280021.73 frames. ], batch size: 176, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:32:26,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1592562.0, ans=0.0 2023-06-23 22:32:39,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1592562.0, ans=0.125 2023-06-23 22:33:27,854 INFO [train.py:996] (2/4) Epoch 9, batch 21500, loss[loss=0.2081, simple_loss=0.2722, pruned_loss=0.07203, over 21712.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3043, pruned_loss=0.08219, over 4279584.73 frames. ], batch size: 333, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:33:33,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.42 vs. limit=15.0 2023-06-23 22:33:35,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1592742.0, ans=0.0 2023-06-23 22:34:01,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1592802.0, ans=0.2 2023-06-23 22:34:08,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1592862.0, ans=0.125 2023-06-23 22:34:08,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1592862.0, ans=0.125 2023-06-23 22:34:29,426 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.705e+02 5.721e+02 7.470e+02 9.927e+02 1.833e+03, threshold=1.494e+03, percent-clipped=12.0 2023-06-23 22:34:32,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1592922.0, ans=0.125 2023-06-23 22:35:06,808 INFO [train.py:996] (2/4) Epoch 9, batch 21550, loss[loss=0.1821, simple_loss=0.2443, pruned_loss=0.05991, over 21212.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2974, pruned_loss=0.07915, over 4268069.82 frames. ], batch size: 548, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:35:09,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.45 vs. limit=15.0 2023-06-23 22:36:26,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.47 vs. limit=15.0 2023-06-23 22:36:35,629 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.68 vs. 
limit=15.0 2023-06-23 22:36:35,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.34 vs. limit=6.0 2023-06-23 22:36:37,167 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.14 vs. limit=10.0 2023-06-23 22:36:47,527 INFO [train.py:996] (2/4) Epoch 9, batch 21600, loss[loss=0.2026, simple_loss=0.264, pruned_loss=0.07059, over 21220.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.2924, pruned_loss=0.07694, over 4263323.86 frames. ], batch size: 549, lr: 3.25e-03, grad_scale: 32.0 2023-06-23 22:37:40,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1593462.0, ans=0.2 2023-06-23 22:37:41,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.78 vs. limit=22.5 2023-06-23 22:38:02,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.32 vs. limit=10.0 2023-06-23 22:38:04,081 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.659e+02 6.338e+02 9.920e+02 1.459e+03 3.157e+03, threshold=1.984e+03, percent-clipped=22.0 2023-06-23 22:38:21,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1593582.0, ans=0.125 2023-06-23 22:38:21,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1593582.0, ans=0.0 2023-06-23 22:38:28,837 INFO [train.py:996] (2/4) Epoch 9, batch 21650, loss[loss=0.2215, simple_loss=0.3142, pruned_loss=0.06435, over 21819.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2959, pruned_loss=0.07431, over 4258706.99 frames. ], batch size: 317, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:38:29,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1593642.0, ans=0.04949747468305833 2023-06-23 22:38:32,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1593642.0, ans=0.2 2023-06-23 22:39:55,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1593882.0, ans=0.125 2023-06-23 22:40:07,978 INFO [train.py:996] (2/4) Epoch 9, batch 21700, loss[loss=0.2086, simple_loss=0.2701, pruned_loss=0.07354, over 21324.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2961, pruned_loss=0.07322, over 4261857.16 frames. ], batch size: 144, lr: 3.25e-03, grad_scale: 16.0 2023-06-23 22:40:22,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1594002.0, ans=0.0 2023-06-23 22:40:38,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1594002.0, ans=0.125 2023-06-23 22:41:10,622 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.363e+02 6.189e+02 8.394e+02 1.254e+03 2.013e+03, threshold=1.679e+03, percent-clipped=1.0 2023-06-23 22:41:45,166 INFO [train.py:996] (2/4) Epoch 9, batch 21750, loss[loss=0.2238, simple_loss=0.2824, pruned_loss=0.08256, over 21851.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2922, pruned_loss=0.0734, over 4245151.38 frames. 
], batch size: 107, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:41:51,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-23 22:41:53,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1594242.0, ans=0.2 2023-06-23 22:42:00,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=22.5 2023-06-23 22:42:01,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1594302.0, ans=0.0 2023-06-23 22:42:57,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1594422.0, ans=0.125 2023-06-23 22:43:22,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1594482.0, ans=0.125 2023-06-23 22:43:25,516 INFO [train.py:996] (2/4) Epoch 9, batch 21800, loss[loss=0.1989, simple_loss=0.2649, pruned_loss=0.06643, over 21834.00 frames. ], tot_loss[loss=0.2201, simple_loss=0.2911, pruned_loss=0.07457, over 4241688.17 frames. ], batch size: 318, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:43:26,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.86 vs. limit=12.0 2023-06-23 22:43:27,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1594542.0, ans=0.125 2023-06-23 22:43:35,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1594542.0, ans=0.125 2023-06-23 22:43:39,682 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-23 22:43:52,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. limit=10.0 2023-06-23 22:44:34,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.869e+02 5.122e+02 6.776e+02 1.046e+03 2.535e+03, threshold=1.355e+03, percent-clipped=3.0 2023-06-23 22:45:05,001 INFO [train.py:996] (2/4) Epoch 9, batch 21850, loss[loss=0.2174, simple_loss=0.2816, pruned_loss=0.07661, over 19995.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2961, pruned_loss=0.07531, over 4250228.00 frames. 
], batch size: 702, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:45:06,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1594842.0, ans=0.125 2023-06-23 22:45:10,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1594842.0, ans=0.1 2023-06-23 22:45:46,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1594962.0, ans=0.125 2023-06-23 22:46:38,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1595082.0, ans=0.0 2023-06-23 22:46:42,694 INFO [train.py:996] (2/4) Epoch 9, batch 21900, loss[loss=0.2741, simple_loss=0.321, pruned_loss=0.1136, over 21416.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2981, pruned_loss=0.07712, over 4261782.91 frames. ], batch size: 471, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:46:45,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.88 vs. limit=12.0 2023-06-23 22:47:49,991 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.776e+02 5.560e+02 7.973e+02 1.226e+03 2.341e+03, threshold=1.595e+03, percent-clipped=19.0 2023-06-23 22:47:53,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1595322.0, ans=0.0 2023-06-23 22:48:03,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1595382.0, ans=0.07 2023-06-23 22:48:08,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1595382.0, ans=0.0 2023-06-23 22:48:20,424 INFO [train.py:996] (2/4) Epoch 9, batch 21950, loss[loss=0.1812, simple_loss=0.2674, pruned_loss=0.04747, over 21661.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2924, pruned_loss=0.07578, over 4270452.19 frames. ], batch size: 415, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:49:00,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1595562.0, ans=0.05 2023-06-23 22:49:00,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1595562.0, ans=0.0 2023-06-23 22:49:59,983 INFO [train.py:996] (2/4) Epoch 9, batch 22000, loss[loss=0.1841, simple_loss=0.2546, pruned_loss=0.05674, over 21588.00 frames. ], tot_loss[loss=0.2149, simple_loss=0.2856, pruned_loss=0.07216, over 4271925.83 frames. ], batch size: 263, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 22:50:44,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=15.0 2023-06-23 22:50:53,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1595862.0, ans=0.0 2023-06-23 22:51:14,071 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.248e+02 5.153e+02 7.605e+02 1.162e+03 2.837e+03, threshold=1.521e+03, percent-clipped=11.0 2023-06-23 22:51:20,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.27 vs. 
limit=15.0 2023-06-23 22:51:31,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1595982.0, ans=0.125 2023-06-23 22:51:34,367 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:51:40,191 INFO [train.py:996] (2/4) Epoch 9, batch 22050, loss[loss=0.2661, simple_loss=0.3508, pruned_loss=0.0907, over 21642.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2908, pruned_loss=0.07424, over 4263995.93 frames. ], batch size: 389, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 22:52:14,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.25 vs. limit=15.0 2023-06-23 22:52:22,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1596102.0, ans=0.125 2023-06-23 22:52:41,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1596222.0, ans=0.125 2023-06-23 22:53:19,862 INFO [train.py:996] (2/4) Epoch 9, batch 22100, loss[loss=0.2009, simple_loss=0.2643, pruned_loss=0.06881, over 16959.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3016, pruned_loss=0.07865, over 4247504.54 frames. ], batch size: 63, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:53:57,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0 2023-06-23 22:54:06,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1596462.0, ans=0.0 2023-06-23 22:54:34,346 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.056e+02 6.584e+02 8.540e+02 1.234e+03 2.755e+03, threshold=1.708e+03, percent-clipped=13.0 2023-06-23 22:54:44,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1596582.0, ans=0.0 2023-06-23 22:54:57,965 INFO [train.py:996] (2/4) Epoch 9, batch 22150, loss[loss=0.257, simple_loss=0.3216, pruned_loss=0.09616, over 21705.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3057, pruned_loss=0.08116, over 4259823.62 frames. ], batch size: 473, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 22:54:59,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1596642.0, ans=0.1 2023-06-23 22:55:28,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1596702.0, ans=0.125 2023-06-23 22:55:30,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1596702.0, ans=0.0 2023-06-23 22:55:33,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1596702.0, ans=0.1 2023-06-23 22:56:13,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1596822.0, ans=0.0 2023-06-23 22:56:16,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1596822.0, ans=0.04949747468305833 2023-06-23 22:56:37,887 INFO [train.py:996] (2/4) Epoch 9, batch 22200, loss[loss=0.2245, simple_loss=0.2985, pruned_loss=0.07523, over 21288.00 frames. 
], tot_loss[loss=0.2355, simple_loss=0.3068, pruned_loss=0.08211, over 4275962.38 frames. ], batch size: 159, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 22:56:46,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1596942.0, ans=0.0 2023-06-23 22:57:03,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.05 vs. limit=15.0 2023-06-23 22:57:30,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1597062.0, ans=0.0 2023-06-23 22:57:54,235 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.796e+02 5.424e+02 7.068e+02 9.828e+02 2.083e+03, threshold=1.414e+03, percent-clipped=7.0 2023-06-23 22:57:59,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1597182.0, ans=0.1 2023-06-23 22:58:13,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1597182.0, ans=0.125 2023-06-23 22:58:16,265 INFO [train.py:996] (2/4) Epoch 9, batch 22250, loss[loss=0.2605, simple_loss=0.3376, pruned_loss=0.09167, over 21770.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3163, pruned_loss=0.08392, over 4271330.64 frames. ], batch size: 247, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 22:59:06,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=22.5 2023-06-23 22:59:21,195 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 22:59:43,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1597482.0, ans=0.0 2023-06-23 22:59:54,567 INFO [train.py:996] (2/4) Epoch 9, batch 22300, loss[loss=0.2379, simple_loss=0.3039, pruned_loss=0.08588, over 21870.00 frames. ], tot_loss[loss=0.2449, simple_loss=0.3178, pruned_loss=0.08603, over 4280310.68 frames. ], batch size: 332, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 23:00:12,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1597542.0, ans=0.07 2023-06-23 23:00:34,244 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-23 23:01:10,701 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.760e+02 5.946e+02 8.093e+02 1.234e+03 3.372e+03, threshold=1.619e+03, percent-clipped=19.0 2023-06-23 23:01:21,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1597782.0, ans=0.2 2023-06-23 23:01:33,317 INFO [train.py:996] (2/4) Epoch 9, batch 22350, loss[loss=0.2534, simple_loss=0.3321, pruned_loss=0.08735, over 17239.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3161, pruned_loss=0.08692, over 4279094.41 frames. 
], batch size: 60, lr: 3.24e-03, grad_scale: 8.0 2023-06-23 23:02:08,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1597902.0, ans=0.0 2023-06-23 23:02:14,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1597902.0, ans=0.04949747468305833 2023-06-23 23:02:23,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1597962.0, ans=0.0 2023-06-23 23:03:22,837 INFO [train.py:996] (2/4) Epoch 9, batch 22400, loss[loss=0.205, simple_loss=0.2725, pruned_loss=0.06874, over 21307.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3124, pruned_loss=0.08387, over 4274673.87 frames. ], batch size: 608, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:03:54,738 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:04:21,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1598322.0, ans=0.0 2023-06-23 23:04:29,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.562e+02 4.983e+02 6.853e+02 9.645e+02 2.077e+03, threshold=1.371e+03, percent-clipped=3.0 2023-06-23 23:04:32,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1598322.0, ans=0.125 2023-06-23 23:04:52,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1598382.0, ans=0.125 2023-06-23 23:04:59,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1598442.0, ans=0.125 2023-06-23 23:05:00,739 INFO [train.py:996] (2/4) Epoch 9, batch 22450, loss[loss=0.242, simple_loss=0.2849, pruned_loss=0.09953, over 21317.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3071, pruned_loss=0.08324, over 4277716.84 frames. ], batch size: 507, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:05:39,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1598502.0, ans=0.125 2023-06-23 23:05:47,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1598562.0, ans=0.2 2023-06-23 23:06:00,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1598622.0, ans=0.0 2023-06-23 23:06:03,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1598622.0, ans=0.0 2023-06-23 23:06:05,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1598622.0, ans=0.2 2023-06-23 23:06:39,881 INFO [train.py:996] (2/4) Epoch 9, batch 22500, loss[loss=0.2778, simple_loss=0.3696, pruned_loss=0.09298, over 21564.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3021, pruned_loss=0.08193, over 4265832.40 frames. 
], batch size: 441, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:06:44,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1598742.0, ans=0.2 2023-06-23 23:07:35,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1598922.0, ans=0.0 2023-06-23 23:07:47,137 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 5.895e+02 7.835e+02 1.248e+03 2.629e+03, threshold=1.567e+03, percent-clipped=21.0 2023-06-23 23:08:18,957 INFO [train.py:996] (2/4) Epoch 9, batch 22550, loss[loss=0.2364, simple_loss=0.351, pruned_loss=0.06097, over 20719.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3061, pruned_loss=0.0817, over 4274243.52 frames. ], batch size: 607, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:08:46,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1599102.0, ans=0.0 2023-06-23 23:09:21,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1599162.0, ans=0.125 2023-06-23 23:10:06,610 INFO [train.py:996] (2/4) Epoch 9, batch 22600, loss[loss=0.2691, simple_loss=0.3577, pruned_loss=0.09029, over 21654.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.311, pruned_loss=0.08203, over 4271418.11 frames. ], batch size: 441, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:10:13,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1599342.0, ans=0.125 2023-06-23 23:10:42,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1599462.0, ans=0.2 2023-06-23 23:11:13,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.041e+02 6.856e+02 1.098e+03 1.547e+03 4.006e+03, threshold=2.196e+03, percent-clipped=25.0 2023-06-23 23:11:24,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1599582.0, ans=0.1 2023-06-23 23:11:45,425 INFO [train.py:996] (2/4) Epoch 9, batch 22650, loss[loss=0.2443, simple_loss=0.2967, pruned_loss=0.0959, over 21527.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3075, pruned_loss=0.08149, over 4275154.33 frames. ], batch size: 441, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:12:12,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1599702.0, ans=0.0 2023-06-23 23:12:17,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1599762.0, ans=0.0 2023-06-23 23:13:18,776 INFO [train.py:996] (2/4) Epoch 9, batch 22700, loss[loss=0.2486, simple_loss=0.3023, pruned_loss=0.09744, over 21416.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3006, pruned_loss=0.08004, over 4272840.38 frames. 
], batch size: 389, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:13:40,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1600002.0, ans=0.0 2023-06-23 23:13:45,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1600002.0, ans=0.04949747468305833 2023-06-23 23:14:08,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1600062.0, ans=0.125 2023-06-23 23:14:26,586 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.745e+02 5.778e+02 8.238e+02 1.243e+03 2.659e+03, threshold=1.648e+03, percent-clipped=2.0 2023-06-23 23:14:58,257 INFO [train.py:996] (2/4) Epoch 9, batch 22750, loss[loss=0.2203, simple_loss=0.2854, pruned_loss=0.07759, over 21145.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3007, pruned_loss=0.08171, over 4271148.79 frames. ], batch size: 143, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:15:13,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1600302.0, ans=0.0 2023-06-23 23:15:16,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-23 23:15:25,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1600302.0, ans=0.0 2023-06-23 23:15:48,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1600362.0, ans=0.1 2023-06-23 23:15:53,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1600362.0, ans=0.125 2023-06-23 23:16:05,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1600422.0, ans=0.0 2023-06-23 23:16:31,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1600482.0, ans=0.1 2023-06-23 23:16:37,195 INFO [train.py:996] (2/4) Epoch 9, batch 22800, loss[loss=0.2574, simple_loss=0.3152, pruned_loss=0.09983, over 21654.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.306, pruned_loss=0.08414, over 4273436.41 frames. 
], batch size: 263, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:16:37,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1600542.0, ans=0.125 2023-06-23 23:17:33,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1600722.0, ans=0.025 2023-06-23 23:17:38,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1600722.0, ans=0.125 2023-06-23 23:17:45,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.424e+02 5.839e+02 8.746e+02 1.348e+03 2.535e+03, threshold=1.749e+03, percent-clipped=13.0 2023-06-23 23:17:45,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1600722.0, ans=0.125 2023-06-23 23:17:47,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1600722.0, ans=0.04949747468305833 2023-06-23 23:18:10,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1600782.0, ans=0.125 2023-06-23 23:18:15,365 INFO [train.py:996] (2/4) Epoch 9, batch 22850, loss[loss=0.2387, simple_loss=0.296, pruned_loss=0.09071, over 15091.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3021, pruned_loss=0.08353, over 4269808.52 frames. ], batch size: 60, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:18:35,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1600902.0, ans=0.015 2023-06-23 23:18:46,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1600962.0, ans=0.015 2023-06-23 23:19:06,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1601022.0, ans=0.125 2023-06-23 23:19:35,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1601082.0, ans=0.125 2023-06-23 23:19:49,658 INFO [train.py:996] (2/4) Epoch 9, batch 22900, loss[loss=0.2278, simple_loss=0.3198, pruned_loss=0.06794, over 21391.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3023, pruned_loss=0.08296, over 4275748.34 frames. ], batch size: 211, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:20:00,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1601142.0, ans=10.0 2023-06-23 23:20:49,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1601322.0, ans=0.0 2023-06-23 23:20:50,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1601322.0, ans=0.2 2023-06-23 23:21:02,727 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.772e+02 7.597e+02 1.119e+03 1.552e+03 2.740e+03, threshold=2.237e+03, percent-clipped=15.0 2023-06-23 23:21:23,736 INFO [train.py:996] (2/4) Epoch 9, batch 22950, loss[loss=0.2487, simple_loss=0.3697, pruned_loss=0.06382, over 21795.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3139, pruned_loss=0.08083, over 4270602.86 frames. 
], batch size: 351, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:21:35,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1601442.0, ans=0.1 2023-06-23 23:21:38,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1601502.0, ans=0.125 2023-06-23 23:22:38,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1601622.0, ans=0.0 2023-06-23 23:23:02,606 INFO [train.py:996] (2/4) Epoch 9, batch 23000, loss[loss=0.225, simple_loss=0.2999, pruned_loss=0.07501, over 21905.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3123, pruned_loss=0.07871, over 4274258.59 frames. ], batch size: 371, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:23:09,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1601742.0, ans=0.125 2023-06-23 23:23:51,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1601862.0, ans=0.1 2023-06-23 23:23:57,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1601862.0, ans=0.125 2023-06-23 23:24:17,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.759e+02 5.434e+02 6.875e+02 9.682e+02 1.732e+03, threshold=1.375e+03, percent-clipped=0.0 2023-06-23 23:24:38,048 INFO [train.py:996] (2/4) Epoch 9, batch 23050, loss[loss=0.2657, simple_loss=0.3392, pruned_loss=0.09616, over 21478.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3132, pruned_loss=0.08056, over 4268517.50 frames. ], batch size: 548, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:24:56,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1602042.0, ans=10.0 2023-06-23 23:25:06,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1602102.0, ans=0.125 2023-06-23 23:25:19,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1602162.0, ans=0.125 2023-06-23 23:25:41,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-23 23:26:13,053 INFO [train.py:996] (2/4) Epoch 9, batch 23100, loss[loss=0.2023, simple_loss=0.2611, pruned_loss=0.07175, over 21156.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3095, pruned_loss=0.08116, over 4274976.19 frames. ], batch size: 159, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:26:51,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1602402.0, ans=0.1 2023-06-23 23:27:20,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.10 vs. 
limit=22.5 2023-06-23 23:27:25,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1602522.0, ans=0.125 2023-06-23 23:27:30,603 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.933e+02 6.265e+02 7.988e+02 9.890e+02 1.959e+03, threshold=1.598e+03, percent-clipped=10.0 2023-06-23 23:27:36,050 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-23 23:27:51,458 INFO [train.py:996] (2/4) Epoch 9, batch 23150, loss[loss=0.2125, simple_loss=0.2866, pruned_loss=0.06921, over 21803.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3041, pruned_loss=0.08063, over 4263688.77 frames. ], batch size: 298, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:28:04,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1602642.0, ans=0.0 2023-06-23 23:29:29,533 INFO [train.py:996] (2/4) Epoch 9, batch 23200, loss[loss=0.2183, simple_loss=0.2754, pruned_loss=0.0806, over 21689.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.304, pruned_loss=0.08138, over 4272032.57 frames. ], batch size: 230, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:29:43,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1602942.0, ans=0.0 2023-06-23 23:29:50,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1603002.0, ans=0.2 2023-06-23 23:30:46,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.636e+02 5.576e+02 6.977e+02 1.069e+03 2.508e+03, threshold=1.395e+03, percent-clipped=7.0 2023-06-23 23:31:07,206 INFO [train.py:996] (2/4) Epoch 9, batch 23250, loss[loss=0.2949, simple_loss=0.3385, pruned_loss=0.1256, over 21796.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3047, pruned_loss=0.08337, over 4284082.28 frames. ], batch size: 508, lr: 3.24e-03, grad_scale: 32.0 2023-06-23 23:31:11,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-23 23:31:30,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1603302.0, ans=0.125 2023-06-23 23:31:37,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1603302.0, ans=10.0 2023-06-23 23:32:00,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1603362.0, ans=0.1 2023-06-23 23:32:22,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.55 vs. limit=22.5 2023-06-23 23:32:52,329 INFO [train.py:996] (2/4) Epoch 9, batch 23300, loss[loss=0.3731, simple_loss=0.4524, pruned_loss=0.147, over 21452.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3124, pruned_loss=0.08526, over 4290126.79 frames. 
], batch size: 507, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:33:42,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_abs, batch_count=1603662.0, ans=0.5 2023-06-23 23:34:08,547 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.193e+02 5.673e+02 7.444e+02 1.083e+03 2.210e+03, threshold=1.489e+03, percent-clipped=13.0 2023-06-23 23:34:37,329 INFO [train.py:996] (2/4) Epoch 9, batch 23350, loss[loss=0.2235, simple_loss=0.3087, pruned_loss=0.06914, over 21761.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3163, pruned_loss=0.0837, over 4281683.96 frames. ], batch size: 332, lr: 3.24e-03, grad_scale: 16.0 2023-06-23 23:34:37,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1603842.0, ans=0.0 2023-06-23 23:35:26,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1603962.0, ans=0.0 2023-06-23 23:35:36,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1604022.0, ans=0.125 2023-06-23 23:36:14,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1604142.0, ans=0.125 2023-06-23 23:36:15,578 INFO [train.py:996] (2/4) Epoch 9, batch 23400, loss[loss=0.2316, simple_loss=0.2938, pruned_loss=0.08468, over 21462.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3097, pruned_loss=0.07982, over 4276294.34 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:36:24,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1604142.0, ans=0.0 2023-06-23 23:36:24,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1604142.0, ans=0.0 2023-06-23 23:37:32,083 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.577e+02 6.077e+02 8.548e+02 1.177e+03 1.985e+03, threshold=1.710e+03, percent-clipped=13.0 2023-06-23 23:37:55,550 INFO [train.py:996] (2/4) Epoch 9, batch 23450, loss[loss=0.2461, simple_loss=0.3132, pruned_loss=0.08948, over 21967.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3105, pruned_loss=0.08175, over 4269937.64 frames. ], batch size: 316, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:37:55,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1604442.0, ans=0.125 2023-06-23 23:38:18,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1604442.0, ans=0.2 2023-06-23 23:38:28,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1604502.0, ans=0.1 2023-06-23 23:38:31,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1604502.0, ans=0.1 2023-06-23 23:38:39,126 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.77 vs. 
limit=15.0 2023-06-23 23:38:43,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1604562.0, ans=0.125 2023-06-23 23:39:21,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1604682.0, ans=0.1 2023-06-23 23:39:31,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1604742.0, ans=0.125 2023-06-23 23:39:33,085 INFO [train.py:996] (2/4) Epoch 9, batch 23500, loss[loss=0.2334, simple_loss=0.2926, pruned_loss=0.08707, over 21457.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3108, pruned_loss=0.08389, over 4271153.89 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:40:08,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1604802.0, ans=0.2 2023-06-23 23:40:08,737 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-23 23:40:22,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.18 vs. limit=15.0 2023-06-23 23:40:48,966 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.149e+02 5.607e+02 7.001e+02 9.691e+02 1.810e+03, threshold=1.400e+03, percent-clipped=1.0 2023-06-23 23:41:11,079 INFO [train.py:996] (2/4) Epoch 9, batch 23550, loss[loss=0.2121, simple_loss=0.2713, pruned_loss=0.07646, over 21686.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3067, pruned_loss=0.08395, over 4262754.69 frames. ], batch size: 316, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:41:12,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.23 vs. limit=10.0 2023-06-23 23:41:47,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1605102.0, ans=0.125 2023-06-23 23:41:48,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-23 23:42:54,885 INFO [train.py:996] (2/4) Epoch 9, batch 23600, loss[loss=0.2324, simple_loss=0.3123, pruned_loss=0.07624, over 21873.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3073, pruned_loss=0.08409, over 4260542.91 frames. ], batch size: 316, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:43:52,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1605462.0, ans=0.0 2023-06-23 23:44:15,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.063e+02 5.548e+02 8.580e+02 1.181e+03 2.336e+03, threshold=1.716e+03, percent-clipped=15.0 2023-06-23 23:44:43,184 INFO [train.py:996] (2/4) Epoch 9, batch 23650, loss[loss=0.205, simple_loss=0.2831, pruned_loss=0.06343, over 21328.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3063, pruned_loss=0.08201, over 4262273.43 frames. 
], batch size: 159, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:44:43,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1605642.0, ans=0.09899494936611666 2023-06-23 23:44:45,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1605642.0, ans=0.125 2023-06-23 23:44:45,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1605642.0, ans=0.125 2023-06-23 23:45:41,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1605822.0, ans=0.125 2023-06-23 23:46:23,144 INFO [train.py:996] (2/4) Epoch 9, batch 23700, loss[loss=0.2777, simple_loss=0.3458, pruned_loss=0.1048, over 21741.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3091, pruned_loss=0.08171, over 4265724.79 frames. ], batch size: 441, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:46:23,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1605942.0, ans=0.125 2023-06-23 23:46:36,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1605942.0, ans=0.0 2023-06-23 23:46:38,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1606002.0, ans=0.125 2023-06-23 23:47:20,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1606062.0, ans=0.0 2023-06-23 23:47:49,314 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.090e+02 6.365e+02 8.254e+02 1.222e+03 2.661e+03, threshold=1.651e+03, percent-clipped=9.0 2023-06-23 23:47:57,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1606182.0, ans=0.0 2023-06-23 23:48:05,076 INFO [train.py:996] (2/4) Epoch 9, batch 23750, loss[loss=0.231, simple_loss=0.3284, pruned_loss=0.06676, over 21737.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3123, pruned_loss=0.08177, over 4268465.71 frames. ], batch size: 351, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:48:07,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1606242.0, ans=0.0 2023-06-23 23:48:26,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1606302.0, ans=0.0 2023-06-23 23:48:45,348 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=15.0 2023-06-23 23:49:03,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1606362.0, ans=0.125 2023-06-23 23:49:26,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1606422.0, ans=0.125 2023-06-23 23:49:45,908 INFO [train.py:996] (2/4) Epoch 9, batch 23800, loss[loss=0.2429, simple_loss=0.3404, pruned_loss=0.07271, over 21805.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3104, pruned_loss=0.07908, over 4268224.60 frames. 
], batch size: 282, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:51:00,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1606722.0, ans=0.0 2023-06-23 23:51:11,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.911e+02 6.511e+02 9.622e+02 1.496e+03 3.900e+03, threshold=1.924e+03, percent-clipped=16.0 2023-06-23 23:51:33,176 INFO [train.py:996] (2/4) Epoch 9, batch 23850, loss[loss=0.301, simple_loss=0.4083, pruned_loss=0.0969, over 19727.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3185, pruned_loss=0.08095, over 4269818.23 frames. ], batch size: 702, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:52:26,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1606962.0, ans=0.125 2023-06-23 23:52:28,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.whiten.whitening_limit, batch_count=1606962.0, ans=12.0 2023-06-23 23:52:36,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1607022.0, ans=0.04949747468305833 2023-06-23 23:52:49,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1607022.0, ans=22.5 2023-06-23 23:53:00,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1607082.0, ans=0.07 2023-06-23 23:53:01,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.88 vs. limit=15.0 2023-06-23 23:53:01,433 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. limit=15.0 2023-06-23 23:53:17,358 INFO [train.py:996] (2/4) Epoch 9, batch 23900, loss[loss=0.3129, simple_loss=0.4117, pruned_loss=0.1071, over 21616.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3255, pruned_loss=0.08387, over 4263832.59 frames. ], batch size: 414, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:53:28,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1607142.0, ans=0.125 2023-06-23 23:53:30,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1607142.0, ans=0.0 2023-06-23 23:53:53,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1607202.0, ans=0.125 2023-06-23 23:54:13,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1607322.0, ans=0.0 2023-06-23 23:54:30,535 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.004e+02 6.120e+02 8.437e+02 1.170e+03 2.663e+03, threshold=1.687e+03, percent-clipped=5.0 2023-06-23 23:54:48,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1607382.0, ans=0.125 2023-06-23 23:54:56,263 INFO [train.py:996] (2/4) Epoch 9, batch 23950, loss[loss=0.2476, simple_loss=0.3114, pruned_loss=0.09193, over 21734.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3194, pruned_loss=0.08312, over 4261505.58 frames. 
], batch size: 282, lr: 3.23e-03, grad_scale: 8.0 2023-06-23 23:55:43,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=11.17 vs. limit=15.0 2023-06-23 23:55:59,346 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-23 23:56:26,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1607682.0, ans=0.125 2023-06-23 23:56:40,014 INFO [train.py:996] (2/4) Epoch 9, batch 24000, loss[loss=0.2948, simple_loss=0.3695, pruned_loss=0.1101, over 21457.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3208, pruned_loss=0.0857, over 4263443.08 frames. ], batch size: 131, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:56:40,014 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-23 23:57:00,116 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2698, simple_loss=0.3635, pruned_loss=0.08806, over 1796401.00 frames. 2023-06-23 23:57:00,117 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-23 23:57:35,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1607862.0, ans=0.125 2023-06-23 23:58:00,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1607922.0, ans=0.025 2023-06-23 23:58:12,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1607922.0, ans=0.125 2023-06-23 23:58:19,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 5.711e+02 7.408e+02 1.023e+03 1.952e+03, threshold=1.482e+03, percent-clipped=3.0 2023-06-23 23:58:41,979 INFO [train.py:996] (2/4) Epoch 9, batch 24050, loss[loss=0.2062, simple_loss=0.2995, pruned_loss=0.05642, over 21824.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3222, pruned_loss=0.0862, over 4265225.69 frames. ], batch size: 282, lr: 3.23e-03, grad_scale: 16.0 2023-06-23 23:59:40,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1608222.0, ans=0.1 2023-06-23 23:59:43,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1608222.0, ans=0.0 2023-06-24 00:00:21,937 INFO [train.py:996] (2/4) Epoch 9, batch 24100, loss[loss=0.2902, simple_loss=0.3716, pruned_loss=0.1044, over 21574.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3223, pruned_loss=0.0843, over 4263187.33 frames. ], batch size: 414, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:00:24,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608342.0, ans=0.1 2023-06-24 00:00:37,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1608402.0, ans=0.0 2023-06-24 00:00:53,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.69 vs. 
limit=22.5 2023-06-24 00:01:07,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1608462.0, ans=0.1 2023-06-24 00:01:13,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1608462.0, ans=0.1 2023-06-24 00:01:39,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.035e+02 6.221e+02 8.690e+02 1.208e+03 2.210e+03, threshold=1.738e+03, percent-clipped=15.0 2023-06-24 00:02:00,856 INFO [train.py:996] (2/4) Epoch 9, batch 24150, loss[loss=0.1984, simple_loss=0.265, pruned_loss=0.06596, over 21169.00 frames. ], tot_loss[loss=0.2463, simple_loss=0.3213, pruned_loss=0.08568, over 4266129.15 frames. ], batch size: 608, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:02:22,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1608702.0, ans=0.1 2023-06-24 00:03:35,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1608882.0, ans=0.95 2023-06-24 00:03:41,742 INFO [train.py:996] (2/4) Epoch 9, batch 24200, loss[loss=0.2496, simple_loss=0.317, pruned_loss=0.09115, over 21437.00 frames. ], tot_loss[loss=0.2506, simple_loss=0.3249, pruned_loss=0.08808, over 4272010.65 frames. ], batch size: 195, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:03:56,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1609002.0, ans=0.125 2023-06-24 00:04:27,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1609062.0, ans=0.125 2023-06-24 00:04:29,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_na.min_abs, batch_count=1609062.0, ans=0.02 2023-06-24 00:04:33,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.64 vs. limit=22.5 2023-06-24 00:04:50,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.93 vs. limit=22.5 2023-06-24 00:05:07,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.768e+02 7.380e+02 9.978e+02 1.387e+03 2.651e+03, threshold=1.996e+03, percent-clipped=13.0 2023-06-24 00:05:22,617 INFO [train.py:996] (2/4) Epoch 9, batch 24250, loss[loss=0.1878, simple_loss=0.2837, pruned_loss=0.04593, over 21840.00 frames. ], tot_loss[loss=0.2419, simple_loss=0.3215, pruned_loss=0.08113, over 4265351.58 frames. ], batch size: 316, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:06:34,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1609422.0, ans=0.125 2023-06-24 00:06:58,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1609482.0, ans=0.95 2023-06-24 00:07:02,861 INFO [train.py:996] (2/4) Epoch 9, batch 24300, loss[loss=0.1972, simple_loss=0.2835, pruned_loss=0.05543, over 21647.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3144, pruned_loss=0.0761, over 4269271.73 frames. 
], batch size: 389, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:07:34,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-24 00:08:25,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1609782.0, ans=0.125 2023-06-24 00:08:28,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.714e+02 6.136e+02 8.324e+02 1.263e+03 3.161e+03, threshold=1.665e+03, percent-clipped=12.0 2023-06-24 00:08:43,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1609782.0, ans=0.125 2023-06-24 00:08:52,373 INFO [train.py:996] (2/4) Epoch 9, batch 24350, loss[loss=0.2336, simple_loss=0.2989, pruned_loss=0.08419, over 21475.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3114, pruned_loss=0.07696, over 4272194.40 frames. ], batch size: 194, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:09:04,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1609842.0, ans=0.125 2023-06-24 00:09:12,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1609902.0, ans=0.2 2023-06-24 00:09:33,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.95 vs. limit=15.0 2023-06-24 00:10:34,908 INFO [train.py:996] (2/4) Epoch 9, batch 24400, loss[loss=0.235, simple_loss=0.3195, pruned_loss=0.07526, over 21612.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3156, pruned_loss=0.08071, over 4273795.51 frames. ], batch size: 230, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:11:56,499 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.237e+02 5.410e+02 6.876e+02 9.194e+02 2.686e+03, threshold=1.375e+03, percent-clipped=10.0 2023-06-24 00:11:56,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1610382.0, ans=0.05 2023-06-24 00:12:11,142 INFO [train.py:996] (2/4) Epoch 9, batch 24450, loss[loss=0.1839, simple_loss=0.258, pruned_loss=0.05493, over 16326.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.318, pruned_loss=0.08214, over 4266699.93 frames. ], batch size: 63, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:12:16,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1610442.0, ans=0.125 2023-06-24 00:12:37,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1610502.0, ans=0.125 2023-06-24 00:12:45,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1610502.0, ans=0.0 2023-06-24 00:12:51,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1610562.0, ans=0.125 2023-06-24 00:13:19,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1610622.0, ans=0.125 2023-06-24 00:13:51,171 INFO [train.py:996] (2/4) Epoch 9, batch 24500, loss[loss=0.2162, simple_loss=0.2884, pruned_loss=0.07199, over 21642.00 frames. 
], tot_loss[loss=0.2397, simple_loss=0.3169, pruned_loss=0.08125, over 4272154.66 frames. ], batch size: 263, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:14:31,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1610862.0, ans=0.125 2023-06-24 00:14:39,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1610862.0, ans=0.125 2023-06-24 00:14:42,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1610862.0, ans=0.125 2023-06-24 00:15:15,964 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.883e+02 4.941e+02 6.307e+02 8.711e+02 3.165e+03, threshold=1.261e+03, percent-clipped=6.0 2023-06-24 00:15:29,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.28 vs. limit=15.0 2023-06-24 00:15:35,234 INFO [train.py:996] (2/4) Epoch 9, batch 24550, loss[loss=0.305, simple_loss=0.3605, pruned_loss=0.1248, over 21449.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3177, pruned_loss=0.08234, over 4274191.82 frames. ], batch size: 471, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:16:30,444 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-24 00:16:37,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1611222.0, ans=0.125 2023-06-24 00:17:13,013 INFO [train.py:996] (2/4) Epoch 9, batch 24600, loss[loss=0.1845, simple_loss=0.2471, pruned_loss=0.06095, over 21783.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3145, pruned_loss=0.08371, over 4267614.98 frames. ], batch size: 112, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:17:50,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1611462.0, ans=0.0 2023-06-24 00:17:53,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1611462.0, ans=0.125 2023-06-24 00:18:09,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1611522.0, ans=0.125 2023-06-24 00:18:33,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.043e+02 5.417e+02 8.330e+02 1.065e+03 1.781e+03, threshold=1.666e+03, percent-clipped=13.0 2023-06-24 00:18:52,526 INFO [train.py:996] (2/4) Epoch 9, batch 24650, loss[loss=0.1853, simple_loss=0.2467, pruned_loss=0.06192, over 21594.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3059, pruned_loss=0.08203, over 4269764.71 frames. ], batch size: 231, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:20:32,018 INFO [train.py:996] (2/4) Epoch 9, batch 24700, loss[loss=0.2381, simple_loss=0.2948, pruned_loss=0.09068, over 21369.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3062, pruned_loss=0.08019, over 4272919.46 frames. ], batch size: 473, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:20:33,124 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. 
limit=6.0 2023-06-24 00:20:38,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1611942.0, ans=0.2 2023-06-24 00:21:01,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1612002.0, ans=0.125 2023-06-24 00:21:32,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1612122.0, ans=0.0 2023-06-24 00:21:53,235 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.694e+02 5.774e+02 7.890e+02 1.274e+03 2.911e+03, threshold=1.578e+03, percent-clipped=10.0 2023-06-24 00:22:10,831 INFO [train.py:996] (2/4) Epoch 9, batch 24750, loss[loss=0.1937, simple_loss=0.2685, pruned_loss=0.05951, over 21671.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.2996, pruned_loss=0.0779, over 4272456.51 frames. ], batch size: 298, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:22:30,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1612302.0, ans=0.1 2023-06-24 00:22:41,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1612302.0, ans=0.04949747468305833 2023-06-24 00:23:03,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1612362.0, ans=0.125 2023-06-24 00:23:14,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1612422.0, ans=0.0 2023-06-24 00:23:39,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1612482.0, ans=0.04949747468305833 2023-06-24 00:23:44,078 INFO [train.py:996] (2/4) Epoch 9, batch 24800, loss[loss=0.2664, simple_loss=0.3213, pruned_loss=0.1058, over 21565.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.296, pruned_loss=0.07776, over 4277446.33 frames. ], batch size: 471, lr: 3.23e-03, grad_scale: 16.0 2023-06-24 00:23:51,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1612542.0, ans=0.125 2023-06-24 00:24:05,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1612602.0, ans=0.125 2023-06-24 00:24:10,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1612602.0, ans=0.125 2023-06-24 00:24:28,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1612662.0, ans=0.125 2023-06-24 00:24:41,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.90 vs. 
limit=15.0 2023-06-24 00:24:47,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1612722.0, ans=0.125 2023-06-24 00:25:07,472 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.818e+02 6.000e+02 9.294e+02 1.511e+03 3.142e+03, threshold=1.859e+03, percent-clipped=19.0 2023-06-24 00:25:07,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1612782.0, ans=0.1 2023-06-24 00:25:22,843 INFO [train.py:996] (2/4) Epoch 9, batch 24850, loss[loss=0.1889, simple_loss=0.2518, pruned_loss=0.06299, over 21361.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2967, pruned_loss=0.07907, over 4282367.51 frames. ], batch size: 176, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:25:37,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1612842.0, ans=0.125 2023-06-24 00:26:05,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1612962.0, ans=0.0 2023-06-24 00:26:47,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1613082.0, ans=0.0 2023-06-24 00:27:06,910 INFO [train.py:996] (2/4) Epoch 9, batch 24900, loss[loss=0.326, simple_loss=0.3851, pruned_loss=0.1334, over 21410.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3011, pruned_loss=0.08078, over 4279748.32 frames. ], batch size: 471, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:27:12,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-24 00:28:20,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1613322.0, ans=0.07 2023-06-24 00:28:32,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1613382.0, ans=0.1 2023-06-24 00:28:36,554 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.759e+02 5.969e+02 8.739e+02 1.291e+03 2.372e+03, threshold=1.748e+03, percent-clipped=6.0 2023-06-24 00:28:47,993 INFO [train.py:996] (2/4) Epoch 9, batch 24950, loss[loss=0.2422, simple_loss=0.3157, pruned_loss=0.08433, over 21816.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3081, pruned_loss=0.08442, over 4280190.83 frames. ], batch size: 247, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:28:59,254 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.71 vs. limit=15.0 2023-06-24 00:29:32,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-06-24 00:29:52,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1613622.0, ans=0.2 2023-06-24 00:30:11,656 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-24 00:30:19,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.21 vs. 
limit=12.0 2023-06-24 00:30:29,791 INFO [train.py:996] (2/4) Epoch 9, batch 25000, loss[loss=0.216, simple_loss=0.2872, pruned_loss=0.07243, over 22000.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3144, pruned_loss=0.08597, over 4283775.50 frames. ], batch size: 103, lr: 3.23e-03, grad_scale: 8.0 2023-06-24 00:30:42,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1613742.0, ans=0.1 2023-06-24 00:30:56,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1613802.0, ans=0.0 2023-06-24 00:31:00,293 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:31:25,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1613862.0, ans=0.2 2023-06-24 00:31:53,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1613982.0, ans=0.125 2023-06-24 00:31:57,963 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.997e+02 6.297e+02 8.541e+02 1.164e+03 2.225e+03, threshold=1.708e+03, percent-clipped=6.0 2023-06-24 00:32:08,772 INFO [train.py:996] (2/4) Epoch 9, batch 25050, loss[loss=0.2144, simple_loss=0.2721, pruned_loss=0.07836, over 21580.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.307, pruned_loss=0.08406, over 4277699.98 frames. ], batch size: 213, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:32:09,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1614042.0, ans=0.2 2023-06-24 00:32:45,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1614102.0, ans=0.125 2023-06-24 00:32:48,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1614102.0, ans=0.1 2023-06-24 00:33:20,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.42 vs. limit=6.0 2023-06-24 00:33:32,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-24 00:33:50,137 INFO [train.py:996] (2/4) Epoch 9, batch 25100, loss[loss=0.2898, simple_loss=0.3553, pruned_loss=0.1122, over 21448.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3022, pruned_loss=0.08321, over 4281166.70 frames. ], batch size: 508, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:34:42,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=17.55 vs. 
limit=15.0 2023-06-24 00:34:49,821 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:34:52,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1614522.0, ans=0.0 2023-06-24 00:35:16,805 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.018e+02 5.962e+02 8.437e+02 1.206e+03 2.426e+03, threshold=1.687e+03, percent-clipped=3.0 2023-06-24 00:35:27,817 INFO [train.py:996] (2/4) Epoch 9, batch 25150, loss[loss=0.2098, simple_loss=0.2996, pruned_loss=0.05996, over 21782.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3045, pruned_loss=0.08072, over 4264494.73 frames. ], batch size: 298, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:35:36,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=22.5 2023-06-24 00:35:42,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1614702.0, ans=0.2 2023-06-24 00:35:52,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1614702.0, ans=0.125 2023-06-24 00:36:17,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1614762.0, ans=0.125 2023-06-24 00:36:26,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1614822.0, ans=0.0 2023-06-24 00:36:29,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1614822.0, ans=0.125 2023-06-24 00:36:29,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1614822.0, ans=0.1 2023-06-24 00:36:52,221 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.90 vs. limit=15.0 2023-06-24 00:36:57,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1614882.0, ans=0.125 2023-06-24 00:37:05,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1614882.0, ans=0.125 2023-06-24 00:37:08,150 INFO [train.py:996] (2/4) Epoch 9, batch 25200, loss[loss=0.2144, simple_loss=0.3097, pruned_loss=0.05955, over 21745.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3041, pruned_loss=0.07893, over 4254901.62 frames. 
], batch size: 351, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:37:12,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1614942.0, ans=0.1 2023-06-24 00:37:21,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_abs, batch_count=1614942.0, ans=0.5 2023-06-24 00:37:22,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1615002.0, ans=0.125 2023-06-24 00:37:24,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1615002.0, ans=0.1 2023-06-24 00:37:25,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1615002.0, ans=0.125 2023-06-24 00:37:46,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1615002.0, ans=0.125 2023-06-24 00:37:50,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.21 vs. limit=15.0 2023-06-24 00:38:04,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-24 00:38:14,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1615122.0, ans=0.125 2023-06-24 00:38:35,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.592e+02 5.246e+02 7.409e+02 1.372e+03 3.913e+03, threshold=1.482e+03, percent-clipped=20.0 2023-06-24 00:38:37,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1615182.0, ans=0.125 2023-06-24 00:38:43,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1615182.0, ans=0.125 2023-06-24 00:38:46,395 INFO [train.py:996] (2/4) Epoch 9, batch 25250, loss[loss=0.2573, simple_loss=0.314, pruned_loss=0.1003, over 22010.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3032, pruned_loss=0.07785, over 4255143.69 frames. ], batch size: 103, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:39:08,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1615302.0, ans=0.0 2023-06-24 00:39:10,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1615302.0, ans=0.125 2023-06-24 00:39:12,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.85 vs. 
limit=22.5 2023-06-24 00:39:23,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1615362.0, ans=0.125 2023-06-24 00:40:14,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1615482.0, ans=0.125 2023-06-24 00:40:14,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1615482.0, ans=0.0 2023-06-24 00:40:24,959 INFO [train.py:996] (2/4) Epoch 9, batch 25300, loss[loss=0.2481, simple_loss=0.3241, pruned_loss=0.08608, over 21724.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.2997, pruned_loss=0.07691, over 4257438.17 frames. ], batch size: 351, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:40:28,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1615542.0, ans=0.125 2023-06-24 00:40:51,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1615602.0, ans=0.125 2023-06-24 00:41:55,182 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.408e+02 6.402e+02 8.209e+02 1.215e+03 2.497e+03, threshold=1.642e+03, percent-clipped=20.0 2023-06-24 00:41:58,938 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:42:04,754 INFO [train.py:996] (2/4) Epoch 9, batch 25350, loss[loss=0.2289, simple_loss=0.3299, pruned_loss=0.06392, over 21301.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.303, pruned_loss=0.0769, over 4261028.72 frames. ], batch size: 549, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:42:22,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1615902.0, ans=0.07 2023-06-24 00:42:28,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1615902.0, ans=0.95 2023-06-24 00:42:46,648 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.99 vs. limit=15.0 2023-06-24 00:43:11,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1616022.0, ans=0.125 2023-06-24 00:43:14,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1616022.0, ans=0.07 2023-06-24 00:43:44,777 INFO [train.py:996] (2/4) Epoch 9, batch 25400, loss[loss=0.2361, simple_loss=0.2949, pruned_loss=0.0887, over 21861.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2995, pruned_loss=0.0762, over 4255110.61 frames. 
], batch size: 107, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:43:46,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1616142.0, ans=0.125 2023-06-24 00:44:11,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1616202.0, ans=0.125 2023-06-24 00:44:37,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1616262.0, ans=0.125 2023-06-24 00:44:38,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1616262.0, ans=0.125 2023-06-24 00:44:40,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1616262.0, ans=0.125 2023-06-24 00:45:17,975 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.471e+02 5.867e+02 9.461e+02 1.414e+03 2.497e+03, threshold=1.892e+03, percent-clipped=14.0 2023-06-24 00:45:26,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1616442.0, ans=0.1 2023-06-24 00:45:27,888 INFO [train.py:996] (2/4) Epoch 9, batch 25450, loss[loss=0.2562, simple_loss=0.3371, pruned_loss=0.08763, over 21698.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2996, pruned_loss=0.07758, over 4258808.63 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:46:16,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1616562.0, ans=0.125 2023-06-24 00:46:44,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-24 00:46:46,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1616682.0, ans=0.0 2023-06-24 00:47:04,815 INFO [train.py:996] (2/4) Epoch 9, batch 25500, loss[loss=0.2337, simple_loss=0.3078, pruned_loss=0.07984, over 21315.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2997, pruned_loss=0.07418, over 4266615.03 frames. ], batch size: 159, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:47:10,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1616742.0, ans=0.125 2023-06-24 00:48:15,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1616922.0, ans=0.0 2023-06-24 00:48:39,568 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.450e+02 4.898e+02 7.192e+02 1.024e+03 1.607e+03, threshold=1.438e+03, percent-clipped=0.0 2023-06-24 00:48:49,463 INFO [train.py:996] (2/4) Epoch 9, batch 25550, loss[loss=0.2225, simple_loss=0.3239, pruned_loss=0.06049, over 21764.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3061, pruned_loss=0.07397, over 4271212.30 frames. 
], batch size: 332, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 00:49:46,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1617162.0, ans=0.0 2023-06-24 00:49:49,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1617162.0, ans=0.0 2023-06-24 00:49:51,030 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 00:50:11,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1617222.0, ans=15.0 2023-06-24 00:50:34,394 INFO [train.py:996] (2/4) Epoch 9, batch 25600, loss[loss=0.2547, simple_loss=0.3273, pruned_loss=0.09101, over 21648.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3114, pruned_loss=0.07563, over 4272511.12 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:51:06,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-24 00:51:59,675 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.613e+02 7.503e+02 1.087e+03 1.475e+03 2.223e+03, threshold=2.175e+03, percent-clipped=27.0 2023-06-24 00:52:13,778 INFO [train.py:996] (2/4) Epoch 9, batch 25650, loss[loss=0.235, simple_loss=0.2907, pruned_loss=0.08962, over 15607.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3108, pruned_loss=0.07813, over 4261317.22 frames. ], batch size: 60, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:52:55,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1617702.0, ans=0.125 2023-06-24 00:53:17,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1617822.0, ans=0.1 2023-06-24 00:53:31,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1617882.0, ans=0.125 2023-06-24 00:53:54,158 INFO [train.py:996] (2/4) Epoch 9, batch 25700, loss[loss=0.2107, simple_loss=0.2855, pruned_loss=0.06791, over 21589.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3091, pruned_loss=0.07984, over 4259989.91 frames. ], batch size: 263, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:54:08,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1617942.0, ans=0.1 2023-06-24 00:54:32,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1618002.0, ans=0.1 2023-06-24 00:54:36,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1618002.0, ans=0.04949747468305833 2023-06-24 00:54:39,943 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.59 vs. 
limit=12.0 2023-06-24 00:54:45,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1618062.0, ans=0.0 2023-06-24 00:55:24,783 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.450e+02 6.568e+02 8.876e+02 1.245e+03 3.057e+03, threshold=1.775e+03, percent-clipped=5.0 2023-06-24 00:55:39,597 INFO [train.py:996] (2/4) Epoch 9, batch 25750, loss[loss=0.2964, simple_loss=0.3749, pruned_loss=0.1089, over 21777.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3144, pruned_loss=0.08335, over 4269086.15 frames. ], batch size: 247, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:55:50,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.97 vs. limit=10.0 2023-06-24 00:56:05,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1618302.0, ans=0.0 2023-06-24 00:56:17,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1618302.0, ans=0.125 2023-06-24 00:56:22,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-24 00:56:30,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1618362.0, ans=0.125 2023-06-24 00:57:06,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1618482.0, ans=0.0 2023-06-24 00:57:31,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1618542.0, ans=0.07 2023-06-24 00:57:33,088 INFO [train.py:996] (2/4) Epoch 9, batch 25800, loss[loss=0.2956, simple_loss=0.3801, pruned_loss=0.1056, over 21389.00 frames. ], tot_loss[loss=0.2505, simple_loss=0.3267, pruned_loss=0.0872, over 4273862.19 frames. ], batch size: 131, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:58:13,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1618662.0, ans=0.125 2023-06-24 00:58:14,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1618662.0, ans=0.5 2023-06-24 00:58:16,605 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=15.0 2023-06-24 00:58:18,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1618662.0, ans=0.0 2023-06-24 00:59:05,227 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.745e+02 6.884e+02 9.421e+02 1.458e+03 3.090e+03, threshold=1.884e+03, percent-clipped=11.0 2023-06-24 00:59:14,587 INFO [train.py:996] (2/4) Epoch 9, batch 25850, loss[loss=0.2715, simple_loss=0.3434, pruned_loss=0.09983, over 21757.00 frames. ], tot_loss[loss=0.25, simple_loss=0.3277, pruned_loss=0.08617, over 4280050.11 frames. 
], batch size: 112, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 00:59:31,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1618902.0, ans=0.025 2023-06-24 00:59:34,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-24 01:00:02,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1618962.0, ans=0.0 2023-06-24 01:00:18,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1619022.0, ans=0.0 2023-06-24 01:00:37,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1619022.0, ans=0.0 2023-06-24 01:00:38,670 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:00:56,716 INFO [train.py:996] (2/4) Epoch 9, batch 25900, loss[loss=0.2909, simple_loss=0.3863, pruned_loss=0.0978, over 21717.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3285, pruned_loss=0.08713, over 4279654.89 frames. ], batch size: 389, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:01:35,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1619202.0, ans=0.0 2023-06-24 01:02:23,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1619382.0, ans=0.0 2023-06-24 01:02:24,004 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.44 vs. limit=12.0 2023-06-24 01:02:27,424 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.264e+02 6.970e+02 9.529e+02 1.427e+03 2.797e+03, threshold=1.906e+03, percent-clipped=4.0 2023-06-24 01:02:30,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1619382.0, ans=0.2 2023-06-24 01:02:37,404 INFO [train.py:996] (2/4) Epoch 9, batch 25950, loss[loss=0.2752, simple_loss=0.3466, pruned_loss=0.1019, over 21331.00 frames. ], tot_loss[loss=0.2562, simple_loss=0.3342, pruned_loss=0.08912, over 4275338.65 frames. ], batch size: 549, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:04:07,828 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=15.0 2023-06-24 01:04:08,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1619682.0, ans=0.125 2023-06-24 01:04:21,699 INFO [train.py:996] (2/4) Epoch 9, batch 26000, loss[loss=0.2121, simple_loss=0.3026, pruned_loss=0.06074, over 21386.00 frames. ], tot_loss[loss=0.2534, simple_loss=0.3327, pruned_loss=0.08705, over 4274196.34 frames. 
], batch size: 211, lr: 3.22e-03, grad_scale: 32.0 2023-06-24 01:04:40,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1619742.0, ans=0.5 2023-06-24 01:05:49,118 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.394e+02 5.949e+02 7.869e+02 1.155e+03 1.920e+03, threshold=1.574e+03, percent-clipped=1.0 2023-06-24 01:06:01,894 INFO [train.py:996] (2/4) Epoch 9, batch 26050, loss[loss=0.2152, simple_loss=0.2731, pruned_loss=0.07862, over 21129.00 frames. ], tot_loss[loss=0.255, simple_loss=0.3323, pruned_loss=0.08891, over 4281425.71 frames. ], batch size: 608, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:06:17,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1620042.0, ans=0.2 2023-06-24 01:06:46,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.19 vs. limit=15.0 2023-06-24 01:07:03,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1620222.0, ans=0.1 2023-06-24 01:07:11,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1620222.0, ans=0.0 2023-06-24 01:07:42,007 INFO [train.py:996] (2/4) Epoch 9, batch 26100, loss[loss=0.2498, simple_loss=0.3086, pruned_loss=0.09549, over 21678.00 frames. ], tot_loss[loss=0.2518, simple_loss=0.3266, pruned_loss=0.08847, over 4283919.47 frames. ], batch size: 473, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:08:29,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1620462.0, ans=0.0 2023-06-24 01:08:47,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1620522.0, ans=0.125 2023-06-24 01:09:13,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1620582.0, ans=0.125 2023-06-24 01:09:14,422 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 5.600e+02 7.578e+02 1.132e+03 2.519e+03, threshold=1.516e+03, percent-clipped=12.0 2023-06-24 01:09:27,680 INFO [train.py:996] (2/4) Epoch 9, batch 26150, loss[loss=0.2288, simple_loss=0.3035, pruned_loss=0.07708, over 21704.00 frames. ], tot_loss[loss=0.2509, simple_loss=0.3241, pruned_loss=0.08887, over 4287802.76 frames. ], batch size: 351, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:09:33,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.78 vs. limit=8.0 2023-06-24 01:10:18,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=15.0 2023-06-24 01:11:08,823 INFO [train.py:996] (2/4) Epoch 9, batch 26200, loss[loss=0.241, simple_loss=0.3489, pruned_loss=0.06661, over 21855.00 frames. ], tot_loss[loss=0.2499, simple_loss=0.3253, pruned_loss=0.08724, over 4280217.07 frames. 
], batch size: 371, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:11:45,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1621002.0, ans=0.125 2023-06-24 01:12:40,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.187e+02 6.061e+02 7.931e+02 1.081e+03 1.881e+03, threshold=1.586e+03, percent-clipped=8.0 2023-06-24 01:12:48,760 INFO [train.py:996] (2/4) Epoch 9, batch 26250, loss[loss=0.2691, simple_loss=0.3368, pruned_loss=0.1006, over 21784.00 frames. ], tot_loss[loss=0.2501, simple_loss=0.3287, pruned_loss=0.08572, over 4286697.82 frames. ], batch size: 441, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:12:55,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1621242.0, ans=0.125 2023-06-24 01:13:24,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1621362.0, ans=0.125 2023-06-24 01:14:27,381 INFO [train.py:996] (2/4) Epoch 9, batch 26300, loss[loss=0.2223, simple_loss=0.2924, pruned_loss=0.07612, over 21692.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.326, pruned_loss=0.08639, over 4288951.57 frames. ], batch size: 230, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:14:35,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1621542.0, ans=0.0 2023-06-24 01:14:47,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1621602.0, ans=0.125 2023-06-24 01:14:57,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1621602.0, ans=0.125 2023-06-24 01:15:39,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1621722.0, ans=0.0 2023-06-24 01:16:00,710 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.956e+02 5.501e+02 7.809e+02 1.118e+03 2.350e+03, threshold=1.562e+03, percent-clipped=10.0 2023-06-24 01:16:08,897 INFO [train.py:996] (2/4) Epoch 9, batch 26350, loss[loss=0.2699, simple_loss=0.3409, pruned_loss=0.09941, over 21862.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3246, pruned_loss=0.08687, over 4291929.68 frames. ], batch size: 371, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:17:02,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1622022.0, ans=0.09899494936611666 2023-06-24 01:17:09,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1622022.0, ans=0.125 2023-06-24 01:17:50,255 INFO [train.py:996] (2/4) Epoch 9, batch 26400, loss[loss=0.2002, simple_loss=0.2671, pruned_loss=0.06658, over 21756.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3192, pruned_loss=0.08694, over 4270121.76 frames. ], batch size: 124, lr: 3.22e-03, grad_scale: 32.0 2023-06-24 01:18:25,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.81 vs. limit=15.0 2023-06-24 01:18:38,986 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. 
limit=15.0 2023-06-24 01:19:09,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.43 vs. limit=15.0 2023-06-24 01:19:24,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1622382.0, ans=0.125 2023-06-24 01:19:27,013 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.128e+02 6.613e+02 9.874e+02 1.376e+03 2.693e+03, threshold=1.975e+03, percent-clipped=17.0 2023-06-24 01:19:33,650 INFO [train.py:996] (2/4) Epoch 9, batch 26450, loss[loss=0.2582, simple_loss=0.3498, pruned_loss=0.08332, over 21718.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3199, pruned_loss=0.08689, over 4260702.11 frames. ], batch size: 298, lr: 3.22e-03, grad_scale: 16.0 2023-06-24 01:20:42,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=15.0 2023-06-24 01:21:01,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1622682.0, ans=0.04949747468305833 2023-06-24 01:21:16,728 INFO [train.py:996] (2/4) Epoch 9, batch 26500, loss[loss=0.1971, simple_loss=0.266, pruned_loss=0.06407, over 21422.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3221, pruned_loss=0.08556, over 4261606.91 frames. ], batch size: 211, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:21:21,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1622742.0, ans=0.1 2023-06-24 01:21:44,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.70 vs. limit=15.0 2023-06-24 01:22:17,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=22.5 2023-06-24 01:22:21,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1622862.0, ans=0.125 2023-06-24 01:22:59,832 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 8.000e+02 1.204e+03 2.180e+03 3.765e+03, threshold=2.409e+03, percent-clipped=29.0 2023-06-24 01:23:05,389 INFO [train.py:996] (2/4) Epoch 9, batch 26550, loss[loss=0.1711, simple_loss=0.2535, pruned_loss=0.04436, over 21413.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3185, pruned_loss=0.0821, over 4255310.48 frames. ], batch size: 194, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:23:24,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.84 vs. limit=10.0 2023-06-24 01:23:24,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.88 vs. 
limit=6.0 2023-06-24 01:23:47,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1623102.0, ans=0.1 2023-06-24 01:24:09,996 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:24:10,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1623222.0, ans=0.04949747468305833 2023-06-24 01:24:10,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.66 vs. limit=10.0 2023-06-24 01:24:13,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1623222.0, ans=0.0 2023-06-24 01:24:28,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1623282.0, ans=0.2 2023-06-24 01:24:51,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1623282.0, ans=0.0 2023-06-24 01:24:55,378 INFO [train.py:996] (2/4) Epoch 9, batch 26600, loss[loss=0.2239, simple_loss=0.3037, pruned_loss=0.07202, over 21744.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.318, pruned_loss=0.08014, over 4251935.39 frames. ], batch size: 316, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:25:14,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1623402.0, ans=0.0 2023-06-24 01:25:19,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1623402.0, ans=0.125 2023-06-24 01:25:45,219 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-24 01:26:06,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1623522.0, ans=0.125 2023-06-24 01:26:33,770 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.884e+02 5.096e+02 6.459e+02 8.360e+02 2.532e+03, threshold=1.292e+03, percent-clipped=3.0 2023-06-24 01:26:37,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1623642.0, ans=0.125 2023-06-24 01:26:38,310 INFO [train.py:996] (2/4) Epoch 9, batch 26650, loss[loss=0.1655, simple_loss=0.2346, pruned_loss=0.04825, over 21795.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3112, pruned_loss=0.07894, over 4250056.93 frames. ], batch size: 118, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:27:15,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1623762.0, ans=0.1 2023-06-24 01:27:33,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1623822.0, ans=0.0 2023-06-24 01:27:53,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1623882.0, ans=0.0 2023-06-24 01:28:17,067 INFO [train.py:996] (2/4) Epoch 9, batch 26700, loss[loss=0.2462, simple_loss=0.3116, pruned_loss=0.0904, over 21319.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3037, pruned_loss=0.07596, over 4254479.10 frames. 
], batch size: 176, lr: 3.22e-03, grad_scale: 8.0 2023-06-24 01:28:17,447 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:29:28,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1624122.0, ans=0.125 2023-06-24 01:29:55,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1624182.0, ans=0.1 2023-06-24 01:29:58,466 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.138e+02 5.000e+02 7.039e+02 1.019e+03 2.446e+03, threshold=1.408e+03, percent-clipped=9.0 2023-06-24 01:30:03,532 INFO [train.py:996] (2/4) Epoch 9, batch 26750, loss[loss=0.2889, simple_loss=0.3607, pruned_loss=0.1085, over 21328.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3037, pruned_loss=0.07484, over 4266951.88 frames. ], batch size: 143, lr: 3.21e-03, grad_scale: 8.0 2023-06-24 01:30:15,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1624242.0, ans=0.125 2023-06-24 01:30:16,881 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 01:31:30,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1624482.0, ans=0.0 2023-06-24 01:31:44,730 INFO [train.py:996] (2/4) Epoch 9, batch 26800, loss[loss=0.2209, simple_loss=0.33, pruned_loss=0.05589, over 20000.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3106, pruned_loss=0.07831, over 4266255.12 frames. ], batch size: 703, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:31:52,536 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-24 01:31:59,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.61 vs. limit=6.0 2023-06-24 01:32:00,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.69 vs. limit=15.0 2023-06-24 01:32:12,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1624602.0, ans=0.0 2023-06-24 01:32:57,732 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.67 vs. limit=15.0 2023-06-24 01:33:17,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1624782.0, ans=0.125 2023-06-24 01:33:19,719 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.341e+02 8.835e+02 1.201e+03 2.832e+03, threshold=1.767e+03, percent-clipped=16.0 2023-06-24 01:33:24,493 INFO [train.py:996] (2/4) Epoch 9, batch 26850, loss[loss=0.2054, simple_loss=0.2675, pruned_loss=0.07162, over 22034.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3119, pruned_loss=0.08129, over 4269960.12 frames. 
], batch size: 103, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:33:45,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1624902.0, ans=0.125 2023-06-24 01:34:17,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1624962.0, ans=0.125 2023-06-24 01:34:54,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1625082.0, ans=0.125 2023-06-24 01:35:07,309 INFO [train.py:996] (2/4) Epoch 9, batch 26900, loss[loss=0.252, simple_loss=0.3047, pruned_loss=0.09967, over 21519.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3031, pruned_loss=0.0808, over 4272169.26 frames. ], batch size: 391, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:35:15,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1625142.0, ans=0.125 2023-06-24 01:35:29,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1625202.0, ans=0.125 2023-06-24 01:35:31,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1625202.0, ans=0.1 2023-06-24 01:35:40,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1625202.0, ans=0.125 2023-06-24 01:35:42,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1625262.0, ans=0.0 2023-06-24 01:36:08,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1625322.0, ans=0.0 2023-06-24 01:36:33,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1625382.0, ans=0.0 2023-06-24 01:36:37,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.966e+02 5.702e+02 7.755e+02 1.168e+03 3.900e+03, threshold=1.551e+03, percent-clipped=4.0 2023-06-24 01:36:37,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1625382.0, ans=0.025 2023-06-24 01:36:41,753 INFO [train.py:996] (2/4) Epoch 9, batch 26950, loss[loss=0.2498, simple_loss=0.3299, pruned_loss=0.08488, over 21646.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3019, pruned_loss=0.08066, over 4277966.69 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:37:32,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.70 vs. limit=6.0 2023-06-24 01:38:00,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1625622.0, ans=0.0 2023-06-24 01:38:02,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1625622.0, ans=0.04949747468305833 2023-06-24 01:38:23,718 INFO [train.py:996] (2/4) Epoch 9, batch 27000, loss[loss=0.2291, simple_loss=0.3266, pruned_loss=0.0658, over 21626.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3022, pruned_loss=0.07817, over 4271983.96 frames. 
], batch size: 389, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:38:23,718 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 01:38:43,008 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2397, simple_loss=0.3375, pruned_loss=0.07102, over 1796401.00 frames. 2023-06-24 01:38:43,008 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 01:39:27,380 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.34 vs. limit=15.0 2023-06-24 01:40:08,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1625982.0, ans=0.1 2023-06-24 01:40:16,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1625982.0, ans=0.125 2023-06-24 01:40:18,512 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.845e+02 5.629e+02 9.070e+02 1.221e+03 2.568e+03, threshold=1.814e+03, percent-clipped=16.0 2023-06-24 01:40:23,208 INFO [train.py:996] (2/4) Epoch 9, batch 27050, loss[loss=0.2196, simple_loss=0.3101, pruned_loss=0.06451, over 21684.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3034, pruned_loss=0.07422, over 4271498.40 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:40:27,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1626042.0, ans=0.125 2023-06-24 01:40:33,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1626042.0, ans=0.07 2023-06-24 01:40:44,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1626102.0, ans=0.2 2023-06-24 01:41:20,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1626162.0, ans=0.125 2023-06-24 01:41:37,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1626222.0, ans=0.125 2023-06-24 01:42:04,826 INFO [train.py:996] (2/4) Epoch 9, batch 27100, loss[loss=0.2307, simple_loss=0.3085, pruned_loss=0.0765, over 21891.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3057, pruned_loss=0.07551, over 4274774.50 frames. ], batch size: 371, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:42:14,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1626342.0, ans=0.1 2023-06-24 01:43:07,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.16 vs. limit=10.0 2023-06-24 01:43:31,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1626582.0, ans=0.1 2023-06-24 01:43:42,159 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.346e+02 6.138e+02 8.870e+02 1.301e+03 2.299e+03, threshold=1.774e+03, percent-clipped=4.0 2023-06-24 01:43:47,576 INFO [train.py:996] (2/4) Epoch 9, batch 27150, loss[loss=0.2958, simple_loss=0.3825, pruned_loss=0.1045, over 21749.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3165, pruned_loss=0.07847, over 4280669.28 frames. 
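The "Whitening: ... metric=X vs. limit=Y" lines compare a per-module whiteness statistic of activations against a (scheduled) limit: values near 1 indicate roughly isotropic features, while large values mean a few directions dominate. The sketch below computes one such statistic from the covariance eigenvalue spread, assuming plain (num_frames, num_channels) activations; it is only meant to make the logged numbers interpretable and is not the exact formula used in scaling.py.

```python
import torch


def whitening_metric(x: torch.Tensor, num_groups: int = 1) -> float:
    """Illustrative whiteness statistic for activations of shape
    (num_frames, num_channels): the ratio mean(eig^2) / mean(eig)^2 of the
    per-group covariance eigenvalues is 1.0 for perfectly isotropic features
    and grows when a few directions dominate.  Not the exact scaling.py code."""
    n, c = x.shape
    xg = x.reshape(n, num_groups, c // num_groups)
    ratios = []
    for g in range(num_groups):
        feats = xg[:, g, :]
        feats = feats - feats.mean(dim=0, keepdim=True)
        cov = feats.T @ feats / n
        eigs = torch.linalg.eigvalsh(cov)
        ratios.append((eigs ** 2).mean() / (eigs.mean() ** 2 + 1e-20))
    return float(torch.stack(ratios).mean())


print(whitening_metric(torch.randn(2000, 256)))         # ~1, well below limit=15.0
print(whitening_metric(torch.randn(2000, 1) * torch.randn(1, 256)))  # rank-1: large
```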
], batch size: 332, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:45:32,204 INFO [train.py:996] (2/4) Epoch 9, batch 27200, loss[loss=0.309, simple_loss=0.3828, pruned_loss=0.1176, over 21271.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3244, pruned_loss=0.08101, over 4277142.65 frames. ], batch size: 548, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:45:51,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1626942.0, ans=0.2 2023-06-24 01:46:33,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1627062.0, ans=0.2 2023-06-24 01:47:17,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 6.892e+02 9.906e+02 1.306e+03 3.118e+03, threshold=1.981e+03, percent-clipped=15.0 2023-06-24 01:47:22,711 INFO [train.py:996] (2/4) Epoch 9, batch 27250, loss[loss=0.2312, simple_loss=0.3089, pruned_loss=0.07675, over 20643.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3292, pruned_loss=0.08618, over 4280450.20 frames. ], batch size: 607, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:47:29,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1627242.0, ans=0.035 2023-06-24 01:48:10,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1627362.0, ans=0.125 2023-06-24 01:48:11,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1627362.0, ans=0.125 2023-06-24 01:48:44,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1627422.0, ans=0.2 2023-06-24 01:49:03,870 INFO [train.py:996] (2/4) Epoch 9, batch 27300, loss[loss=0.2325, simple_loss=0.322, pruned_loss=0.07147, over 21652.00 frames. ], tot_loss[loss=0.2515, simple_loss=0.3303, pruned_loss=0.08639, over 4279538.31 frames. ], batch size: 230, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:50:02,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.76 vs. limit=15.0 2023-06-24 01:50:40,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.677e+02 5.255e+02 6.671e+02 8.856e+02 1.687e+03, threshold=1.334e+03, percent-clipped=0.0 2023-06-24 01:50:43,350 INFO [train.py:996] (2/4) Epoch 9, batch 27350, loss[loss=0.2273, simple_loss=0.3089, pruned_loss=0.07285, over 21242.00 frames. ], tot_loss[loss=0.2541, simple_loss=0.3329, pruned_loss=0.08767, over 4281142.13 frames. ], batch size: 143, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:50:44,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.00 vs. 
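Each batch line reports the current batch's loss ("loss[... over N frames]") alongside a running "tot_loss[... over M frames]" aggregated over recent batches and weighted by frame count. A small frame-weighted running average in that spirit is sketched below, fed with two batch figures quoted earlier in this epoch; train.py's actual bookkeeping may differ (e.g. decayed averaging), so treat this purely as an aid for reading the log.

```python
from collections import deque
from typing import Tuple


class RunningLoss:
    """Illustrative frame-weighted running average, in the spirit of the
    'tot_loss[... over M frames]' fields above.  Only an aid for reading
    the log, not the train.py implementation."""

    def __init__(self, max_batches: int = 500):
        self.items = deque(maxlen=max_batches)   # (loss * frames, frames) per batch

    def update(self, batch_loss: float, num_frames: float) -> None:
        self.items.append((batch_loss * num_frames, num_frames))

    def average(self) -> Tuple[float, float]:
        frames = sum(f for _, f in self.items)
        return sum(s for s, _ in self.items) / max(frames, 1.0), frames


tot = RunningLoss()
tot.update(0.2889, 21328.0)   # batch 26750 figures quoted above
tot.update(0.2054, 22034.0)   # batch 26850 figures quoted above
print(tot.average())          # frame-weighted loss and the frames it covers
```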
limit=22.5 2023-06-24 01:51:04,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1627902.0, ans=0.04949747468305833 2023-06-24 01:51:15,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1627902.0, ans=0.125 2023-06-24 01:51:19,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1627962.0, ans=0.05 2023-06-24 01:52:17,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1628082.0, ans=0.125 2023-06-24 01:52:17,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1628082.0, ans=0.125 2023-06-24 01:52:21,408 INFO [train.py:996] (2/4) Epoch 9, batch 27400, loss[loss=0.2215, simple_loss=0.2771, pruned_loss=0.0829, over 21224.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3285, pruned_loss=0.08714, over 4286286.75 frames. ], batch size: 143, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:52:33,788 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.22 vs. limit=22.5 2023-06-24 01:52:43,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1628202.0, ans=0.125 2023-06-24 01:52:48,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1628202.0, ans=0.125 2023-06-24 01:52:48,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1628202.0, ans=0.0 2023-06-24 01:52:59,488 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.18 vs. limit=10.0 2023-06-24 01:53:00,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1628262.0, ans=0.0 2023-06-24 01:53:54,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1628382.0, ans=0.2 2023-06-24 01:53:58,830 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.992e+02 5.150e+02 6.408e+02 1.007e+03 1.892e+03, threshold=1.282e+03, percent-clipped=7.0 2023-06-24 01:54:01,965 INFO [train.py:996] (2/4) Epoch 9, batch 27450, loss[loss=0.3263, simple_loss=0.3807, pruned_loss=0.136, over 21315.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3225, pruned_loss=0.08606, over 4278649.73 frames. ], batch size: 507, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:54:11,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1628442.0, ans=0.0 2023-06-24 01:54:21,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1628502.0, ans=0.0 2023-06-24 01:54:39,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.26 vs. 
limit=22.5 2023-06-24 01:55:00,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1628562.0, ans=0.125 2023-06-24 01:55:28,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1628682.0, ans=0.125 2023-06-24 01:55:39,262 INFO [train.py:996] (2/4) Epoch 9, batch 27500, loss[loss=0.2589, simple_loss=0.321, pruned_loss=0.09841, over 21370.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.321, pruned_loss=0.08623, over 4286184.64 frames. ], batch size: 144, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:56:07,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1628802.0, ans=0.125 2023-06-24 01:57:11,063 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.736e+02 5.167e+02 6.606e+02 9.536e+02 1.970e+03, threshold=1.321e+03, percent-clipped=8.0 2023-06-24 01:57:18,524 INFO [train.py:996] (2/4) Epoch 9, batch 27550, loss[loss=0.3353, simple_loss=0.4192, pruned_loss=0.1257, over 20012.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3165, pruned_loss=0.084, over 4285766.44 frames. ], batch size: 702, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 01:57:31,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1629042.0, ans=0.125 2023-06-24 01:57:50,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1629102.0, ans=0.0 2023-06-24 01:58:19,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1629222.0, ans=0.2 2023-06-24 01:58:56,721 INFO [train.py:996] (2/4) Epoch 9, batch 27600, loss[loss=0.2319, simple_loss=0.2953, pruned_loss=0.08424, over 21772.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3091, pruned_loss=0.0827, over 4287159.90 frames. ], batch size: 371, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 01:59:10,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1629342.0, ans=0.125 2023-06-24 02:00:27,991 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.819e+02 6.454e+02 1.076e+03 1.748e+03 4.788e+03, threshold=2.152e+03, percent-clipped=40.0 2023-06-24 02:00:31,082 INFO [train.py:996] (2/4) Epoch 9, batch 27650, loss[loss=0.2415, simple_loss=0.3064, pruned_loss=0.08832, over 21745.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.303, pruned_loss=0.08181, over 4275238.48 frames. ], batch size: 298, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 02:00:41,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.90 vs. 
limit=15.0 2023-06-24 02:01:18,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1629762.0, ans=0.0 2023-06-24 02:01:31,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1629762.0, ans=0.1 2023-06-24 02:01:44,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1629822.0, ans=0.125 2023-06-24 02:01:47,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1629822.0, ans=0.125 2023-06-24 02:02:14,132 INFO [train.py:996] (2/4) Epoch 9, batch 27700, loss[loss=0.2671, simple_loss=0.3422, pruned_loss=0.09597, over 21756.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.305, pruned_loss=0.08076, over 4275315.07 frames. ], batch size: 414, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:02:29,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1630002.0, ans=0.0 2023-06-24 02:02:32,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1630002.0, ans=0.125 2023-06-24 02:02:34,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1630002.0, ans=0.0 2023-06-24 02:02:35,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1630002.0, ans=0.0 2023-06-24 02:02:50,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1630002.0, ans=0.125 2023-06-24 02:03:06,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1630062.0, ans=0.2 2023-06-24 02:03:34,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1630182.0, ans=0.125 2023-06-24 02:03:53,631 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.601e+02 6.018e+02 9.406e+02 1.396e+03 2.807e+03, threshold=1.881e+03, percent-clipped=3.0 2023-06-24 02:03:55,110 INFO [train.py:996] (2/4) Epoch 9, batch 27750, loss[loss=0.2217, simple_loss=0.3336, pruned_loss=0.05493, over 20821.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3071, pruned_loss=0.07979, over 4276871.78 frames. ], batch size: 608, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:03:58,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1630242.0, ans=0.0 2023-06-24 02:03:58,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1630242.0, ans=0.125 2023-06-24 02:04:12,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.46 vs. limit=15.0 2023-06-24 02:04:45,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1630362.0, ans=0.0 2023-06-24 02:05:16,746 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.87 vs. 
limit=15.0 2023-06-24 02:05:20,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1630482.0, ans=0.1 2023-06-24 02:05:28,015 INFO [train.py:996] (2/4) Epoch 9, batch 27800, loss[loss=0.2616, simple_loss=0.3205, pruned_loss=0.1013, over 21624.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.306, pruned_loss=0.08018, over 4283954.74 frames. ], batch size: 548, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:06:31,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1630662.0, ans=0.125 2023-06-24 02:06:32,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1630662.0, ans=0.125 2023-06-24 02:06:39,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.24 vs. limit=15.0 2023-06-24 02:06:53,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1630782.0, ans=0.1 2023-06-24 02:07:10,671 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.786e+02 6.303e+02 8.930e+02 1.307e+03 2.305e+03, threshold=1.786e+03, percent-clipped=6.0 2023-06-24 02:07:12,454 INFO [train.py:996] (2/4) Epoch 9, batch 27850, loss[loss=0.2501, simple_loss=0.325, pruned_loss=0.08762, over 21799.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3059, pruned_loss=0.08155, over 4292525.53 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:08:02,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1630962.0, ans=0.125 2023-06-24 02:08:23,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1631022.0, ans=0.2 2023-06-24 02:08:30,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1631022.0, ans=0.1 2023-06-24 02:08:49,369 INFO [train.py:996] (2/4) Epoch 9, batch 27900, loss[loss=0.294, simple_loss=0.4028, pruned_loss=0.09263, over 21180.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3162, pruned_loss=0.08283, over 4286077.63 frames. ], batch size: 548, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:09:20,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1631202.0, ans=0.125 2023-06-24 02:09:21,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1631202.0, ans=0.2 2023-06-24 02:09:26,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1631202.0, ans=0.2 2023-06-24 02:10:44,055 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.905e+02 5.823e+02 7.958e+02 1.181e+03 2.526e+03, threshold=1.592e+03, percent-clipped=8.0 2023-06-24 02:10:45,585 INFO [train.py:996] (2/4) Epoch 9, batch 27950, loss[loss=0.2232, simple_loss=0.3115, pruned_loss=0.06744, over 21721.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3165, pruned_loss=0.0797, over 4282136.07 frames. ], batch size: 247, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:10:51,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. 
limit=10.0 2023-06-24 02:11:05,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1631502.0, ans=0.125 2023-06-24 02:12:18,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1631742.0, ans=0.125 2023-06-24 02:12:20,023 INFO [train.py:996] (2/4) Epoch 9, batch 28000, loss[loss=0.2017, simple_loss=0.2754, pruned_loss=0.06405, over 20137.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.312, pruned_loss=0.07616, over 4285968.49 frames. ], batch size: 703, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 02:12:20,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1631742.0, ans=0.125 2023-06-24 02:12:44,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1631802.0, ans=0.0 2023-06-24 02:12:44,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1631802.0, ans=0.125 2023-06-24 02:12:49,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1631802.0, ans=0.09899494936611666 2023-06-24 02:13:15,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1631862.0, ans=15.0 2023-06-24 02:13:24,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1631922.0, ans=0.125 2023-06-24 02:14:06,382 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.113e+02 6.367e+02 8.563e+02 1.195e+03 2.843e+03, threshold=1.713e+03, percent-clipped=10.0 2023-06-24 02:14:06,416 INFO [train.py:996] (2/4) Epoch 9, batch 28050, loss[loss=0.2752, simple_loss=0.3521, pruned_loss=0.09919, over 21522.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3108, pruned_loss=0.07805, over 4286620.83 frames. ], batch size: 471, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:14:08,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1632042.0, ans=0.1 2023-06-24 02:14:50,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1632162.0, ans=0.2 2023-06-24 02:14:55,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1632162.0, ans=0.04949747468305833 2023-06-24 02:15:45,598 INFO [train.py:996] (2/4) Epoch 9, batch 28100, loss[loss=0.2223, simple_loss=0.2893, pruned_loss=0.07768, over 21444.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.308, pruned_loss=0.07757, over 4278818.87 frames. 
], batch size: 389, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:16:04,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1632402.0, ans=0.125 2023-06-24 02:17:07,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1632582.0, ans=0.2 2023-06-24 02:17:22,317 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.058e+02 5.561e+02 8.006e+02 1.245e+03 3.714e+03, threshold=1.601e+03, percent-clipped=14.0 2023-06-24 02:17:22,346 INFO [train.py:996] (2/4) Epoch 9, batch 28150, loss[loss=0.2209, simple_loss=0.2847, pruned_loss=0.07853, over 21764.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.303, pruned_loss=0.07775, over 4268385.97 frames. ], batch size: 371, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:17:37,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.99 vs. limit=15.0 2023-06-24 02:17:40,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1632702.0, ans=0.2 2023-06-24 02:17:43,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1632702.0, ans=0.0 2023-06-24 02:18:06,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1632762.0, ans=0.125 2023-06-24 02:18:47,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1632882.0, ans=0.2 2023-06-24 02:18:56,302 INFO [train.py:996] (2/4) Epoch 9, batch 28200, loss[loss=0.2524, simple_loss=0.3214, pruned_loss=0.09172, over 21962.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3017, pruned_loss=0.07995, over 4267885.15 frames. ], batch size: 317, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:19:05,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1632942.0, ans=0.0 2023-06-24 02:19:44,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1633062.0, ans=0.0 2023-06-24 02:19:55,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1633122.0, ans=0.125 2023-06-24 02:20:14,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1633122.0, ans=0.125 2023-06-24 02:20:35,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.373e+02 7.070e+02 1.041e+03 1.545e+03 2.791e+03, threshold=2.082e+03, percent-clipped=22.0 2023-06-24 02:20:35,775 INFO [train.py:996] (2/4) Epoch 9, batch 28250, loss[loss=0.2166, simple_loss=0.2881, pruned_loss=0.07261, over 21664.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3048, pruned_loss=0.0826, over 4271583.29 frames. 
], batch size: 332, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:20:39,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1633242.0, ans=0.125 2023-06-24 02:20:52,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1633302.0, ans=0.125 2023-06-24 02:21:11,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1633362.0, ans=0.2 2023-06-24 02:21:13,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1633362.0, ans=0.2 2023-06-24 02:21:50,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1633422.0, ans=22.5 2023-06-24 02:21:58,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1633422.0, ans=0.0 2023-06-24 02:22:05,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1633482.0, ans=0.1 2023-06-24 02:22:17,407 INFO [train.py:996] (2/4) Epoch 9, batch 28300, loss[loss=0.1846, simple_loss=0.2786, pruned_loss=0.04524, over 21764.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3022, pruned_loss=0.07992, over 4267398.65 frames. ], batch size: 316, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:22:19,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1633542.0, ans=0.07 2023-06-24 02:23:56,793 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.614e+02 5.879e+02 9.596e+02 1.261e+03 3.601e+03, threshold=1.919e+03, percent-clipped=6.0 2023-06-24 02:23:56,814 INFO [train.py:996] (2/4) Epoch 9, batch 28350, loss[loss=0.2318, simple_loss=0.3394, pruned_loss=0.0621, over 21567.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2974, pruned_loss=0.07432, over 4258856.50 frames. ], batch size: 441, lr: 3.21e-03, grad_scale: 16.0 2023-06-24 02:24:07,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=15.0 2023-06-24 02:25:20,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1634082.0, ans=0.125 2023-06-24 02:25:35,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1634082.0, ans=0.2 2023-06-24 02:25:40,914 INFO [train.py:996] (2/4) Epoch 9, batch 28400, loss[loss=0.3013, simple_loss=0.3545, pruned_loss=0.124, over 21331.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2951, pruned_loss=0.07525, over 4254773.39 frames. 
], batch size: 471, lr: 3.21e-03, grad_scale: 32.0 2023-06-24 02:25:41,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1634142.0, ans=0.1 2023-06-24 02:25:52,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1634142.0, ans=0.0 2023-06-24 02:26:08,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1634202.0, ans=0.0 2023-06-24 02:26:20,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1634262.0, ans=0.2 2023-06-24 02:26:25,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1634262.0, ans=0.0 2023-06-24 02:26:47,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1634322.0, ans=0.125 2023-06-24 02:27:18,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.912e+02 5.958e+02 8.832e+02 1.277e+03 2.222e+03, threshold=1.766e+03, percent-clipped=4.0 2023-06-24 02:27:18,275 INFO [train.py:996] (2/4) Epoch 9, batch 28450, loss[loss=0.1925, simple_loss=0.2565, pruned_loss=0.06424, over 20779.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2992, pruned_loss=0.07821, over 4262461.92 frames. ], batch size: 608, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:27:58,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1634562.0, ans=0.125 2023-06-24 02:28:42,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1634682.0, ans=0.0 2023-06-24 02:28:48,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1634682.0, ans=0.125 2023-06-24 02:28:55,994 INFO [train.py:996] (2/4) Epoch 9, batch 28500, loss[loss=0.2319, simple_loss=0.3051, pruned_loss=0.07932, over 21881.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3018, pruned_loss=0.08019, over 4271844.70 frames. ], batch size: 371, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:29:09,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1634742.0, ans=15.0 2023-06-24 02:29:27,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1634802.0, ans=0.2 2023-06-24 02:29:45,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.60 vs. limit=15.0 2023-06-24 02:29:46,807 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:29:50,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1634862.0, ans=0.0 2023-06-24 02:29:58,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-24 02:30:40,762 INFO [train.py:996] (2/4) Epoch 9, batch 28550, loss[loss=0.2197, simple_loss=0.2994, pruned_loss=0.06997, over 20807.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3107, pruned_loss=0.083, over 4276284.23 frames. 
], batch size: 608, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:30:42,301 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.957e+02 5.998e+02 7.738e+02 1.217e+03 2.112e+03, threshold=1.548e+03, percent-clipped=6.0 2023-06-24 02:31:42,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.19 vs. limit=15.0 2023-06-24 02:32:08,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1635282.0, ans=0.125 2023-06-24 02:32:21,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1635282.0, ans=0.0 2023-06-24 02:32:24,243 INFO [train.py:996] (2/4) Epoch 9, batch 28600, loss[loss=0.2473, simple_loss=0.3184, pruned_loss=0.08807, over 21721.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3178, pruned_loss=0.08498, over 4274570.06 frames. ], batch size: 298, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:32:52,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1635402.0, ans=0.0 2023-06-24 02:33:03,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1635462.0, ans=0.125 2023-06-24 02:33:33,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1635522.0, ans=0.125 2023-06-24 02:33:43,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1635582.0, ans=0.0 2023-06-24 02:34:01,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1635642.0, ans=0.2 2023-06-24 02:34:02,746 INFO [train.py:996] (2/4) Epoch 9, batch 28650, loss[loss=0.2311, simple_loss=0.2871, pruned_loss=0.0876, over 21279.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3135, pruned_loss=0.08467, over 4268717.11 frames. ], batch size: 177, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:34:11,196 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.746e+02 6.069e+02 8.380e+02 1.162e+03 2.307e+03, threshold=1.676e+03, percent-clipped=7.0 2023-06-24 02:34:46,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1635762.0, ans=0.125 2023-06-24 02:35:03,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1635822.0, ans=0.125 2023-06-24 02:35:05,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1635822.0, ans=0.0 2023-06-24 02:35:32,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1635882.0, ans=0.125 2023-06-24 02:35:38,485 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:35:46,696 INFO [train.py:996] (2/4) Epoch 9, batch 28700, loss[loss=0.254, simple_loss=0.3253, pruned_loss=0.09139, over 21751.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3135, pruned_loss=0.08608, over 4265135.62 frames. 
], batch size: 351, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:37:04,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1636182.0, ans=0.125 2023-06-24 02:37:07,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1636182.0, ans=0.02 2023-06-24 02:37:11,311 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:37:25,543 INFO [train.py:996] (2/4) Epoch 9, batch 28750, loss[loss=0.2627, simple_loss=0.3651, pruned_loss=0.08012, over 19868.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3138, pruned_loss=0.08686, over 4269742.56 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 8.0 2023-06-24 02:37:28,864 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.971e+02 6.417e+02 8.454e+02 1.129e+03 2.571e+03, threshold=1.691e+03, percent-clipped=6.0 2023-06-24 02:37:43,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1636302.0, ans=0.0 2023-06-24 02:38:59,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1636482.0, ans=0.0 2023-06-24 02:39:03,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1636542.0, ans=0.125 2023-06-24 02:39:04,497 INFO [train.py:996] (2/4) Epoch 9, batch 28800, loss[loss=0.2493, simple_loss=0.3206, pruned_loss=0.08901, over 21376.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3173, pruned_loss=0.08701, over 4276495.16 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:39:05,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1636542.0, ans=0.125 2023-06-24 02:39:05,971 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.58 vs. limit=15.0 2023-06-24 02:39:33,530 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.03 vs. limit=15.0 2023-06-24 02:40:38,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1636782.0, ans=0.0 2023-06-24 02:40:43,029 INFO [train.py:996] (2/4) Epoch 9, batch 28850, loss[loss=0.2613, simple_loss=0.3171, pruned_loss=0.1027, over 21278.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.319, pruned_loss=0.08864, over 4279334.69 frames. ], batch size: 159, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:40:46,359 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 7.030e+02 9.281e+02 1.224e+03 2.045e+03, threshold=1.856e+03, percent-clipped=4.0 2023-06-24 02:40:51,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.14 vs. limit=15.0 2023-06-24 02:41:20,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1636962.0, ans=0.1 2023-06-24 02:42:19,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.87 vs. 
limit=15.0 2023-06-24 02:42:21,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1637142.0, ans=0.2 2023-06-24 02:42:23,127 INFO [train.py:996] (2/4) Epoch 9, batch 28900, loss[loss=0.2593, simple_loss=0.3361, pruned_loss=0.09127, over 21384.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3228, pruned_loss=0.09062, over 4275862.54 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:42:39,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1637142.0, ans=0.125 2023-06-24 02:44:08,014 INFO [train.py:996] (2/4) Epoch 9, batch 28950, loss[loss=0.2533, simple_loss=0.3582, pruned_loss=0.07415, over 21209.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3214, pruned_loss=0.08901, over 4275184.01 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:44:11,409 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.380e+02 7.524e+02 1.128e+03 1.793e+03 3.083e+03, threshold=2.257e+03, percent-clipped=23.0 2023-06-24 02:44:18,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1637442.0, ans=0.0 2023-06-24 02:45:03,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1637562.0, ans=0.125 2023-06-24 02:45:08,172 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:45:09,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1637622.0, ans=0.125 2023-06-24 02:45:22,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1637622.0, ans=0.0 2023-06-24 02:45:47,869 INFO [train.py:996] (2/4) Epoch 9, batch 29000, loss[loss=0.2586, simple_loss=0.3383, pruned_loss=0.08944, over 21384.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3244, pruned_loss=0.08753, over 4274020.04 frames. ], batch size: 131, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:46:49,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1637922.0, ans=15.0 2023-06-24 02:47:32,702 INFO [train.py:996] (2/4) Epoch 9, batch 29050, loss[loss=0.2303, simple_loss=0.2975, pruned_loss=0.08157, over 21693.00 frames. ], tot_loss[loss=0.2497, simple_loss=0.3246, pruned_loss=0.08735, over 4272362.26 frames. 
], batch size: 230, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:47:40,517 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 6.530e+02 1.098e+03 1.738e+03 3.592e+03, threshold=2.195e+03, percent-clipped=7.0 2023-06-24 02:47:50,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1638042.0, ans=0.125 2023-06-24 02:48:04,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1638102.0, ans=0.0 2023-06-24 02:48:15,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1638162.0, ans=0.125 2023-06-24 02:48:17,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1638162.0, ans=0.125 2023-06-24 02:48:17,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1638162.0, ans=0.2 2023-06-24 02:48:22,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1638162.0, ans=0.125 2023-06-24 02:49:11,746 INFO [train.py:996] (2/4) Epoch 9, batch 29100, loss[loss=0.2038, simple_loss=0.2709, pruned_loss=0.06838, over 21755.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3167, pruned_loss=0.0856, over 4272460.16 frames. ], batch size: 112, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:50:01,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1638462.0, ans=0.125 2023-06-24 02:50:04,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1638462.0, ans=0.125 2023-06-24 02:50:47,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1638582.0, ans=0.035 2023-06-24 02:50:54,259 INFO [train.py:996] (2/4) Epoch 9, batch 29150, loss[loss=0.2305, simple_loss=0.2771, pruned_loss=0.09193, over 20057.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3137, pruned_loss=0.08334, over 4271947.49 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 02:50:57,277 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.333e+02 5.735e+02 8.237e+02 1.411e+03 3.649e+03, threshold=1.647e+03, percent-clipped=7.0 2023-06-24 02:51:57,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1638822.0, ans=0.125 2023-06-24 02:52:21,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1638882.0, ans=0.0 2023-06-24 02:52:32,383 INFO [train.py:996] (2/4) Epoch 9, batch 29200, loss[loss=0.2096, simple_loss=0.2711, pruned_loss=0.07406, over 20203.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3094, pruned_loss=0.08291, over 4264317.19 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:53:09,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1639062.0, ans=0.125 2023-06-24 02:53:33,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1639122.0, ans=0.125 2023-06-24 02:54:06,529 INFO [train.py:996] (2/4) Epoch 9, batch 29250, loss[loss=0.2674, simple_loss=0.3529, pruned_loss=0.09089, over 21710.00 frames. 
], tot_loss[loss=0.2347, simple_loss=0.308, pruned_loss=0.08075, over 4264326.84 frames. ], batch size: 415, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:54:09,740 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.813e+02 6.323e+02 1.080e+03 1.364e+03 2.361e+03, threshold=2.161e+03, percent-clipped=10.0 2023-06-24 02:54:26,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1639302.0, ans=0.125 2023-06-24 02:54:43,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1639362.0, ans=0.125 2023-06-24 02:55:45,681 INFO [train.py:996] (2/4) Epoch 9, batch 29300, loss[loss=0.1919, simple_loss=0.2482, pruned_loss=0.0678, over 19962.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3081, pruned_loss=0.07943, over 4270518.72 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:56:14,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1639602.0, ans=0.07 2023-06-24 02:56:15,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1639602.0, ans=0.2 2023-06-24 02:56:42,660 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 02:56:52,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1639722.0, ans=0.125 2023-06-24 02:57:21,345 INFO [train.py:996] (2/4) Epoch 9, batch 29350, loss[loss=0.2068, simple_loss=0.277, pruned_loss=0.06833, over 21110.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3041, pruned_loss=0.07888, over 4266385.25 frames. ], batch size: 143, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:57:29,477 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.886e+02 5.877e+02 8.410e+02 1.271e+03 3.253e+03, threshold=1.682e+03, percent-clipped=5.0 2023-06-24 02:57:37,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1639842.0, ans=0.2 2023-06-24 02:57:48,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1639902.0, ans=0.1 2023-06-24 02:58:25,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.38 vs. limit=15.0 2023-06-24 02:58:42,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1640082.0, ans=15.0 2023-06-24 02:59:02,511 INFO [train.py:996] (2/4) Epoch 9, batch 29400, loss[loss=0.1272, simple_loss=0.1855, pruned_loss=0.0344, over 21844.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3048, pruned_loss=0.0769, over 4267800.84 frames. ], batch size: 118, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 02:59:26,491 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.22 vs. 
limit=15.0 2023-06-24 02:59:29,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1640202.0, ans=0.2 2023-06-24 02:59:29,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1640202.0, ans=0.125 2023-06-24 02:59:32,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1640202.0, ans=0.125 2023-06-24 02:59:35,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1640202.0, ans=0.0 2023-06-24 03:00:38,188 INFO [train.py:996] (2/4) Epoch 9, batch 29450, loss[loss=0.1812, simple_loss=0.249, pruned_loss=0.05672, over 21593.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3036, pruned_loss=0.07627, over 4265940.41 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:00:43,279 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.242e+02 8.018e+02 1.543e+03 2.395e+03 4.126e+03, threshold=3.085e+03, percent-clipped=41.0 2023-06-24 03:00:53,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1640442.0, ans=0.1 2023-06-24 03:00:56,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1640442.0, ans=0.0 2023-06-24 03:01:12,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1640502.0, ans=0.02 2023-06-24 03:01:12,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1640502.0, ans=0.0 2023-06-24 03:01:16,196 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-24 03:01:43,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1640622.0, ans=0.2 2023-06-24 03:01:48,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.95 vs. limit=10.0 2023-06-24 03:02:13,441 INFO [train.py:996] (2/4) Epoch 9, batch 29500, loss[loss=0.2552, simple_loss=0.3238, pruned_loss=0.09325, over 21981.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3078, pruned_loss=0.07956, over 4275586.79 frames. ], batch size: 373, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:02:31,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1640742.0, ans=0.0 2023-06-24 03:03:28,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1640922.0, ans=0.125 2023-06-24 03:03:41,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1640982.0, ans=0.0 2023-06-24 03:03:52,436 INFO [train.py:996] (2/4) Epoch 9, batch 29550, loss[loss=0.2463, simple_loss=0.3087, pruned_loss=0.09193, over 21456.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3071, pruned_loss=0.08138, over 4280753.97 frames. 
], batch size: 144, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:03:57,062 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.109e+02 5.276e+02 6.489e+02 8.099e+02 1.842e+03, threshold=1.298e+03, percent-clipped=0.0 2023-06-24 03:04:15,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1641102.0, ans=22.5 2023-06-24 03:04:16,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1641102.0, ans=10.0 2023-06-24 03:04:54,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.63 vs. limit=15.0 2023-06-24 03:05:33,525 INFO [train.py:996] (2/4) Epoch 9, batch 29600, loss[loss=0.3188, simple_loss=0.4441, pruned_loss=0.09673, over 19786.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3141, pruned_loss=0.08386, over 4283423.67 frames. ], batch size: 702, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:06:00,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1641402.0, ans=0.1 2023-06-24 03:06:26,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.89 vs. limit=15.0 2023-06-24 03:07:01,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1641582.0, ans=0.125 2023-06-24 03:07:16,565 INFO [train.py:996] (2/4) Epoch 9, batch 29650, loss[loss=0.2195, simple_loss=0.2911, pruned_loss=0.07393, over 21507.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3121, pruned_loss=0.08076, over 4289058.51 frames. ], batch size: 548, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:07:21,399 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.535e+02 6.438e+02 9.787e+02 1.351e+03 2.800e+03, threshold=1.957e+03, percent-clipped=29.0 2023-06-24 03:07:38,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1641702.0, ans=0.1 2023-06-24 03:08:09,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-24 03:08:56,641 INFO [train.py:996] (2/4) Epoch 9, batch 29700, loss[loss=0.2057, simple_loss=0.2767, pruned_loss=0.06733, over 21698.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3145, pruned_loss=0.08103, over 4293333.68 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:09:07,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.85 vs. limit=22.5 2023-06-24 03:10:34,579 INFO [train.py:996] (2/4) Epoch 9, batch 29750, loss[loss=0.2062, simple_loss=0.275, pruned_loss=0.06875, over 20161.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3183, pruned_loss=0.08081, over 4287161.14 frames. 
], batch size: 702, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:10:40,757 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.742e+02 5.529e+02 6.914e+02 9.553e+02 2.350e+03, threshold=1.383e+03, percent-clipped=6.0 2023-06-24 03:11:55,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.77 vs. limit=15.0 2023-06-24 03:12:13,536 INFO [train.py:996] (2/4) Epoch 9, batch 29800, loss[loss=0.239, simple_loss=0.3067, pruned_loss=0.08564, over 21431.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3197, pruned_loss=0.08177, over 4291417.49 frames. ], batch size: 211, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:12:23,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1642542.0, ans=0.0 2023-06-24 03:12:27,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1642602.0, ans=0.125 2023-06-24 03:12:45,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1642602.0, ans=0.125 2023-06-24 03:12:59,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1642662.0, ans=0.1 2023-06-24 03:13:21,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1642722.0, ans=0.125 2023-06-24 03:13:48,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1642782.0, ans=0.125 2023-06-24 03:13:51,069 INFO [train.py:996] (2/4) Epoch 9, batch 29850, loss[loss=0.2304, simple_loss=0.3381, pruned_loss=0.06129, over 19790.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3156, pruned_loss=0.07914, over 4295193.95 frames. ], batch size: 703, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:13:57,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.819e+02 7.548e+02 1.159e+03 1.635e+03 3.345e+03, threshold=2.317e+03, percent-clipped=36.0 2023-06-24 03:14:12,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1642902.0, ans=0.1 2023-06-24 03:15:01,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1643022.0, ans=10.0 2023-06-24 03:15:07,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1643082.0, ans=0.125 2023-06-24 03:15:29,133 INFO [train.py:996] (2/4) Epoch 9, batch 29900, loss[loss=0.2708, simple_loss=0.3455, pruned_loss=0.09804, over 21457.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3154, pruned_loss=0.08136, over 4298106.44 frames. ], batch size: 131, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:16:50,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1643382.0, ans=0.125 2023-06-24 03:16:54,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1643382.0, ans=0.0 2023-06-24 03:17:08,326 INFO [train.py:996] (2/4) Epoch 9, batch 29950, loss[loss=0.2708, simple_loss=0.3306, pruned_loss=0.1055, over 21627.00 frames. 
], tot_loss[loss=0.244, simple_loss=0.3178, pruned_loss=0.08517, over 4295085.67 frames. ], batch size: 263, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:17:19,364 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.107e+02 5.748e+02 7.806e+02 1.232e+03 2.482e+03, threshold=1.561e+03, percent-clipped=2.0 2023-06-24 03:17:21,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1643442.0, ans=0.0 2023-06-24 03:17:26,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1643442.0, ans=0.125 2023-06-24 03:18:08,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.32 vs. limit=22.5 2023-06-24 03:18:39,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.41 vs. limit=10.0 2023-06-24 03:18:50,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1643682.0, ans=0.0 2023-06-24 03:19:00,236 INFO [train.py:996] (2/4) Epoch 9, batch 30000, loss[loss=0.2149, simple_loss=0.2924, pruned_loss=0.06871, over 20831.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3196, pruned_loss=0.08528, over 4287656.60 frames. ], batch size: 611, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:19:00,236 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 03:19:17,062 INFO [train.py:1028] (2/4) Epoch 9, validation: loss=0.2502, simple_loss=0.3471, pruned_loss=0.07663, over 1796401.00 frames. 2023-06-24 03:19:17,063 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 03:19:34,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1643742.0, ans=0.5 2023-06-24 03:19:35,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1643742.0, ans=0.0 2023-06-24 03:19:59,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.35 vs. limit=15.0 2023-06-24 03:20:25,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1643922.0, ans=0.0 2023-06-24 03:20:37,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1643922.0, ans=0.1 2023-06-24 03:21:03,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1644042.0, ans=0.0 2023-06-24 03:21:04,951 INFO [train.py:996] (2/4) Epoch 9, batch 30050, loss[loss=0.2962, simple_loss=0.4072, pruned_loss=0.09258, over 21649.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3211, pruned_loss=0.08116, over 4281881.39 frames. 
], batch size: 441, lr: 3.20e-03, grad_scale: 32.0 2023-06-24 03:21:11,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.143e+02 7.535e+02 1.024e+03 1.337e+03 2.624e+03, threshold=2.049e+03, percent-clipped=15.0 2023-06-24 03:21:11,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1644042.0, ans=0.125 2023-06-24 03:21:36,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1644102.0, ans=0.1 2023-06-24 03:22:41,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1644282.0, ans=0.125 2023-06-24 03:22:44,501 INFO [train.py:996] (2/4) Epoch 9, batch 30100, loss[loss=0.2247, simple_loss=0.2804, pruned_loss=0.08446, over 21261.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3215, pruned_loss=0.08159, over 4276803.99 frames. ], batch size: 177, lr: 3.20e-03, grad_scale: 16.0 2023-06-24 03:22:56,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1644342.0, ans=10.0 2023-06-24 03:22:58,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-24 03:23:09,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1644402.0, ans=0.0 2023-06-24 03:23:26,931 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.72 vs. limit=15.0 2023-06-24 03:23:51,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1644522.0, ans=0.125 2023-06-24 03:24:10,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1644582.0, ans=0.125 2023-06-24 03:24:13,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1644582.0, ans=0.035 2023-06-24 03:24:29,638 INFO [train.py:996] (2/4) Epoch 9, batch 30150, loss[loss=0.2674, simple_loss=0.3397, pruned_loss=0.09755, over 21228.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3184, pruned_loss=0.08262, over 4277610.03 frames. 
], batch size: 143, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:24:34,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1644642.0, ans=0.125 2023-06-24 03:24:38,061 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 6.683e+02 1.059e+03 1.463e+03 4.541e+03, threshold=2.119e+03, percent-clipped=12.0 2023-06-24 03:24:43,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1644642.0, ans=0.07 2023-06-24 03:24:47,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1644702.0, ans=0.125 2023-06-24 03:25:15,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1644762.0, ans=0.1 2023-06-24 03:25:33,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1644822.0, ans=0.1 2023-06-24 03:25:43,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1644822.0, ans=0.0 2023-06-24 03:25:48,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1644822.0, ans=0.125 2023-06-24 03:26:12,258 INFO [train.py:996] (2/4) Epoch 9, batch 30200, loss[loss=0.2232, simple_loss=0.2996, pruned_loss=0.07342, over 20674.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3188, pruned_loss=0.08178, over 4270302.70 frames. ], batch size: 607, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:26:23,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1644942.0, ans=0.0 2023-06-24 03:27:32,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1645122.0, ans=0.125 2023-06-24 03:27:57,816 INFO [train.py:996] (2/4) Epoch 9, batch 30250, loss[loss=0.3261, simple_loss=0.4246, pruned_loss=0.1138, over 21530.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3269, pruned_loss=0.08394, over 4276321.76 frames. ], batch size: 471, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:28:05,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.017e+02 5.727e+02 7.436e+02 1.048e+03 2.592e+03, threshold=1.487e+03, percent-clipped=2.0 2023-06-24 03:28:13,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1645242.0, ans=0.125 2023-06-24 03:28:21,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.29 vs. limit=10.0 2023-06-24 03:28:31,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1645302.0, ans=0.125 2023-06-24 03:28:35,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1645302.0, ans=0.1 2023-06-24 03:28:38,848 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. 
limit=6.0 2023-06-24 03:28:45,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1645362.0, ans=0.1 2023-06-24 03:29:36,891 INFO [train.py:996] (2/4) Epoch 9, batch 30300, loss[loss=0.1938, simple_loss=0.2618, pruned_loss=0.06293, over 21394.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3247, pruned_loss=0.08439, over 4263914.30 frames. ], batch size: 211, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:29:45,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1645542.0, ans=0.2 2023-06-24 03:29:48,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1645542.0, ans=0.2 2023-06-24 03:31:10,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1645782.0, ans=0.0 2023-06-24 03:31:28,041 INFO [train.py:996] (2/4) Epoch 9, batch 30350, loss[loss=0.285, simple_loss=0.3476, pruned_loss=0.1112, over 21562.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3264, pruned_loss=0.08614, over 4262735.59 frames. ], batch size: 414, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:31:36,354 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.062e+02 6.729e+02 9.654e+02 1.457e+03 3.930e+03, threshold=1.931e+03, percent-clipped=23.0 2023-06-24 03:31:52,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1645902.0, ans=0.0 2023-06-24 03:31:55,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1645902.0, ans=0.0 2023-06-24 03:32:21,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1646022.0, ans=0.125 2023-06-24 03:32:47,116 INFO [train.py:996] (2/4) Epoch 9, batch 30400, loss[loss=0.2058, simple_loss=0.2622, pruned_loss=0.07466, over 20352.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3185, pruned_loss=0.08373, over 4252541.72 frames. ], batch size: 703, lr: 3.19e-03, grad_scale: 32.0 2023-06-24 03:33:15,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-24 03:33:17,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=22.5 2023-06-24 03:33:26,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1646262.0, ans=0.1 2023-06-24 03:34:12,619 INFO [train.py:996] (2/4) Epoch 9, batch 30450, loss[loss=0.2704, simple_loss=0.3839, pruned_loss=0.07844, over 20020.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3191, pruned_loss=0.08251, over 4195778.12 frames. 
], batch size: 702, lr: 3.19e-03, grad_scale: 16.0 2023-06-24 03:34:21,424 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 7.756e+02 1.127e+03 2.078e+03 9.482e+03, threshold=2.254e+03, percent-clipped=27.0 2023-06-24 03:34:30,667 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:34:55,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.41 vs. limit=15.0 2023-06-24 03:34:59,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1646562.0, ans=0.5 2023-06-24 03:35:13,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=15.0 2023-06-24 03:36:54,763 INFO [train.py:996] (2/4) Epoch 10, batch 0, loss[loss=0.2067, simple_loss=0.2764, pruned_loss=0.06852, over 21776.00 frames. ], tot_loss[loss=0.2067, simple_loss=0.2764, pruned_loss=0.06852, over 21776.00 frames. ], batch size: 317, lr: 3.02e-03, grad_scale: 32.0 2023-06-24 03:36:54,764 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 03:37:07,567 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.1.encoder.layers.1.self_attn_weights, attn_weights_entropy = tensor([2.2010, 4.4467, 2.5192, 1.9502], device='cuda:2') 2023-06-24 03:37:10,567 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2396, simple_loss=0.3488, pruned_loss=0.06521, over 1796401.00 frames. 2023-06-24 03:37:10,568 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 03:37:22,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1646712.0, ans=0.125 2023-06-24 03:37:28,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1646712.0, ans=0.2 2023-06-24 03:37:30,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1646712.0, ans=0.0 2023-06-24 03:38:20,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1646892.0, ans=0.0 2023-06-24 03:38:42,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.19 vs. limit=12.0 2023-06-24 03:38:46,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1646952.0, ans=0.02 2023-06-24 03:38:49,150 INFO [train.py:996] (2/4) Epoch 10, batch 50, loss[loss=0.2516, simple_loss=0.334, pruned_loss=0.08462, over 21727.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3246, pruned_loss=0.08385, over 960573.76 frames. 
], batch size: 298, lr: 3.02e-03, grad_scale: 16.0 2023-06-24 03:39:02,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1647012.0, ans=0.0 2023-06-24 03:39:18,462 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.516e+02 8.479e+02 1.547e+03 2.623e+03 5.891e+03, threshold=3.095e+03, percent-clipped=28.0 2023-06-24 03:39:36,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1647132.0, ans=0.125 2023-06-24 03:39:48,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1647132.0, ans=0.1 2023-06-24 03:40:29,200 INFO [train.py:996] (2/4) Epoch 10, batch 100, loss[loss=0.3141, simple_loss=0.3817, pruned_loss=0.1233, over 21376.00 frames. ], tot_loss[loss=0.2547, simple_loss=0.3373, pruned_loss=0.08605, over 1689702.56 frames. ], batch size: 507, lr: 3.02e-03, grad_scale: 16.0 2023-06-24 03:40:50,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=12.0 2023-06-24 03:41:41,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.03 vs. limit=15.0 2023-06-24 03:42:05,166 INFO [train.py:996] (2/4) Epoch 10, batch 150, loss[loss=0.2837, simple_loss=0.3759, pruned_loss=0.09579, over 21469.00 frames. ], tot_loss[loss=0.2552, simple_loss=0.3387, pruned_loss=0.08586, over 2257526.29 frames. ], batch size: 471, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:42:39,493 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.031e+02 6.057e+02 8.805e+02 1.461e+03 2.839e+03, threshold=1.761e+03, percent-clipped=0.0 2023-06-24 03:43:05,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1647792.0, ans=0.125 2023-06-24 03:43:21,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.37 vs. limit=15.0 2023-06-24 03:43:46,536 INFO [train.py:996] (2/4) Epoch 10, batch 200, loss[loss=0.2364, simple_loss=0.3254, pruned_loss=0.0737, over 21642.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.335, pruned_loss=0.08327, over 2700795.48 frames. ], batch size: 263, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:43:48,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-24 03:44:09,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1647912.0, ans=0.0 2023-06-24 03:44:19,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1647972.0, ans=0.0 2023-06-24 03:44:44,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1648092.0, ans=0.0 2023-06-24 03:45:18,471 INFO [train.py:996] (2/4) Epoch 10, batch 250, loss[loss=0.263, simple_loss=0.3195, pruned_loss=0.1032, over 21586.00 frames. ], tot_loss[loss=0.2472, simple_loss=0.3296, pruned_loss=0.08242, over 3056692.15 frames. 
], batch size: 548, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:45:26,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1648212.0, ans=0.0 2023-06-24 03:45:48,952 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.127e+02 6.491e+02 8.586e+02 1.362e+03 2.608e+03, threshold=1.717e+03, percent-clipped=13.0 2023-06-24 03:45:54,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1648272.0, ans=0.125 2023-06-24 03:45:55,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1648272.0, ans=0.125 2023-06-24 03:46:29,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1648392.0, ans=0.125 2023-06-24 03:46:58,104 INFO [train.py:996] (2/4) Epoch 10, batch 300, loss[loss=0.236, simple_loss=0.3031, pruned_loss=0.08444, over 21375.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.326, pruned_loss=0.08358, over 3327759.69 frames. ], batch size: 143, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:47:18,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1648572.0, ans=0.125 2023-06-24 03:48:06,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1648692.0, ans=0.125 2023-06-24 03:48:10,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.48 vs. limit=15.0 2023-06-24 03:48:34,864 INFO [train.py:996] (2/4) Epoch 10, batch 350, loss[loss=0.2103, simple_loss=0.2867, pruned_loss=0.06694, over 21569.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3175, pruned_loss=0.0812, over 3532277.16 frames. ], batch size: 212, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:49:05,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1648872.0, ans=0.2 2023-06-24 03:49:06,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.104e+02 6.936e+02 9.570e+02 1.355e+03 2.301e+03, threshold=1.914e+03, percent-clipped=7.0 2023-06-24 03:49:09,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1648872.0, ans=0.125 2023-06-24 03:49:18,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-24 03:49:26,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1648932.0, ans=0.0 2023-06-24 03:49:58,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-06-24 03:50:02,211 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.54 vs. limit=22.5 2023-06-24 03:50:10,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1649052.0, ans=0.025 2023-06-24 03:50:14,758 INFO [train.py:996] (2/4) Epoch 10, batch 400, loss[loss=0.2552, simple_loss=0.3307, pruned_loss=0.08979, over 21875.00 frames. 
], tot_loss[loss=0.2352, simple_loss=0.3112, pruned_loss=0.0796, over 3697800.67 frames. ], batch size: 107, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:51:36,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1649352.0, ans=0.0 2023-06-24 03:51:36,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1649352.0, ans=0.1 2023-06-24 03:51:57,933 INFO [train.py:996] (2/4) Epoch 10, batch 450, loss[loss=0.19, simple_loss=0.2596, pruned_loss=0.06023, over 21632.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3078, pruned_loss=0.07775, over 3828996.64 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:52:04,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1649412.0, ans=0.04949747468305833 2023-06-24 03:52:22,593 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.065e+02 6.467e+02 1.066e+03 1.544e+03 3.388e+03, threshold=2.132e+03, percent-clipped=13.0 2023-06-24 03:52:23,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1649472.0, ans=0.125 2023-06-24 03:52:35,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1649532.0, ans=0.125 2023-06-24 03:53:12,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1649652.0, ans=0.125 2023-06-24 03:53:18,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1649652.0, ans=0.125 2023-06-24 03:53:29,633 INFO [train.py:996] (2/4) Epoch 10, batch 500, loss[loss=0.2366, simple_loss=0.3083, pruned_loss=0.08245, over 21609.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3092, pruned_loss=0.07792, over 3932474.76 frames. ], batch size: 391, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:54:19,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1649832.0, ans=0.125 2023-06-24 03:54:45,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1649952.0, ans=0.125 2023-06-24 03:55:07,315 INFO [train.py:996] (2/4) Epoch 10, batch 550, loss[loss=0.2134, simple_loss=0.286, pruned_loss=0.07037, over 21476.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3102, pruned_loss=0.07769, over 4008150.45 frames. 
], batch size: 131, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 03:55:23,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1650012.0, ans=0.0 2023-06-24 03:55:25,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1650012.0, ans=0.125 2023-06-24 03:55:28,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1650072.0, ans=0.125 2023-06-24 03:55:32,655 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.909e+02 8.917e+02 1.249e+03 2.003e+03 3.580e+03, threshold=2.497e+03, percent-clipped=21.0 2023-06-24 03:56:11,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1650192.0, ans=0.125 2023-06-24 03:56:17,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1650192.0, ans=0.125 2023-06-24 03:56:28,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1650252.0, ans=0.0 2023-06-24 03:56:35,865 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 03:56:40,136 INFO [train.py:996] (2/4) Epoch 10, batch 600, loss[loss=0.2247, simple_loss=0.3316, pruned_loss=0.05888, over 21731.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3134, pruned_loss=0.07796, over 4067580.20 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:57:17,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-24 03:58:02,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1650552.0, ans=0.125 2023-06-24 03:58:04,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1650552.0, ans=10.0 2023-06-24 03:58:13,507 INFO [train.py:996] (2/4) Epoch 10, batch 650, loss[loss=0.2494, simple_loss=0.3131, pruned_loss=0.09282, over 15127.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3158, pruned_loss=0.07941, over 4104731.29 frames. ], batch size: 61, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 03:58:44,448 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.130e+02 7.027e+02 1.084e+03 1.748e+03 3.374e+03, threshold=2.167e+03, percent-clipped=5.0 2023-06-24 03:58:51,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1650672.0, ans=0.1 2023-06-24 03:59:45,375 INFO [train.py:996] (2/4) Epoch 10, batch 700, loss[loss=0.2275, simple_loss=0.2873, pruned_loss=0.0839, over 21328.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3224, pruned_loss=0.08198, over 4147299.58 frames. ], batch size: 144, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 04:00:36,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1651032.0, ans=0.2 2023-06-24 04:00:54,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.96 vs. 
limit=15.0 2023-06-24 04:01:03,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1651092.0, ans=0.0 2023-06-24 04:01:04,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1651152.0, ans=0.95 2023-06-24 04:01:27,472 INFO [train.py:996] (2/4) Epoch 10, batch 750, loss[loss=0.2363, simple_loss=0.2928, pruned_loss=0.08991, over 21480.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3199, pruned_loss=0.0817, over 4178868.68 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 8.0 2023-06-24 04:01:32,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1651212.0, ans=0.0 2023-06-24 04:01:38,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.50 vs. limit=10.0 2023-06-24 04:01:46,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1651272.0, ans=0.125 2023-06-24 04:01:53,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.183e+02 6.075e+02 9.989e+02 1.388e+03 3.247e+03, threshold=1.998e+03, percent-clipped=7.0 2023-06-24 04:02:07,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1651332.0, ans=0.2 2023-06-24 04:02:09,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1651332.0, ans=0.125 2023-06-24 04:02:11,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1651332.0, ans=0.125 2023-06-24 04:02:14,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1651332.0, ans=0.125 2023-06-24 04:02:25,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1651392.0, ans=0.2 2023-06-24 04:02:31,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1651392.0, ans=6.0 2023-06-24 04:02:35,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1651392.0, ans=0.025 2023-06-24 04:02:41,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1651452.0, ans=0.125 2023-06-24 04:02:43,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1651452.0, ans=0.125 2023-06-24 04:03:01,135 INFO [train.py:996] (2/4) Epoch 10, batch 800, loss[loss=0.2604, simple_loss=0.3241, pruned_loss=0.09841, over 21699.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3156, pruned_loss=0.08149, over 4207699.20 frames. ], batch size: 441, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:03:23,748 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:04:00,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.90 vs. 
limit=15.0 2023-06-24 04:04:28,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1651752.0, ans=0.125 2023-06-24 04:04:39,118 INFO [train.py:996] (2/4) Epoch 10, batch 850, loss[loss=0.2155, simple_loss=0.2765, pruned_loss=0.07728, over 21723.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3117, pruned_loss=0.08093, over 4230160.88 frames. ], batch size: 247, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:05:06,130 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:05:10,503 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.435e+02 1.007e+03 1.415e+03 2.798e+03, threshold=2.014e+03, percent-clipped=8.0 2023-06-24 04:05:30,062 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:05:30,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1651932.0, ans=0.1 2023-06-24 04:05:33,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1651932.0, ans=0.1 2023-06-24 04:05:56,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1652052.0, ans=0.0 2023-06-24 04:06:16,296 INFO [train.py:996] (2/4) Epoch 10, batch 900, loss[loss=0.2137, simple_loss=0.2856, pruned_loss=0.0709, over 21158.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3095, pruned_loss=0.0811, over 4246809.31 frames. ], batch size: 143, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:06:18,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1652112.0, ans=0.125 2023-06-24 04:07:25,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1652292.0, ans=10.0 2023-06-24 04:07:39,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1652352.0, ans=0.125 2023-06-24 04:07:51,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1652352.0, ans=0.125 2023-06-24 04:08:04,374 INFO [train.py:996] (2/4) Epoch 10, batch 950, loss[loss=0.2463, simple_loss=0.3112, pruned_loss=0.09073, over 21909.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.308, pruned_loss=0.08065, over 4252140.02 frames. ], batch size: 414, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:08:27,070 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.827e+02 5.619e+02 7.658e+02 1.220e+03 3.060e+03, threshold=1.532e+03, percent-clipped=1.0 2023-06-24 04:08:45,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1652532.0, ans=0.0 2023-06-24 04:08:56,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1652592.0, ans=0.1 2023-06-24 04:09:39,793 INFO [train.py:996] (2/4) Epoch 10, batch 1000, loss[loss=0.2364, simple_loss=0.3238, pruned_loss=0.07451, over 21402.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3092, pruned_loss=0.08064, over 4264728.61 frames. 
], batch size: 548, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:09:40,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1652712.0, ans=0.125 2023-06-24 04:09:51,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.99 vs. limit=15.0 2023-06-24 04:10:02,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1652772.0, ans=0.0 2023-06-24 04:10:47,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1652892.0, ans=0.0 2023-06-24 04:11:19,948 INFO [train.py:996] (2/4) Epoch 10, batch 1050, loss[loss=0.2144, simple_loss=0.3082, pruned_loss=0.06029, over 21623.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3088, pruned_loss=0.08029, over 4273098.34 frames. ], batch size: 441, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:11:26,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1653012.0, ans=0.1 2023-06-24 04:11:46,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 8.659e+02 1.096e+03 1.679e+03 3.356e+03, threshold=2.191e+03, percent-clipped=32.0 2023-06-24 04:12:51,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1653252.0, ans=0.125 2023-06-24 04:12:53,884 INFO [train.py:996] (2/4) Epoch 10, batch 1100, loss[loss=0.2533, simple_loss=0.3521, pruned_loss=0.0772, over 21844.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.307, pruned_loss=0.07885, over 4278586.38 frames. ], batch size: 371, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:13:13,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1653372.0, ans=0.125 2023-06-24 04:13:14,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1653372.0, ans=0.2 2023-06-24 04:13:20,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1653372.0, ans=0.0 2023-06-24 04:13:50,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1653492.0, ans=0.125 2023-06-24 04:13:53,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1653492.0, ans=0.125 2023-06-24 04:14:32,906 INFO [train.py:996] (2/4) Epoch 10, batch 1150, loss[loss=0.3065, simple_loss=0.3547, pruned_loss=0.1291, over 21683.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3067, pruned_loss=0.07894, over 4279794.04 frames. 
], batch size: 507, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:14:47,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1653612.0, ans=0.2 2023-06-24 04:15:00,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.939e+02 5.998e+02 8.488e+02 1.315e+03 2.677e+03, threshold=1.698e+03, percent-clipped=3.0 2023-06-24 04:15:14,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1653732.0, ans=0.1 2023-06-24 04:15:20,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1653732.0, ans=0.0 2023-06-24 04:15:23,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1653732.0, ans=0.125 2023-06-24 04:15:27,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1653732.0, ans=0.0 2023-06-24 04:15:49,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1653792.0, ans=0.0 2023-06-24 04:15:51,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1653792.0, ans=0.125 2023-06-24 04:16:00,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1653852.0, ans=0.125 2023-06-24 04:16:17,783 INFO [train.py:996] (2/4) Epoch 10, batch 1200, loss[loss=0.2159, simple_loss=0.3008, pruned_loss=0.06548, over 21361.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3084, pruned_loss=0.07954, over 4279601.89 frames. ], batch size: 194, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:17:06,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1654032.0, ans=0.05 2023-06-24 04:17:56,439 INFO [train.py:996] (2/4) Epoch 10, batch 1250, loss[loss=0.2057, simple_loss=0.2897, pruned_loss=0.0608, over 21842.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3113, pruned_loss=0.08019, over 4282658.33 frames. ], batch size: 298, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:18:19,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.265e+02 6.637e+02 9.533e+02 1.248e+03 2.697e+03, threshold=1.907e+03, percent-clipped=13.0 2023-06-24 04:19:35,623 INFO [train.py:996] (2/4) Epoch 10, batch 1300, loss[loss=0.2435, simple_loss=0.3358, pruned_loss=0.07562, over 21655.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3131, pruned_loss=0.08003, over 4290569.43 frames. ], batch size: 263, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:19:37,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1654512.0, ans=0.125 2023-06-24 04:21:14,456 INFO [train.py:996] (2/4) Epoch 10, batch 1350, loss[loss=0.2452, simple_loss=0.3104, pruned_loss=0.09003, over 21836.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3148, pruned_loss=0.08074, over 4288761.49 frames. 
], batch size: 371, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:21:22,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1654812.0, ans=0.125 2023-06-24 04:21:25,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.46 vs. limit=10.0 2023-06-24 04:21:42,999 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.930e+02 9.209e+02 1.385e+03 4.036e+03, threshold=1.842e+03, percent-clipped=12.0 2023-06-24 04:21:48,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.75 vs. limit=15.0 2023-06-24 04:22:14,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1654992.0, ans=0.0 2023-06-24 04:22:36,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1655052.0, ans=0.0 2023-06-24 04:22:49,017 INFO [train.py:996] (2/4) Epoch 10, batch 1400, loss[loss=0.2553, simple_loss=0.3307, pruned_loss=0.08996, over 21380.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.314, pruned_loss=0.08072, over 4288047.67 frames. ], batch size: 548, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:23:39,919 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=15.0 2023-06-24 04:23:50,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=12.0 2023-06-24 04:24:13,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1655352.0, ans=0.07 2023-06-24 04:24:28,473 INFO [train.py:996] (2/4) Epoch 10, batch 1450, loss[loss=0.2519, simple_loss=0.3364, pruned_loss=0.08366, over 21332.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3157, pruned_loss=0.08202, over 4282046.30 frames. ], batch size: 549, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:24:32,160 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-06-24 04:24:56,573 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.216e+02 6.383e+02 1.021e+03 1.504e+03 2.934e+03, threshold=2.041e+03, percent-clipped=11.0 2023-06-24 04:25:02,962 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.33 vs. 
limit=6.0 2023-06-24 04:25:13,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1655532.0, ans=0.2 2023-06-24 04:25:32,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1655592.0, ans=0.125 2023-06-24 04:25:36,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1655592.0, ans=0.125 2023-06-24 04:25:36,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1655592.0, ans=0.125 2023-06-24 04:25:51,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.26 vs. limit=15.0 2023-06-24 04:25:52,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1655652.0, ans=0.0 2023-06-24 04:25:52,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1655652.0, ans=0.125 2023-06-24 04:26:07,441 INFO [train.py:996] (2/4) Epoch 10, batch 1500, loss[loss=0.2661, simple_loss=0.3299, pruned_loss=0.1012, over 21365.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3168, pruned_loss=0.08241, over 4279958.52 frames. ], batch size: 159, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:26:14,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1655712.0, ans=0.1 2023-06-24 04:26:38,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1655772.0, ans=0.0 2023-06-24 04:27:41,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1655952.0, ans=0.2 2023-06-24 04:27:50,585 INFO [train.py:996] (2/4) Epoch 10, batch 1550, loss[loss=0.1685, simple_loss=0.2538, pruned_loss=0.04161, over 21358.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3135, pruned_loss=0.08116, over 4275646.35 frames. ], batch size: 131, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:28:08,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1656012.0, ans=0.0 2023-06-24 04:28:24,607 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.204e+02 5.706e+02 8.802e+02 1.256e+03 2.211e+03, threshold=1.760e+03, percent-clipped=1.0 2023-06-24 04:28:25,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1656072.0, ans=0.1 2023-06-24 04:28:28,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1656072.0, ans=0.125 2023-06-24 04:29:00,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1656192.0, ans=0.125 2023-06-24 04:29:35,767 INFO [train.py:996] (2/4) Epoch 10, batch 1600, loss[loss=0.2269, simple_loss=0.3008, pruned_loss=0.0765, over 21809.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.31, pruned_loss=0.0802, over 4272585.26 frames. 
], batch size: 351, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:29:53,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.58 vs. limit=15.0 2023-06-24 04:30:46,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1656492.0, ans=0.125 2023-06-24 04:31:15,780 INFO [train.py:996] (2/4) Epoch 10, batch 1650, loss[loss=0.2598, simple_loss=0.3259, pruned_loss=0.0968, over 21769.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3081, pruned_loss=0.07937, over 4277078.77 frames. ], batch size: 389, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:31:42,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1656672.0, ans=0.125 2023-06-24 04:31:44,373 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.945e+02 6.141e+02 9.207e+02 1.280e+03 2.509e+03, threshold=1.841e+03, percent-clipped=8.0 2023-06-24 04:31:56,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1656732.0, ans=0.035 2023-06-24 04:31:59,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1656732.0, ans=0.0 2023-06-24 04:32:39,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1656852.0, ans=0.0 2023-06-24 04:32:41,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.84 vs. limit=10.0 2023-06-24 04:32:57,150 INFO [train.py:996] (2/4) Epoch 10, batch 1700, loss[loss=0.2077, simple_loss=0.3171, pruned_loss=0.04913, over 21012.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3094, pruned_loss=0.07902, over 4282494.66 frames. ], batch size: 607, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:33:39,363 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.46 vs. limit=15.0 2023-06-24 04:34:31,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1657152.0, ans=15.0 2023-06-24 04:34:45,114 INFO [train.py:996] (2/4) Epoch 10, batch 1750, loss[loss=0.2752, simple_loss=0.3465, pruned_loss=0.1019, over 21789.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3116, pruned_loss=0.07931, over 4279838.13 frames. ], batch size: 441, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:34:53,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-06-24 04:35:21,551 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.068e+02 6.335e+02 9.144e+02 1.525e+03 4.256e+03, threshold=1.829e+03, percent-clipped=17.0 2023-06-24 04:36:05,107 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.54 vs. 
limit=15.0 2023-06-24 04:36:14,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1657452.0, ans=0.125 2023-06-24 04:36:17,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1657452.0, ans=0.125 2023-06-24 04:36:18,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1657452.0, ans=0.125 2023-06-24 04:36:31,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1657512.0, ans=0.0 2023-06-24 04:36:32,785 INFO [train.py:996] (2/4) Epoch 10, batch 1800, loss[loss=0.1889, simple_loss=0.2627, pruned_loss=0.05759, over 21469.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3084, pruned_loss=0.07619, over 4284720.78 frames. ], batch size: 195, lr: 3.01e-03, grad_scale: 32.0 2023-06-24 04:36:53,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1657572.0, ans=0.2 2023-06-24 04:37:06,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1657572.0, ans=0.125 2023-06-24 04:38:13,201 INFO [train.py:996] (2/4) Epoch 10, batch 1850, loss[loss=0.2476, simple_loss=0.312, pruned_loss=0.09167, over 21887.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3096, pruned_loss=0.07458, over 4284281.33 frames. ], batch size: 124, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:38:34,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1657872.0, ans=0.0 2023-06-24 04:38:35,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1657872.0, ans=0.125 2023-06-24 04:38:43,369 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.154e+02 6.426e+02 1.042e+03 1.664e+03 4.444e+03, threshold=2.085e+03, percent-clipped=25.0 2023-06-24 04:38:48,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1657932.0, ans=0.125 2023-06-24 04:39:05,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-24 04:39:32,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=22.5 2023-06-24 04:39:52,044 INFO [train.py:996] (2/4) Epoch 10, batch 1900, loss[loss=0.2352, simple_loss=0.309, pruned_loss=0.0807, over 21600.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3127, pruned_loss=0.07512, over 4284693.48 frames. ], batch size: 212, lr: 3.01e-03, grad_scale: 16.0 2023-06-24 04:40:22,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.80 vs. limit=22.5 2023-06-24 04:41:31,874 INFO [train.py:996] (2/4) Epoch 10, batch 1950, loss[loss=0.1921, simple_loss=0.2542, pruned_loss=0.06503, over 21633.00 frames. ], tot_loss[loss=0.231, simple_loss=0.31, pruned_loss=0.07602, over 4286117.44 frames. 
], batch size: 282, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:41:47,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1658412.0, ans=0.0 2023-06-24 04:42:02,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.830e+02 7.074e+02 9.115e+02 1.415e+03 2.823e+03, threshold=1.823e+03, percent-clipped=5.0 2023-06-24 04:42:18,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1658532.0, ans=0.2 2023-06-24 04:42:41,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1658592.0, ans=0.125 2023-06-24 04:43:00,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=12.0 2023-06-24 04:43:06,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1658652.0, ans=0.0 2023-06-24 04:43:12,631 INFO [train.py:996] (2/4) Epoch 10, batch 2000, loss[loss=0.2117, simple_loss=0.2732, pruned_loss=0.07511, over 21730.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3069, pruned_loss=0.07452, over 4285203.49 frames. ], batch size: 351, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:43:14,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1658712.0, ans=0.125 2023-06-24 04:43:37,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1658772.0, ans=0.125 2023-06-24 04:44:42,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1658952.0, ans=0.125 2023-06-24 04:44:46,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.73 vs. limit=15.0 2023-06-24 04:44:57,314 INFO [train.py:996] (2/4) Epoch 10, batch 2050, loss[loss=0.2206, simple_loss=0.2953, pruned_loss=0.07299, over 21885.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3081, pruned_loss=0.07528, over 4291337.05 frames. ], batch size: 332, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:45:00,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1659012.0, ans=0.0 2023-06-24 04:45:07,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1659012.0, ans=0.0 2023-06-24 04:45:28,484 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.378e+02 7.380e+02 1.174e+03 1.683e+03 3.998e+03, threshold=2.349e+03, percent-clipped=22.0 2023-06-24 04:45:41,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1659132.0, ans=0.125 2023-06-24 04:45:47,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1659132.0, ans=0.125 2023-06-24 04:45:54,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1659192.0, ans=0.0 2023-06-24 04:46:37,781 INFO [train.py:996] (2/4) Epoch 10, batch 2100, loss[loss=0.2453, simple_loss=0.3185, pruned_loss=0.0861, over 21212.00 frames. 
], tot_loss[loss=0.2332, simple_loss=0.3122, pruned_loss=0.07715, over 4297305.60 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:47:21,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1659432.0, ans=0.0 2023-06-24 04:48:17,978 INFO [train.py:996] (2/4) Epoch 10, batch 2150, loss[loss=0.2343, simple_loss=0.3143, pruned_loss=0.07715, over 21691.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.31, pruned_loss=0.07762, over 4298418.36 frames. ], batch size: 391, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:48:21,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1659612.0, ans=0.0 2023-06-24 04:48:48,467 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.248e+02 6.485e+02 1.170e+03 1.690e+03 3.411e+03, threshold=2.340e+03, percent-clipped=8.0 2023-06-24 04:49:40,191 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2023-06-24 04:49:42,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1659852.0, ans=0.0 2023-06-24 04:49:58,081 INFO [train.py:996] (2/4) Epoch 10, batch 2200, loss[loss=0.2237, simple_loss=0.3056, pruned_loss=0.07088, over 21328.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3108, pruned_loss=0.07821, over 4296654.03 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:50:00,267 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 04:50:19,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1659972.0, ans=0.125 2023-06-24 04:50:49,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1660032.0, ans=0.125 2023-06-24 04:51:05,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1660092.0, ans=0.125 2023-06-24 04:51:25,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1660152.0, ans=0.125 2023-06-24 04:51:25,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.47 vs. limit=6.0 2023-06-24 04:51:37,239 INFO [train.py:996] (2/4) Epoch 10, batch 2250, loss[loss=0.1698, simple_loss=0.244, pruned_loss=0.04779, over 21410.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3076, pruned_loss=0.07719, over 4288090.17 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:52:08,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.133e+02 6.896e+02 1.012e+03 1.519e+03 4.116e+03, threshold=2.025e+03, percent-clipped=5.0 2023-06-24 04:53:15,532 INFO [train.py:996] (2/4) Epoch 10, batch 2300, loss[loss=0.1955, simple_loss=0.2557, pruned_loss=0.06764, over 21492.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3041, pruned_loss=0.0766, over 4282509.53 frames. ], batch size: 195, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:53:33,528 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.61 vs. 
limit=15.0 2023-06-24 04:53:43,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2023-06-24 04:54:06,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1660632.0, ans=0.0 2023-06-24 04:54:09,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1660632.0, ans=0.125 2023-06-24 04:54:21,832 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.34 vs. limit=22.5 2023-06-24 04:54:55,355 INFO [train.py:996] (2/4) Epoch 10, batch 2350, loss[loss=0.2435, simple_loss=0.3032, pruned_loss=0.09189, over 21190.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3019, pruned_loss=0.07707, over 4282403.26 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 04:55:15,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1660872.0, ans=0.0 2023-06-24 04:55:32,516 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.196e+02 7.285e+02 1.033e+03 1.548e+03 3.497e+03, threshold=2.065e+03, percent-clipped=14.0 2023-06-24 04:55:36,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.20 vs. limit=15.0 2023-06-24 04:55:39,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1660932.0, ans=0.125 2023-06-24 04:55:43,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1660932.0, ans=0.0 2023-06-24 04:56:34,551 INFO [train.py:996] (2/4) Epoch 10, batch 2400, loss[loss=0.2721, simple_loss=0.3319, pruned_loss=0.1062, over 21332.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3049, pruned_loss=0.07919, over 4278263.83 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:56:52,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1661112.0, ans=0.1 2023-06-24 04:57:30,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.66 vs. limit=15.0 2023-06-24 04:57:53,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1661292.0, ans=0.125 2023-06-24 04:58:00,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1661352.0, ans=0.0 2023-06-24 04:58:18,977 INFO [train.py:996] (2/4) Epoch 10, batch 2450, loss[loss=0.2417, simple_loss=0.3118, pruned_loss=0.08579, over 21618.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.309, pruned_loss=0.08141, over 4280536.10 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 04:58:50,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.358e+02 7.393e+02 1.203e+03 1.868e+03 3.512e+03, threshold=2.405e+03, percent-clipped=21.0 2023-06-24 04:59:58,051 INFO [train.py:996] (2/4) Epoch 10, batch 2500, loss[loss=0.2301, simple_loss=0.3073, pruned_loss=0.07643, over 21530.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3094, pruned_loss=0.08152, over 4273983.39 frames. 
], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:00:01,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1661712.0, ans=0.0 2023-06-24 05:00:19,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1661772.0, ans=0.125 2023-06-24 05:00:32,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1661772.0, ans=0.125 2023-06-24 05:00:37,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1661832.0, ans=0.125 2023-06-24 05:00:39,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1661832.0, ans=0.125 2023-06-24 05:01:33,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1661952.0, ans=0.2 2023-06-24 05:01:39,203 INFO [train.py:996] (2/4) Epoch 10, batch 2550, loss[loss=0.2889, simple_loss=0.3591, pruned_loss=0.1093, over 21783.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3097, pruned_loss=0.08109, over 4271839.04 frames. ], batch size: 118, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:01:53,301 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-24 05:02:01,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1662072.0, ans=0.125 2023-06-24 05:02:11,848 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.129e+02 7.144e+02 9.794e+02 1.361e+03 2.807e+03, threshold=1.959e+03, percent-clipped=4.0 2023-06-24 05:02:21,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1662132.0, ans=0.0 2023-06-24 05:02:30,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1662132.0, ans=0.0 2023-06-24 05:02:36,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1662192.0, ans=0.2 2023-06-24 05:02:51,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1662192.0, ans=0.0 2023-06-24 05:02:55,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1662252.0, ans=0.1 2023-06-24 05:02:58,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1662252.0, ans=15.0 2023-06-24 05:03:00,478 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:03:07,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1662252.0, ans=0.125 2023-06-24 05:03:17,553 INFO [train.py:996] (2/4) Epoch 10, batch 2600, loss[loss=0.1815, simple_loss=0.2453, pruned_loss=0.05885, over 21399.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3079, pruned_loss=0.08142, over 4276733.05 frames. 
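The recurring "[optim.py:471] Clipping_scale=2.0, grad-norm quartiles ... threshold=..., percent-clipped=..." entries report order statistics of recent gradient norms together with the clipping threshold currently in force. The log does not show how that threshold is derived, so the sketch below is only an assumption for illustration: it summarizes a window of recent norms with the same five quantiles and takes the threshold to be the clipping scale times the median.

import numpy as np

# Illustrative only: report min / 25% / 50% / 75% / max of recent gradient
# norms, like the "grad-norm quartiles" entries, and derive a threshold as
# clipping_scale * median. The real rule used by the training code is not
# visible in this log; the norm values below are made up.
def grad_norm_summary(recent_norms, clipping_scale=2.0):
    norms = np.asarray(recent_norms, dtype=float)
    quantiles = np.percentile(norms, [0, 25, 50, 75, 100])
    threshold = clipping_scale * quantiles[2]
    percent_clipped = 100.0 * float((norms > threshold).mean())
    return quantiles, threshold, percent_clipped

q, thr, pct = grad_norm_summary([410.0, 520.0, 650.0, 760.0, 980.0, 1400.0, 2800.0, 3300.0])
print("quartiles:", q, "threshold:", thr, "percent-clipped:", pct)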
], batch size: 211, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:04:31,662 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:04:43,187 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:04:58,946 INFO [train.py:996] (2/4) Epoch 10, batch 2650, loss[loss=0.2328, simple_loss=0.2892, pruned_loss=0.08817, over 21790.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3096, pruned_loss=0.08208, over 4275971.41 frames. ], batch size: 247, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:05:01,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1662612.0, ans=0.125 2023-06-24 05:05:13,294 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=12.0 2023-06-24 05:05:25,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1662672.0, ans=0.2 2023-06-24 05:05:31,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1662672.0, ans=0.125 2023-06-24 05:05:32,839 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.149e+02 7.187e+02 9.516e+02 1.311e+03 3.015e+03, threshold=1.903e+03, percent-clipped=11.0 2023-06-24 05:05:52,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1662732.0, ans=0.05 2023-06-24 05:05:53,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1662732.0, ans=0.0 2023-06-24 05:06:40,810 INFO [train.py:996] (2/4) Epoch 10, batch 2700, loss[loss=0.2225, simple_loss=0.2884, pruned_loss=0.07833, over 21632.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3099, pruned_loss=0.0829, over 4268939.05 frames. ], batch size: 230, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:07:08,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.72 vs. limit=12.0 2023-06-24 05:07:29,579 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. limit=10.0 2023-06-24 05:08:21,440 INFO [train.py:996] (2/4) Epoch 10, batch 2750, loss[loss=0.244, simple_loss=0.3455, pruned_loss=0.07125, over 20883.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3107, pruned_loss=0.08288, over 4267942.77 frames. 
], batch size: 607, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:08:54,972 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 8.097e+02 1.070e+03 1.539e+03 2.944e+03, threshold=2.139e+03, percent-clipped=11.0 2023-06-24 05:09:07,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1663332.0, ans=0.125 2023-06-24 05:09:26,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1663392.0, ans=0.025 2023-06-24 05:09:49,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1663452.0, ans=0.0 2023-06-24 05:10:07,006 INFO [train.py:996] (2/4) Epoch 10, batch 2800, loss[loss=0.2774, simple_loss=0.3674, pruned_loss=0.09366, over 21668.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3115, pruned_loss=0.0828, over 4267730.04 frames. ], batch size: 389, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 05:11:03,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.17 vs. limit=15.0 2023-06-24 05:11:26,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1663752.0, ans=0.0 2023-06-24 05:11:47,575 INFO [train.py:996] (2/4) Epoch 10, batch 2850, loss[loss=0.2047, simple_loss=0.2689, pruned_loss=0.07028, over 21582.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3134, pruned_loss=0.0833, over 4262327.62 frames. ], batch size: 230, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:12:27,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.081e+02 7.888e+02 1.288e+03 1.995e+03 6.558e+03, threshold=2.577e+03, percent-clipped=20.0 2023-06-24 05:12:34,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1663932.0, ans=0.0 2023-06-24 05:13:26,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1664112.0, ans=0.0 2023-06-24 05:13:27,318 INFO [train.py:996] (2/4) Epoch 10, batch 2900, loss[loss=0.248, simple_loss=0.3082, pruned_loss=0.09394, over 21937.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.311, pruned_loss=0.08293, over 4265853.34 frames. ], batch size: 316, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:14:21,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1664232.0, ans=0.125 2023-06-24 05:14:24,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1664292.0, ans=0.0 2023-06-24 05:14:44,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1664352.0, ans=0.0 2023-06-24 05:14:47,293 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:14:57,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1664352.0, ans=0.0 2023-06-24 05:15:05,613 INFO [train.py:996] (2/4) Epoch 10, batch 2950, loss[loss=0.2217, simple_loss=0.2952, pruned_loss=0.07411, over 21865.00 frames. ], tot_loss[loss=0.238, simple_loss=0.311, pruned_loss=0.08251, over 4274245.46 frames. 
], batch size: 298, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:15:45,389 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.895e+02 6.745e+02 8.632e+02 1.337e+03 3.191e+03, threshold=1.726e+03, percent-clipped=2.0 2023-06-24 05:16:00,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1664532.0, ans=0.125 2023-06-24 05:16:03,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1664592.0, ans=0.125 2023-06-24 05:16:08,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1664592.0, ans=0.1 2023-06-24 05:16:44,600 INFO [train.py:996] (2/4) Epoch 10, batch 3000, loss[loss=0.2268, simple_loss=0.327, pruned_loss=0.06329, over 19889.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3153, pruned_loss=0.08381, over 4277395.72 frames. ], batch size: 703, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:16:44,601 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 05:17:00,556 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2505, simple_loss=0.3452, pruned_loss=0.07794, over 1796401.00 frames. 2023-06-24 05:17:00,557 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 05:17:11,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1664712.0, ans=10.0 2023-06-24 05:17:37,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1664772.0, ans=10.0 2023-06-24 05:17:44,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1664832.0, ans=10.0 2023-06-24 05:18:35,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1664952.0, ans=0.0 2023-06-24 05:18:45,509 INFO [train.py:996] (2/4) Epoch 10, batch 3050, loss[loss=0.2203, simple_loss=0.318, pruned_loss=0.06133, over 21678.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3169, pruned_loss=0.08248, over 4273293.62 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:18:50,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1665012.0, ans=0.0 2023-06-24 05:18:51,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-24 05:18:53,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1665012.0, ans=0.125 2023-06-24 05:19:17,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1665072.0, ans=0.2 2023-06-24 05:19:21,824 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.132e+02 6.265e+02 9.515e+02 1.393e+03 2.651e+03, threshold=1.903e+03, percent-clipped=13.0 2023-06-24 05:20:06,216 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.71 vs. limit=15.0 2023-06-24 05:20:25,938 INFO [train.py:996] (2/4) Epoch 10, batch 3100, loss[loss=0.2482, simple_loss=0.3303, pruned_loss=0.08307, over 21297.00 frames. 
], tot_loss[loss=0.2394, simple_loss=0.3159, pruned_loss=0.08148, over 4276380.19 frames. ], batch size: 548, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:20:37,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1665312.0, ans=0.1 2023-06-24 05:21:11,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1665432.0, ans=0.05 2023-06-24 05:21:11,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1665432.0, ans=0.1 2023-06-24 05:21:36,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1665492.0, ans=0.125 2023-06-24 05:22:10,427 INFO [train.py:996] (2/4) Epoch 10, batch 3150, loss[loss=0.2242, simple_loss=0.3055, pruned_loss=0.07148, over 20719.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3176, pruned_loss=0.08175, over 4277740.75 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 8.0 2023-06-24 05:22:55,170 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.369e+02 7.727e+02 1.146e+03 1.592e+03 4.239e+03, threshold=2.292e+03, percent-clipped=10.0 2023-06-24 05:23:03,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1665732.0, ans=0.2 2023-06-24 05:23:47,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1665852.0, ans=0.125 2023-06-24 05:23:50,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1665852.0, ans=0.09899494936611666 2023-06-24 05:23:53,244 INFO [train.py:996] (2/4) Epoch 10, batch 3200, loss[loss=0.2353, simple_loss=0.3317, pruned_loss=0.06944, over 21711.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3192, pruned_loss=0.08239, over 4282630.76 frames. ], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:23:58,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1665912.0, ans=0.0 2023-06-24 05:24:13,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1665972.0, ans=0.0 2023-06-24 05:24:16,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1665972.0, ans=0.125 2023-06-24 05:24:18,554 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.10 vs. limit=22.5 2023-06-24 05:24:40,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1666032.0, ans=0.5 2023-06-24 05:24:44,169 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.79 vs. limit=10.0 2023-06-24 05:24:59,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.74 vs. 
limit=22.5 2023-06-24 05:25:32,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1666212.0, ans=0.125 2023-06-24 05:25:33,999 INFO [train.py:996] (2/4) Epoch 10, batch 3250, loss[loss=0.1972, simple_loss=0.264, pruned_loss=0.06523, over 21620.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3196, pruned_loss=0.08323, over 4281039.40 frames. ], batch size: 231, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:25:35,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1666212.0, ans=0.0 2023-06-24 05:26:04,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.68 vs. limit=15.0 2023-06-24 05:26:16,580 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.011e+02 5.714e+02 8.207e+02 1.474e+03 3.383e+03, threshold=1.641e+03, percent-clipped=8.0 2023-06-24 05:27:04,997 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.46 vs. limit=15.0 2023-06-24 05:27:15,029 INFO [train.py:996] (2/4) Epoch 10, batch 3300, loss[loss=0.2202, simple_loss=0.3171, pruned_loss=0.06165, over 21839.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.315, pruned_loss=0.08178, over 4284045.89 frames. ], batch size: 317, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:27:22,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1666512.0, ans=0.125 2023-06-24 05:27:43,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1666572.0, ans=0.125 2023-06-24 05:27:56,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=10.50 vs. limit=10.0 2023-06-24 05:28:07,056 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-24 05:28:55,170 INFO [train.py:996] (2/4) Epoch 10, batch 3350, loss[loss=0.2749, simple_loss=0.3328, pruned_loss=0.1085, over 21346.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3169, pruned_loss=0.08221, over 4281560.79 frames. ], batch size: 159, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:29:24,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1666872.0, ans=0.2 2023-06-24 05:29:42,337 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.528e+02 7.146e+02 1.056e+03 1.768e+03 3.632e+03, threshold=2.111e+03, percent-clipped=30.0 2023-06-24 05:30:22,435 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.17 vs. limit=15.0 2023-06-24 05:30:38,688 INFO [train.py:996] (2/4) Epoch 10, batch 3400, loss[loss=0.2343, simple_loss=0.2939, pruned_loss=0.08742, over 20010.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3175, pruned_loss=0.08331, over 4284410.41 frames. 
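Most of the "[scaling.py:182]" entries print a ScheduledFloat: a scalar such as a skip rate, dropout probability or balancer probability whose current value ("ans") depends on the batch count. A generic way to implement such a schedule is piecewise-linear interpolation over (batch_count, value) breakpoints; the sketch below shows that idea with made-up breakpoints and does not claim to reproduce the schedules configured in this run.

def scheduled_value(batch_count, schedule):
    """Piecewise-linear schedule over the batch count.

    schedule: list of (batch_count, value) breakpoints sorted by batch_count;
    the value is clamped to the first/last breakpoint outside that range.
    """
    b_prev, v_prev = schedule[0]
    if batch_count <= b_prev:
        return v_prev
    for b_next, v_next in schedule[1:]:
        if batch_count <= b_next:
            frac = (batch_count - b_prev) / (b_next - b_prev)
            return v_prev + frac * (v_next - v_prev)
        b_prev, v_prev = b_next, v_next
    return v_prev

# hypothetical skip rate decaying from 0.2 to 0.0 over the first 4000 batches
print(scheduled_value(0, [(0, 0.2), (4000, 0.0)]))          # 0.2
print(scheduled_value(1000, [(0, 0.2), (4000, 0.0)]))       # 0.15
print(scheduled_value(1_000_000, [(0, 0.2), (4000, 0.0)]))  # 0.0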
], batch size: 704, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:31:23,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1667232.0, ans=0.1 2023-06-24 05:31:56,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1667352.0, ans=0.07 2023-06-24 05:32:00,505 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-24 05:32:15,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1667352.0, ans=0.125 2023-06-24 05:32:18,951 INFO [train.py:996] (2/4) Epoch 10, batch 3450, loss[loss=0.2194, simple_loss=0.28, pruned_loss=0.07946, over 21312.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3126, pruned_loss=0.08285, over 4291174.85 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:32:43,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1667472.0, ans=0.125 2023-06-24 05:32:57,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1667472.0, ans=0.125 2023-06-24 05:32:59,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1667532.0, ans=0.125 2023-06-24 05:33:00,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.256e+02 9.142e+02 1.242e+03 1.836e+03 3.790e+03, threshold=2.483e+03, percent-clipped=19.0 2023-06-24 05:33:12,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1667532.0, ans=0.125 2023-06-24 05:34:01,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1667712.0, ans=0.2 2023-06-24 05:34:01,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-24 05:34:02,171 INFO [train.py:996] (2/4) Epoch 10, batch 3500, loss[loss=0.217, simple_loss=0.2935, pruned_loss=0.0703, over 21233.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3189, pruned_loss=0.08565, over 4287292.80 frames. ], batch size: 608, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:34:49,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.66 vs. limit=15.0 2023-06-24 05:35:14,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1667892.0, ans=0.125 2023-06-24 05:35:41,127 INFO [train.py:996] (2/4) Epoch 10, batch 3550, loss[loss=0.2325, simple_loss=0.308, pruned_loss=0.07857, over 21394.00 frames. ], tot_loss[loss=0.2482, simple_loss=0.3227, pruned_loss=0.08682, over 4284906.50 frames. ], batch size: 131, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:35:41,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1668012.0, ans=0.125 2023-06-24 05:36:00,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. 
limit=15.0 2023-06-24 05:36:06,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1668072.0, ans=0.1 2023-06-24 05:36:22,737 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.441e+02 8.129e+02 1.130e+03 1.802e+03 3.924e+03, threshold=2.259e+03, percent-clipped=11.0 2023-06-24 05:36:44,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1668192.0, ans=0.0 2023-06-24 05:36:54,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1668192.0, ans=0.1 2023-06-24 05:37:20,192 INFO [train.py:996] (2/4) Epoch 10, batch 3600, loss[loss=0.2384, simple_loss=0.3072, pruned_loss=0.08481, over 21602.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3173, pruned_loss=0.08652, over 4275184.04 frames. ], batch size: 263, lr: 3.00e-03, grad_scale: 32.0 2023-06-24 05:38:15,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1668432.0, ans=0.125 2023-06-24 05:38:16,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1668492.0, ans=0.125 2023-06-24 05:38:36,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1668492.0, ans=0.1 2023-06-24 05:38:58,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1668552.0, ans=0.0 2023-06-24 05:39:00,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1668612.0, ans=0.125 2023-06-24 05:39:01,633 INFO [train.py:996] (2/4) Epoch 10, batch 3650, loss[loss=0.2024, simple_loss=0.3166, pruned_loss=0.04404, over 20848.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3181, pruned_loss=0.08692, over 4272683.00 frames. ], batch size: 607, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:39:32,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1668672.0, ans=0.09899494936611666 2023-06-24 05:39:39,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1668672.0, ans=0.125 2023-06-24 05:39:43,917 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.119e+02 6.332e+02 8.468e+02 1.461e+03 3.139e+03, threshold=1.694e+03, percent-clipped=4.0 2023-06-24 05:39:44,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1668732.0, ans=0.1 2023-06-24 05:39:52,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1668732.0, ans=0.0 2023-06-24 05:40:35,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.22 vs. limit=15.0 2023-06-24 05:40:39,708 INFO [train.py:996] (2/4) Epoch 10, batch 3700, loss[loss=0.2766, simple_loss=0.3368, pruned_loss=0.1082, over 21799.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3143, pruned_loss=0.08544, over 4274681.12 frames. 
], batch size: 441, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:41:40,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1669092.0, ans=0.125 2023-06-24 05:41:55,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1669092.0, ans=0.1 2023-06-24 05:42:20,657 INFO [train.py:996] (2/4) Epoch 10, batch 3750, loss[loss=0.2485, simple_loss=0.3143, pruned_loss=0.09136, over 21894.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3141, pruned_loss=0.08497, over 4281084.71 frames. ], batch size: 351, lr: 3.00e-03, grad_scale: 16.0 2023-06-24 05:42:27,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1669212.0, ans=0.125 2023-06-24 05:42:29,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1669212.0, ans=0.125 2023-06-24 05:42:41,458 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.63 vs. limit=10.0 2023-06-24 05:42:45,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1669272.0, ans=0.125 2023-06-24 05:43:00,068 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.152e+02 7.119e+02 9.667e+02 1.381e+03 3.413e+03, threshold=1.933e+03, percent-clipped=11.0 2023-06-24 05:43:17,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1669332.0, ans=0.0 2023-06-24 05:43:23,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-24 05:43:45,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1669452.0, ans=0.1 2023-06-24 05:44:00,656 INFO [train.py:996] (2/4) Epoch 10, batch 3800, loss[loss=0.2027, simple_loss=0.2815, pruned_loss=0.06188, over 21628.00 frames. ], tot_loss[loss=0.238, simple_loss=0.311, pruned_loss=0.08249, over 4280897.29 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:44:07,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1669512.0, ans=0.125 2023-06-24 05:44:34,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1669632.0, ans=0.125 2023-06-24 05:45:26,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1669752.0, ans=0.0 2023-06-24 05:45:34,316 INFO [train.py:996] (2/4) Epoch 10, batch 3850, loss[loss=0.206, simple_loss=0.2666, pruned_loss=0.07267, over 21506.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3101, pruned_loss=0.0834, over 4271206.52 frames. ], batch size: 230, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:45:48,870 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.10 vs. limit=22.5 2023-06-24 05:46:04,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. 
limit=22.5 2023-06-24 05:46:06,325 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 6.682e+02 9.971e+02 1.611e+03 3.519e+03, threshold=1.994e+03, percent-clipped=16.0 2023-06-24 05:46:32,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1669992.0, ans=0.0 2023-06-24 05:46:37,234 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 05:46:54,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1670052.0, ans=0.125 2023-06-24 05:46:57,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1670052.0, ans=0.125 2023-06-24 05:47:06,818 INFO [train.py:996] (2/4) Epoch 10, batch 3900, loss[loss=0.2669, simple_loss=0.3206, pruned_loss=0.1065, over 21864.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.308, pruned_loss=0.08387, over 4269237.82 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:47:07,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1670112.0, ans=0.0 2023-06-24 05:47:43,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1670232.0, ans=0.125 2023-06-24 05:48:08,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1670292.0, ans=0.125 2023-06-24 05:48:51,304 INFO [train.py:996] (2/4) Epoch 10, batch 3950, loss[loss=0.2137, simple_loss=0.2796, pruned_loss=0.07388, over 21804.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3099, pruned_loss=0.08316, over 4263698.17 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:49:28,744 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.660e+02 7.100e+02 1.207e+03 1.862e+03 3.460e+03, threshold=2.413e+03, percent-clipped=21.0 2023-06-24 05:49:46,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1670592.0, ans=0.125 2023-06-24 05:49:58,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1670592.0, ans=0.2 2023-06-24 05:49:59,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1670592.0, ans=0.07 2023-06-24 05:50:03,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.70 vs. limit=10.0 2023-06-24 05:50:08,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1670652.0, ans=0.125 2023-06-24 05:50:29,174 INFO [train.py:996] (2/4) Epoch 10, batch 4000, loss[loss=0.1925, simple_loss=0.2602, pruned_loss=0.06236, over 21418.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3037, pruned_loss=0.08025, over 4264220.37 frames. ], batch size: 212, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:50:31,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.76 vs. 
limit=15.0 2023-06-24 05:50:33,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1670712.0, ans=0.1 2023-06-24 05:51:46,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1670892.0, ans=0.2 2023-06-24 05:51:46,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1670892.0, ans=0.125 2023-06-24 05:52:01,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.94 vs. limit=15.0 2023-06-24 05:52:09,790 INFO [train.py:996] (2/4) Epoch 10, batch 4050, loss[loss=0.2162, simple_loss=0.3084, pruned_loss=0.06199, over 21393.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3031, pruned_loss=0.07868, over 4265877.76 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:52:22,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1671012.0, ans=0.125 2023-06-24 05:52:51,705 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.639e+02 6.579e+02 8.856e+02 1.407e+03 2.917e+03, threshold=1.771e+03, percent-clipped=4.0 2023-06-24 05:53:34,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.81 vs. limit=22.5 2023-06-24 05:53:42,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1671252.0, ans=0.2 2023-06-24 05:53:49,030 INFO [train.py:996] (2/4) Epoch 10, batch 4100, loss[loss=0.2683, simple_loss=0.3301, pruned_loss=0.1033, over 21687.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3063, pruned_loss=0.07976, over 4275489.48 frames. ], batch size: 507, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:53:49,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1671312.0, ans=0.125 2023-06-24 05:54:02,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.76 vs. limit=22.5 2023-06-24 05:54:27,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1671372.0, ans=0.125 2023-06-24 05:54:50,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.18 vs. limit=15.0 2023-06-24 05:54:53,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1671492.0, ans=0.2 2023-06-24 05:54:53,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.90 vs. limit=15.0 2023-06-24 05:55:28,589 INFO [train.py:996] (2/4) Epoch 10, batch 4150, loss[loss=0.2775, simple_loss=0.3441, pruned_loss=0.1054, over 21553.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3075, pruned_loss=0.07708, over 4267117.17 frames. 
], batch size: 441, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 05:55:32,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1671612.0, ans=0.1 2023-06-24 05:55:33,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1671612.0, ans=0.125 2023-06-24 05:56:08,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1671732.0, ans=0.035 2023-06-24 05:56:17,909 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.253e+02 6.615e+02 8.834e+02 1.095e+03 2.475e+03, threshold=1.767e+03, percent-clipped=7.0 2023-06-24 05:56:48,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1671792.0, ans=0.125 2023-06-24 05:57:03,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1671852.0, ans=0.0 2023-06-24 05:57:08,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1671912.0, ans=0.0 2023-06-24 05:57:09,375 INFO [train.py:996] (2/4) Epoch 10, batch 4200, loss[loss=0.2253, simple_loss=0.2845, pruned_loss=0.08306, over 21819.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3073, pruned_loss=0.07657, over 4268411.26 frames. ], batch size: 107, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:58:47,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1672152.0, ans=0.0 2023-06-24 05:58:57,441 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1672152.0, ans=0.125 2023-06-24 05:58:59,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1672212.0, ans=0.0 2023-06-24 05:59:00,769 INFO [train.py:996] (2/4) Epoch 10, batch 4250, loss[loss=0.226, simple_loss=0.2949, pruned_loss=0.07852, over 20854.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3126, pruned_loss=0.07847, over 4268037.35 frames. ], batch size: 611, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 05:59:33,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1672272.0, ans=0.1 2023-06-24 05:59:40,888 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 7.026e+02 9.985e+02 1.582e+03 3.548e+03, threshold=1.997e+03, percent-clipped=19.0 2023-06-24 05:59:57,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-24 06:00:43,116 INFO [train.py:996] (2/4) Epoch 10, batch 4300, loss[loss=0.2289, simple_loss=0.3413, pruned_loss=0.05825, over 20749.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.319, pruned_loss=0.07997, over 4271469.08 frames. ], batch size: 608, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:00:44,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.96 vs. 
limit=15.0 2023-06-24 06:00:46,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1672512.0, ans=0.125 2023-06-24 06:01:24,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1672632.0, ans=0.1 2023-06-24 06:02:26,288 INFO [train.py:996] (2/4) Epoch 10, batch 4350, loss[loss=0.2256, simple_loss=0.2923, pruned_loss=0.0795, over 21603.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3172, pruned_loss=0.07861, over 4267057.51 frames. ], batch size: 298, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:03:05,530 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.531e+02 6.687e+02 1.042e+03 1.785e+03 5.548e+03, threshold=2.083e+03, percent-clipped=20.0 2023-06-24 06:03:25,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1672992.0, ans=0.125 2023-06-24 06:04:04,309 INFO [train.py:996] (2/4) Epoch 10, batch 4400, loss[loss=0.2206, simple_loss=0.2877, pruned_loss=0.07671, over 21764.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3129, pruned_loss=0.07774, over 4269949.07 frames. ], batch size: 112, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:04:58,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1673232.0, ans=0.125 2023-06-24 06:04:58,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1673232.0, ans=0.0 2023-06-24 06:05:20,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1673292.0, ans=0.95 2023-06-24 06:05:33,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1673352.0, ans=0.2 2023-06-24 06:05:45,292 INFO [train.py:996] (2/4) Epoch 10, batch 4450, loss[loss=0.2638, simple_loss=0.318, pruned_loss=0.1048, over 21282.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3214, pruned_loss=0.07987, over 4275772.23 frames. ], batch size: 471, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:06:20,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1673532.0, ans=0.0 2023-06-24 06:06:30,818 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.122e+02 7.408e+02 1.039e+03 1.368e+03 2.536e+03, threshold=2.077e+03, percent-clipped=7.0 2023-06-24 06:06:37,859 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-24 06:07:23,219 INFO [train.py:996] (2/4) Epoch 10, batch 4500, loss[loss=0.2448, simple_loss=0.3607, pruned_loss=0.06448, over 21202.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3241, pruned_loss=0.08213, over 4284530.29 frames. 
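The "[scaling.py:962] Whitening" entries compare a per-module statistic ("metric") against a whitening limit to flag activations whose covariance has drifted far from white. The exact statistic is not spelled out in the log, so the sketch below computes one standard whiteness measure purely as an illustration: it equals 1 when the covariance is a multiple of the identity and grows as the eigenvalue spread increases.

import numpy as np

def whiteness_metric(feats):
    """feats: array of shape (num_frames, num_channels).

    Returns num_channels * ||C||_F^2 / trace(C)^2 for the covariance C, which
    is exactly 1.0 when C is a multiple of the identity and larger otherwise.
    This is an illustrative proxy, not necessarily the metric logged above.
    """
    x = feats - feats.mean(axis=0, keepdims=True)
    cov = (x.T @ x) / x.shape[0]
    d = cov.shape[0]
    return d * float((cov * cov).sum()) / float(np.trace(cov)) ** 2

rng = np.random.default_rng(0)
white = rng.normal(size=(10000, 64))
print(whiteness_metric(white))                     # close to 1
print(whiteness_metric(white * np.arange(1, 65)))  # noticeably larger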
], batch size: 548, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:07:57,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1673772.0, ans=0.1 2023-06-24 06:08:17,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1673832.0, ans=0.1 2023-06-24 06:08:19,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1673832.0, ans=0.125 2023-06-24 06:09:07,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1674012.0, ans=0.2 2023-06-24 06:09:08,452 INFO [train.py:996] (2/4) Epoch 10, batch 4550, loss[loss=0.2746, simple_loss=0.3512, pruned_loss=0.09902, over 21481.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3249, pruned_loss=0.08179, over 4278675.29 frames. ], batch size: 211, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:09:10,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1674012.0, ans=0.125 2023-06-24 06:09:16,962 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:09:48,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1674132.0, ans=0.035 2023-06-24 06:09:55,277 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.449e+02 7.044e+02 9.475e+02 1.574e+03 2.834e+03, threshold=1.895e+03, percent-clipped=10.0 2023-06-24 06:10:11,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1674192.0, ans=0.025 2023-06-24 06:10:49,455 INFO [train.py:996] (2/4) Epoch 10, batch 4600, loss[loss=0.228, simple_loss=0.2982, pruned_loss=0.07892, over 21328.00 frames. ], tot_loss[loss=0.2475, simple_loss=0.328, pruned_loss=0.08347, over 4280629.09 frames. ], batch size: 176, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:10:49,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1674312.0, ans=0.125 2023-06-24 06:10:53,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1674312.0, ans=0.2 2023-06-24 06:12:19,627 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:12:27,162 INFO [train.py:996] (2/4) Epoch 10, batch 4650, loss[loss=0.1637, simple_loss=0.2401, pruned_loss=0.04365, over 21495.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3223, pruned_loss=0.08242, over 4276632.20 frames. 
], batch size: 212, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:12:39,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1674612.0, ans=0.2 2023-06-24 06:12:42,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1674612.0, ans=0.1 2023-06-24 06:12:59,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1674672.0, ans=0.125 2023-06-24 06:13:08,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1674732.0, ans=0.125 2023-06-24 06:13:08,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1674732.0, ans=0.04949747468305833 2023-06-24 06:13:18,204 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.211e+02 6.703e+02 9.514e+02 1.360e+03 2.442e+03, threshold=1.903e+03, percent-clipped=9.0 2023-06-24 06:13:24,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1674732.0, ans=0.0 2023-06-24 06:13:34,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=15.0 2023-06-24 06:13:40,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1674792.0, ans=0.125 2023-06-24 06:14:05,739 INFO [train.py:996] (2/4) Epoch 10, batch 4700, loss[loss=0.2047, simple_loss=0.2657, pruned_loss=0.07189, over 21236.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.312, pruned_loss=0.07954, over 4279444.86 frames. ], batch size: 159, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:14:29,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.24 vs. limit=15.0 2023-06-24 06:14:38,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1674972.0, ans=0.1 2023-06-24 06:14:39,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1674972.0, ans=0.2 2023-06-24 06:14:48,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1675032.0, ans=0.0 2023-06-24 06:14:56,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1675032.0, ans=0.1 2023-06-24 06:15:38,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1675152.0, ans=0.0 2023-06-24 06:15:44,552 INFO [train.py:996] (2/4) Epoch 10, batch 4750, loss[loss=0.2341, simple_loss=0.2963, pruned_loss=0.08599, over 21603.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3063, pruned_loss=0.0799, over 4278847.05 frames. 
], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:16:17,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1675272.0, ans=0.125 2023-06-24 06:16:35,095 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.386e+02 6.636e+02 9.717e+02 1.456e+03 3.310e+03, threshold=1.943e+03, percent-clipped=9.0 2023-06-24 06:16:56,584 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1675392.0, ans=0.0 2023-06-24 06:16:59,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1675392.0, ans=0.1 2023-06-24 06:17:24,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1675452.0, ans=0.02 2023-06-24 06:17:27,576 INFO [train.py:996] (2/4) Epoch 10, batch 4800, loss[loss=0.2149, simple_loss=0.2845, pruned_loss=0.07266, over 21135.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3068, pruned_loss=0.08005, over 4285790.69 frames. ], batch size: 143, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:17:50,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.72 vs. limit=15.0 2023-06-24 06:17:54,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1675572.0, ans=0.0 2023-06-24 06:18:29,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1675692.0, ans=0.0 2023-06-24 06:18:36,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1675692.0, ans=0.125 2023-06-24 06:18:57,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1675752.0, ans=0.2 2023-06-24 06:19:05,390 INFO [train.py:996] (2/4) Epoch 10, batch 4850, loss[loss=0.2152, simple_loss=0.2822, pruned_loss=0.07403, over 21845.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3068, pruned_loss=0.07984, over 4284146.92 frames. ], batch size: 118, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:19:53,822 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.198e+02 7.060e+02 1.085e+03 1.594e+03 2.809e+03, threshold=2.169e+03, percent-clipped=13.0 2023-06-24 06:19:58,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1675932.0, ans=0.125 2023-06-24 06:20:28,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1676052.0, ans=15.0 2023-06-24 06:20:42,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1676052.0, ans=0.5 2023-06-24 06:20:45,006 INFO [train.py:996] (2/4) Epoch 10, batch 4900, loss[loss=0.2519, simple_loss=0.3965, pruned_loss=0.05363, over 20834.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3087, pruned_loss=0.08036, over 4283803.56 frames. 
], batch size: 607, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:21:06,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1676172.0, ans=0.0 2023-06-24 06:21:06,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1676172.0, ans=0.125 2023-06-24 06:21:25,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1676232.0, ans=0.1 2023-06-24 06:21:32,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-24 06:21:49,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1676292.0, ans=0.1 2023-06-24 06:22:11,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.95 vs. limit=15.0 2023-06-24 06:22:14,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.79 vs. limit=15.0 2023-06-24 06:22:20,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1676352.0, ans=0.0 2023-06-24 06:22:29,689 INFO [train.py:996] (2/4) Epoch 10, batch 4950, loss[loss=0.208, simple_loss=0.3096, pruned_loss=0.05316, over 21770.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3124, pruned_loss=0.07876, over 4283382.00 frames. ], batch size: 282, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:23:12,527 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.855e+02 5.943e+02 9.817e+02 1.512e+03 3.334e+03, threshold=1.963e+03, percent-clipped=7.0 2023-06-24 06:24:03,550 INFO [train.py:996] (2/4) Epoch 10, batch 5000, loss[loss=0.2088, simple_loss=0.2931, pruned_loss=0.06224, over 21611.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.312, pruned_loss=0.07668, over 4280509.83 frames. ], batch size: 230, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:24:12,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1676712.0, ans=0.125 2023-06-24 06:24:46,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1676832.0, ans=0.2 2023-06-24 06:24:51,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1676832.0, ans=0.125 2023-06-24 06:25:04,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.97 vs. limit=10.0 2023-06-24 06:25:16,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1676892.0, ans=0.125 2023-06-24 06:25:29,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1676952.0, ans=0.0 2023-06-24 06:25:38,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1676952.0, ans=0.125 2023-06-24 06:25:41,899 INFO [train.py:996] (2/4) Epoch 10, batch 5050, loss[loss=0.2346, simple_loss=0.3034, pruned_loss=0.08285, over 21822.00 frames. 
], tot_loss[loss=0.2334, simple_loss=0.3114, pruned_loss=0.07771, over 4286877.32 frames. ], batch size: 282, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:26:30,085 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.758e+02 6.527e+02 8.985e+02 1.399e+03 2.450e+03, threshold=1.797e+03, percent-clipped=5.0 2023-06-24 06:26:35,216 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:27:21,191 INFO [train.py:996] (2/4) Epoch 10, batch 5100, loss[loss=0.1738, simple_loss=0.2579, pruned_loss=0.04484, over 21797.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3088, pruned_loss=0.07837, over 4295231.82 frames. ], batch size: 247, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:27:57,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-24 06:27:59,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=15.0 2023-06-24 06:28:18,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1677492.0, ans=0.125 2023-06-24 06:28:42,100 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-24 06:29:01,696 INFO [train.py:996] (2/4) Epoch 10, batch 5150, loss[loss=0.2159, simple_loss=0.289, pruned_loss=0.07138, over 21628.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3077, pruned_loss=0.07922, over 4298863.71 frames. ], batch size: 263, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:29:06,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1677612.0, ans=0.125 2023-06-24 06:29:21,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1677672.0, ans=0.2 2023-06-24 06:29:32,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1677672.0, ans=0.0 2023-06-24 06:29:50,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.999e+02 6.243e+02 9.247e+02 1.548e+03 4.552e+03, threshold=1.849e+03, percent-clipped=17.0 2023-06-24 06:30:05,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1677792.0, ans=0.0 2023-06-24 06:30:41,397 INFO [train.py:996] (2/4) Epoch 10, batch 5200, loss[loss=0.2524, simple_loss=0.3447, pruned_loss=0.08001, over 21853.00 frames. ], tot_loss[loss=0.235, simple_loss=0.31, pruned_loss=0.07999, over 4292727.17 frames. 
], batch size: 316, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:30:51,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1677912.0, ans=0.125 2023-06-24 06:31:09,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1677972.0, ans=0.125 2023-06-24 06:31:25,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1678032.0, ans=10.0 2023-06-24 06:32:18,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1678212.0, ans=0.0 2023-06-24 06:32:20,307 INFO [train.py:996] (2/4) Epoch 10, batch 5250, loss[loss=0.2168, simple_loss=0.294, pruned_loss=0.06979, over 21777.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3152, pruned_loss=0.07932, over 4288403.66 frames. ], batch size: 112, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:32:33,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1678212.0, ans=0.1 2023-06-24 06:32:35,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1678212.0, ans=0.125 2023-06-24 06:32:54,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.20 vs. limit=22.5 2023-06-24 06:33:09,824 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.205e+02 5.630e+02 8.256e+02 1.146e+03 2.990e+03, threshold=1.651e+03, percent-clipped=4.0 2023-06-24 06:33:16,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1678332.0, ans=0.125 2023-06-24 06:34:00,822 INFO [train.py:996] (2/4) Epoch 10, batch 5300, loss[loss=0.2416, simple_loss=0.3067, pruned_loss=0.08822, over 21903.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3135, pruned_loss=0.07907, over 4288427.01 frames. ], batch size: 371, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:34:29,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1678572.0, ans=0.0 2023-06-24 06:34:36,313 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 06:34:39,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1678572.0, ans=0.125 2023-06-24 06:35:38,867 INFO [train.py:996] (2/4) Epoch 10, batch 5350, loss[loss=0.2267, simple_loss=0.2915, pruned_loss=0.08093, over 21722.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3124, pruned_loss=0.08039, over 4289366.27 frames. 
], batch size: 112, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:35:41,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1678812.0, ans=0.2 2023-06-24 06:36:04,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1678872.0, ans=0.125 2023-06-24 06:36:23,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.431e+02 6.516e+02 8.462e+02 1.218e+03 2.526e+03, threshold=1.692e+03, percent-clipped=10.0 2023-06-24 06:36:50,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1678992.0, ans=0.125 2023-06-24 06:37:13,505 INFO [train.py:996] (2/4) Epoch 10, batch 5400, loss[loss=0.2324, simple_loss=0.3038, pruned_loss=0.08048, over 21754.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3094, pruned_loss=0.08091, over 4299494.53 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:38:32,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1679292.0, ans=0.125 2023-06-24 06:38:58,884 INFO [train.py:996] (2/4) Epoch 10, batch 5450, loss[loss=0.2768, simple_loss=0.3631, pruned_loss=0.09527, over 21527.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3104, pruned_loss=0.07924, over 4294844.57 frames. ], batch size: 471, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:39:07,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.60 vs. limit=6.0 2023-06-24 06:39:08,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.16 vs. limit=15.0 2023-06-24 06:39:53,558 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.102e+02 6.727e+02 1.128e+03 1.838e+03 3.883e+03, threshold=2.256e+03, percent-clipped=29.0 2023-06-24 06:40:10,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1679592.0, ans=0.1 2023-06-24 06:40:43,772 INFO [train.py:996] (2/4) Epoch 10, batch 5500, loss[loss=0.2665, simple_loss=0.3644, pruned_loss=0.0843, over 21481.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3158, pruned_loss=0.07697, over 4295828.09 frames. ], batch size: 471, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:40:44,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.96 vs. limit=15.0 2023-06-24 06:40:50,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1679712.0, ans=0.125 2023-06-24 06:41:28,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1679832.0, ans=0.1 2023-06-24 06:42:24,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.26 vs. limit=22.5 2023-06-24 06:42:31,487 INFO [train.py:996] (2/4) Epoch 10, batch 5550, loss[loss=0.2167, simple_loss=0.3092, pruned_loss=0.06208, over 21710.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3171, pruned_loss=0.07457, over 4288567.35 frames. 
], batch size: 298, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:43:06,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1680132.0, ans=0.125 2023-06-24 06:43:16,337 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.836e+02 5.799e+02 8.739e+02 1.452e+03 3.739e+03, threshold=1.748e+03, percent-clipped=10.0 2023-06-24 06:44:16,153 INFO [train.py:996] (2/4) Epoch 10, batch 5600, loss[loss=0.3365, simple_loss=0.4219, pruned_loss=0.1256, over 21410.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3154, pruned_loss=0.07202, over 4282119.63 frames. ], batch size: 507, lr: 2.99e-03, grad_scale: 32.0 2023-06-24 06:44:55,761 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.47 vs. limit=6.0 2023-06-24 06:45:25,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1680492.0, ans=0.125 2023-06-24 06:45:55,093 INFO [train.py:996] (2/4) Epoch 10, batch 5650, loss[loss=0.274, simple_loss=0.3429, pruned_loss=0.1025, over 21757.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3187, pruned_loss=0.07404, over 4277775.42 frames. ], batch size: 441, lr: 2.99e-03, grad_scale: 16.0 2023-06-24 06:46:46,042 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.674e+02 6.342e+02 8.278e+02 1.256e+03 3.323e+03, threshold=1.656e+03, percent-clipped=10.0 2023-06-24 06:46:49,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1680732.0, ans=0.125 2023-06-24 06:47:34,640 INFO [train.py:996] (2/4) Epoch 10, batch 5700, loss[loss=0.2103, simple_loss=0.2777, pruned_loss=0.07142, over 21252.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3174, pruned_loss=0.0759, over 4276137.61 frames. ], batch size: 608, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:47:46,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-24 06:48:18,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1681032.0, ans=0.0 2023-06-24 06:48:52,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.42 vs. limit=12.0 2023-06-24 06:49:15,853 INFO [train.py:996] (2/4) Epoch 10, batch 5750, loss[loss=0.1837, simple_loss=0.2744, pruned_loss=0.04652, over 21676.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3143, pruned_loss=0.07316, over 4271961.73 frames. 
], batch size: 263, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:49:58,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1681332.0, ans=0.125 2023-06-24 06:50:12,185 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.981e+02 6.985e+02 1.085e+03 1.966e+03 4.482e+03, threshold=2.170e+03, percent-clipped=31.0 2023-06-24 06:50:31,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1681392.0, ans=0.1 2023-06-24 06:50:54,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1681512.0, ans=0.125 2023-06-24 06:50:55,673 INFO [train.py:996] (2/4) Epoch 10, batch 5800, loss[loss=0.2384, simple_loss=0.333, pruned_loss=0.07188, over 21829.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3122, pruned_loss=0.07128, over 4277713.87 frames. ], batch size: 316, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:51:25,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1681572.0, ans=0.0 2023-06-24 06:52:02,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1681692.0, ans=0.0 2023-06-24 06:52:16,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1681692.0, ans=0.2 2023-06-24 06:52:27,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1681752.0, ans=0.0 2023-06-24 06:52:40,267 INFO [train.py:996] (2/4) Epoch 10, batch 5850, loss[loss=0.1848, simple_loss=0.281, pruned_loss=0.04426, over 21729.00 frames. ], tot_loss[loss=0.2227, simple_loss=0.3101, pruned_loss=0.06768, over 4275885.13 frames. ], batch size: 247, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:52:50,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.95 vs. limit=15.0 2023-06-24 06:53:02,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.37 vs. limit=15.0 2023-06-24 06:53:36,615 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.591e+02 5.271e+02 8.161e+02 1.450e+03 2.978e+03, threshold=1.632e+03, percent-clipped=6.0 2023-06-24 06:53:54,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1681992.0, ans=0.125 2023-06-24 06:54:23,926 INFO [train.py:996] (2/4) Epoch 10, batch 5900, loss[loss=0.2254, simple_loss=0.294, pruned_loss=0.07843, over 21243.00 frames. ], tot_loss[loss=0.2146, simple_loss=0.303, pruned_loss=0.0631, over 4273279.23 frames. 
], batch size: 608, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:54:34,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1682112.0, ans=0.0 2023-06-24 06:55:09,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1682232.0, ans=0.1 2023-06-24 06:55:16,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1682232.0, ans=0.125 2023-06-24 06:55:35,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1682352.0, ans=0.0 2023-06-24 06:55:56,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=4.70 vs. limit=6.0 2023-06-24 06:56:02,439 INFO [train.py:996] (2/4) Epoch 10, batch 5950, loss[loss=0.2022, simple_loss=0.2664, pruned_loss=0.06902, over 21581.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.3005, pruned_loss=0.06597, over 4276178.96 frames. ], batch size: 263, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:56:24,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1682472.0, ans=0.2 2023-06-24 06:56:37,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1682472.0, ans=0.125 2023-06-24 06:56:37,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.24 vs. limit=15.0 2023-06-24 06:56:52,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.371e+02 5.849e+02 7.878e+02 1.100e+03 2.007e+03, threshold=1.576e+03, percent-clipped=6.0 2023-06-24 06:57:31,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1682652.0, ans=0.0 2023-06-24 06:57:40,427 INFO [train.py:996] (2/4) Epoch 10, batch 6000, loss[loss=0.1741, simple_loss=0.2388, pruned_loss=0.05465, over 21470.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2965, pruned_loss=0.06886, over 4272529.10 frames. ], batch size: 212, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 06:57:40,427 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 06:57:54,888 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.2663, 4.2718, 3.8868, 3.8688], device='cuda:2') 2023-06-24 06:57:59,728 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2611, simple_loss=0.3564, pruned_loss=0.0829, over 1796401.00 frames. 2023-06-24 06:57:59,728 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 06:58:00,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1682712.0, ans=0.2 2023-06-24 06:58:04,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1682712.0, ans=0.1 2023-06-24 06:58:30,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1682772.0, ans=0.0 2023-06-24 06:59:16,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.17 vs. 
limit=15.0 2023-06-24 06:59:38,434 INFO [train.py:996] (2/4) Epoch 10, batch 6050, loss[loss=0.2199, simple_loss=0.2857, pruned_loss=0.07699, over 21834.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2921, pruned_loss=0.07015, over 4278990.73 frames. ], batch size: 107, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 06:59:51,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1683012.0, ans=0.125 2023-06-24 07:00:00,569 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.68 vs. limit=15.0 2023-06-24 07:00:05,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1683072.0, ans=0.125 2023-06-24 07:00:26,521 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.624e+02 5.765e+02 7.695e+02 1.085e+03 2.275e+03, threshold=1.539e+03, percent-clipped=10.0 2023-06-24 07:00:44,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1683192.0, ans=0.0 2023-06-24 07:01:16,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. limit=10.0 2023-06-24 07:01:17,062 INFO [train.py:996] (2/4) Epoch 10, batch 6100, loss[loss=0.2447, simple_loss=0.3102, pruned_loss=0.08964, over 21520.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2924, pruned_loss=0.06953, over 4281715.41 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:02:45,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1683552.0, ans=0.125 2023-06-24 07:02:59,567 INFO [train.py:996] (2/4) Epoch 10, batch 6150, loss[loss=0.204, simple_loss=0.2794, pruned_loss=0.06433, over 21500.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2963, pruned_loss=0.07174, over 4288901.36 frames. ], batch size: 195, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:03:51,698 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.961e+02 6.743e+02 9.296e+02 1.382e+03 3.230e+03, threshold=1.859e+03, percent-clipped=18.0 2023-06-24 07:04:38,156 INFO [train.py:996] (2/4) Epoch 10, batch 6200, loss[loss=0.238, simple_loss=0.313, pruned_loss=0.08153, over 21369.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3004, pruned_loss=0.0727, over 4287118.53 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:04:39,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.05 vs. limit=12.0 2023-06-24 07:04:43,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1683912.0, ans=0.0 2023-06-24 07:04:44,931 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:05:27,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1684032.0, ans=0.125 2023-06-24 07:06:02,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1684152.0, ans=0.035 2023-06-24 07:06:17,931 INFO [train.py:996] (2/4) Epoch 10, batch 6250, loss[loss=0.2324, simple_loss=0.3127, pruned_loss=0.07603, over 21465.00 frames. 
], tot_loss[loss=0.2245, simple_loss=0.3039, pruned_loss=0.07261, over 4284683.45 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:06:50,257 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-24 07:07:04,161 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.548e-03 2023-06-24 07:07:09,760 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.397e+02 7.386e+02 1.187e+03 1.704e+03 4.027e+03, threshold=2.375e+03, percent-clipped=21.0 2023-06-24 07:07:31,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1684452.0, ans=0.125 2023-06-24 07:07:56,045 INFO [train.py:996] (2/4) Epoch 10, batch 6300, loss[loss=0.2142, simple_loss=0.3198, pruned_loss=0.05434, over 20876.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.307, pruned_loss=0.07124, over 4294361.00 frames. ], batch size: 608, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:08:49,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1684632.0, ans=0.1 2023-06-24 07:08:58,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1684692.0, ans=0.0 2023-06-24 07:08:58,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1684692.0, ans=0.2 2023-06-24 07:09:16,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.18 vs. limit=15.0 2023-06-24 07:09:18,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1684752.0, ans=0.0 2023-06-24 07:09:34,389 INFO [train.py:996] (2/4) Epoch 10, batch 6350, loss[loss=0.2385, simple_loss=0.3118, pruned_loss=0.08254, over 21819.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3099, pruned_loss=0.07418, over 4285257.47 frames. ], batch size: 351, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:10:12,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1684872.0, ans=0.125 2023-06-24 07:10:17,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.01 vs. limit=22.5 2023-06-24 07:10:20,005 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:10:26,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1684932.0, ans=0.07 2023-06-24 07:10:27,521 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.068e+02 5.999e+02 8.518e+02 1.321e+03 2.305e+03, threshold=1.704e+03, percent-clipped=0.0 2023-06-24 07:10:48,708 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:10:52,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1684992.0, ans=0.125 2023-06-24 07:11:14,590 INFO [train.py:996] (2/4) Epoch 10, batch 6400, loss[loss=0.2578, simple_loss=0.3275, pruned_loss=0.09402, over 21596.00 frames. 
], tot_loss[loss=0.2371, simple_loss=0.3161, pruned_loss=0.07903, over 4282919.31 frames. ], batch size: 415, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:11:26,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.22 vs. limit=12.0 2023-06-24 07:12:00,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1685232.0, ans=0.125 2023-06-24 07:12:07,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1685232.0, ans=0.2 2023-06-24 07:12:26,781 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-24 07:12:27,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1685292.0, ans=0.1 2023-06-24 07:12:35,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1685292.0, ans=0.125 2023-06-24 07:12:44,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1685352.0, ans=0.0 2023-06-24 07:12:59,593 INFO [train.py:996] (2/4) Epoch 10, batch 6450, loss[loss=0.2083, simple_loss=0.3023, pruned_loss=0.05712, over 21409.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3177, pruned_loss=0.07829, over 4283373.79 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:13:42,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1685532.0, ans=0.025 2023-06-24 07:13:56,601 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.370e+02 9.137e+02 1.216e+03 1.629e+03 2.950e+03, threshold=2.432e+03, percent-clipped=21.0 2023-06-24 07:14:03,514 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1685592.0, ans=0.0 2023-06-24 07:14:16,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.14 vs. limit=10.0 2023-06-24 07:14:19,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1685652.0, ans=0.2 2023-06-24 07:14:41,230 INFO [train.py:996] (2/4) Epoch 10, batch 6500, loss[loss=0.2071, simple_loss=0.2856, pruned_loss=0.06428, over 21521.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3102, pruned_loss=0.07719, over 4270334.68 frames. ], batch size: 389, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:14:55,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1685712.0, ans=0.2 2023-06-24 07:15:02,309 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.45 vs. limit=15.0 2023-06-24 07:15:08,496 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.79 vs. 
limit=15.0 2023-06-24 07:15:09,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff3.min_abs, batch_count=1685772.0, ans=0.2 2023-06-24 07:15:11,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1685772.0, ans=0.125 2023-06-24 07:16:19,598 INFO [train.py:996] (2/4) Epoch 10, batch 6550, loss[loss=0.2475, simple_loss=0.3514, pruned_loss=0.07178, over 21504.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3088, pruned_loss=0.07667, over 4277557.79 frames. ], batch size: 471, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:16:32,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-24 07:16:44,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1686072.0, ans=0.125 2023-06-24 07:17:06,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1686132.0, ans=0.125 2023-06-24 07:17:10,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1686132.0, ans=0.125 2023-06-24 07:17:15,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1686132.0, ans=0.2 2023-06-24 07:17:18,457 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.101e+02 5.692e+02 8.877e+02 1.428e+03 2.273e+03, threshold=1.775e+03, percent-clipped=0.0 2023-06-24 07:17:21,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1686192.0, ans=0.0 2023-06-24 07:18:03,358 INFO [train.py:996] (2/4) Epoch 10, batch 6600, loss[loss=0.1884, simple_loss=0.2559, pruned_loss=0.06045, over 21814.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3054, pruned_loss=0.07703, over 4257314.27 frames. ], batch size: 98, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:18:23,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1686372.0, ans=0.125 2023-06-24 07:18:40,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1686432.0, ans=0.125 2023-06-24 07:19:02,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1686492.0, ans=0.125 2023-06-24 07:19:17,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1686552.0, ans=0.0 2023-06-24 07:19:20,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1686552.0, ans=0.0 2023-06-24 07:19:37,643 INFO [train.py:996] (2/4) Epoch 10, batch 6650, loss[loss=0.2375, simple_loss=0.2991, pruned_loss=0.0879, over 21588.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2986, pruned_loss=0.07366, over 4258539.56 frames. ], batch size: 391, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:19:41,586 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.45 vs. 
limit=10.0 2023-06-24 07:19:46,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1686612.0, ans=0.0 2023-06-24 07:19:50,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1686612.0, ans=0.0 2023-06-24 07:20:06,448 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.56 vs. limit=15.0 2023-06-24 07:20:31,553 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.031e+02 5.733e+02 1.031e+03 1.471e+03 3.342e+03, threshold=2.062e+03, percent-clipped=12.0 2023-06-24 07:20:57,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1686792.0, ans=0.1 2023-06-24 07:21:03,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1686852.0, ans=0.125 2023-06-24 07:21:08,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1686852.0, ans=0.0 2023-06-24 07:21:15,684 INFO [train.py:996] (2/4) Epoch 10, batch 6700, loss[loss=0.1858, simple_loss=0.2688, pruned_loss=0.05141, over 15841.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2945, pruned_loss=0.07351, over 4259670.37 frames. ], batch size: 60, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:21:51,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-24 07:22:13,744 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.87 vs. limit=15.0 2023-06-24 07:22:44,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1687152.0, ans=0.125 2023-06-24 07:22:53,749 INFO [train.py:996] (2/4) Epoch 10, batch 6750, loss[loss=0.2186, simple_loss=0.291, pruned_loss=0.07312, over 21884.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.2924, pruned_loss=0.074, over 4266386.36 frames. ], batch size: 118, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:22:57,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1687212.0, ans=0.09899494936611666 2023-06-24 07:23:35,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1687332.0, ans=0.125 2023-06-24 07:23:36,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1687332.0, ans=0.125 2023-06-24 07:23:37,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-24 07:23:47,786 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.403e+02 6.480e+02 8.146e+02 1.101e+03 1.861e+03, threshold=1.629e+03, percent-clipped=0.0 2023-06-24 07:23:51,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1687392.0, ans=0.125 2023-06-24 07:23:52,974 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.85 vs. 
limit=15.0 2023-06-24 07:24:08,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1687392.0, ans=0.2 2023-06-24 07:24:32,353 INFO [train.py:996] (2/4) Epoch 10, batch 6800, loss[loss=0.1842, simple_loss=0.2507, pruned_loss=0.05883, over 21249.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.2956, pruned_loss=0.07672, over 4268575.33 frames. ], batch size: 176, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:24:33,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1687512.0, ans=0.125 2023-06-24 07:24:34,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-24 07:24:39,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1687512.0, ans=6.0 2023-06-24 07:24:41,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1687512.0, ans=0.0 2023-06-24 07:25:40,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1687692.0, ans=0.2 2023-06-24 07:25:49,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1687752.0, ans=0.1 2023-06-24 07:25:49,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1687752.0, ans=0.1 2023-06-24 07:26:10,766 INFO [train.py:996] (2/4) Epoch 10, batch 6850, loss[loss=0.241, simple_loss=0.2868, pruned_loss=0.09756, over 21511.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2936, pruned_loss=0.07757, over 4277706.59 frames. ], batch size: 511, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:26:14,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1687812.0, ans=0.0 2023-06-24 07:27:08,124 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.455e+02 6.000e+02 9.900e+02 1.479e+03 3.025e+03, threshold=1.980e+03, percent-clipped=16.0 2023-06-24 07:27:18,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1687992.0, ans=0.125 2023-06-24 07:27:31,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1688052.0, ans=0.0 2023-06-24 07:27:32,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.79 vs. limit=22.5 2023-06-24 07:27:32,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1688052.0, ans=0.2 2023-06-24 07:27:44,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1688052.0, ans=0.125 2023-06-24 07:27:44,612 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.86 vs. 
limit=22.5 2023-06-24 07:27:46,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1688052.0, ans=0.1 2023-06-24 07:27:51,228 INFO [train.py:996] (2/4) Epoch 10, batch 6900, loss[loss=0.2347, simple_loss=0.296, pruned_loss=0.08671, over 21906.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2945, pruned_loss=0.07747, over 4279961.62 frames. ], batch size: 107, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:29:32,981 INFO [train.py:996] (2/4) Epoch 10, batch 6950, loss[loss=0.2667, simple_loss=0.3304, pruned_loss=0.1015, over 21351.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2969, pruned_loss=0.07494, over 4283911.17 frames. ], batch size: 159, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:30:34,214 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.410e+02 6.339e+02 9.938e+02 1.554e+03 2.681e+03, threshold=1.988e+03, percent-clipped=10.0 2023-06-24 07:30:37,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1688592.0, ans=0.125 2023-06-24 07:30:59,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1688652.0, ans=0.2 2023-06-24 07:31:12,643 INFO [train.py:996] (2/4) Epoch 10, batch 7000, loss[loss=0.2635, simple_loss=0.3895, pruned_loss=0.06876, over 19838.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3002, pruned_loss=0.077, over 4272029.72 frames. ], batch size: 702, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:32:28,501 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:32:31,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1688892.0, ans=0.125 2023-06-24 07:32:52,792 INFO [train.py:996] (2/4) Epoch 10, batch 7050, loss[loss=0.173, simple_loss=0.2545, pruned_loss=0.04578, over 21354.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2977, pruned_loss=0.07508, over 4267165.02 frames. ], batch size: 211, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:33:09,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1689012.0, ans=0.0 2023-06-24 07:33:40,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.17 vs. limit=15.0 2023-06-24 07:34:00,081 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.677e+02 8.213e+02 1.284e+03 1.969e+03 3.755e+03, threshold=2.569e+03, percent-clipped=21.0 2023-06-24 07:34:03,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1689192.0, ans=0.125 2023-06-24 07:34:17,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1689252.0, ans=0.0 2023-06-24 07:34:41,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-24 07:34:43,732 INFO [train.py:996] (2/4) Epoch 10, batch 7100, loss[loss=0.267, simple_loss=0.3396, pruned_loss=0.09717, over 21560.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3032, pruned_loss=0.07673, over 4268591.85 frames. 
], batch size: 414, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:34:49,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1689312.0, ans=0.0 2023-06-24 07:34:51,328 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.22 vs. limit=6.0 2023-06-24 07:35:18,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1689372.0, ans=0.125 2023-06-24 07:35:31,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1689432.0, ans=0.125 2023-06-24 07:36:24,351 INFO [train.py:996] (2/4) Epoch 10, batch 7150, loss[loss=0.2351, simple_loss=0.3087, pruned_loss=0.08073, over 21592.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.298, pruned_loss=0.07378, over 4259554.45 frames. ], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:36:46,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-24 07:36:47,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1689672.0, ans=0.2 2023-06-24 07:37:04,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1689732.0, ans=0.125 2023-06-24 07:37:20,920 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.107e+02 6.261e+02 9.228e+02 1.360e+03 3.235e+03, threshold=1.846e+03, percent-clipped=6.0 2023-06-24 07:37:29,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.95 vs. limit=15.0 2023-06-24 07:38:04,263 INFO [train.py:996] (2/4) Epoch 10, batch 7200, loss[loss=0.2106, simple_loss=0.2794, pruned_loss=0.07091, over 21808.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2994, pruned_loss=0.07544, over 4264890.03 frames. ], batch size: 352, lr: 2.98e-03, grad_scale: 32.0 2023-06-24 07:38:48,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1690032.0, ans=0.125 2023-06-24 07:39:10,483 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:39:44,053 INFO [train.py:996] (2/4) Epoch 10, batch 7250, loss[loss=0.2234, simple_loss=0.2819, pruned_loss=0.08247, over 22029.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2962, pruned_loss=0.07626, over 4264744.55 frames. ], batch size: 103, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:40:45,632 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.252e+02 6.130e+02 8.584e+02 1.247e+03 2.821e+03, threshold=1.717e+03, percent-clipped=3.0 2023-06-24 07:40:52,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-24 07:40:53,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1690392.0, ans=0.2 2023-06-24 07:41:17,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.47 vs. 
limit=15.0 2023-06-24 07:41:22,777 INFO [train.py:996] (2/4) Epoch 10, batch 7300, loss[loss=0.212, simple_loss=0.2716, pruned_loss=0.07621, over 21720.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2898, pruned_loss=0.07551, over 4267987.62 frames. ], batch size: 300, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:41:57,691 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.56 vs. limit=15.0 2023-06-24 07:42:27,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1690692.0, ans=0.04949747468305833 2023-06-24 07:42:29,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1690692.0, ans=0.09899494936611666 2023-06-24 07:42:39,935 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:42:41,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1690692.0, ans=0.1 2023-06-24 07:42:43,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1690692.0, ans=0.2 2023-06-24 07:42:58,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1690752.0, ans=0.2 2023-06-24 07:43:08,159 INFO [train.py:996] (2/4) Epoch 10, batch 7350, loss[loss=0.2753, simple_loss=0.3408, pruned_loss=0.105, over 21582.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2895, pruned_loss=0.07629, over 4267127.43 frames. ], batch size: 389, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:44:00,928 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:44:06,751 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.919e+02 7.422e+02 1.084e+03 1.496e+03 4.269e+03, threshold=2.168e+03, percent-clipped=20.0 2023-06-24 07:44:22,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1690992.0, ans=0.0 2023-06-24 07:44:28,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1691052.0, ans=0.125 2023-06-24 07:44:28,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1691052.0, ans=0.0 2023-06-24 07:44:45,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1691052.0, ans=0.125 2023-06-24 07:44:49,768 INFO [train.py:996] (2/4) Epoch 10, batch 7400, loss[loss=0.267, simple_loss=0.3329, pruned_loss=0.1005, over 21331.00 frames. ], tot_loss[loss=0.229, simple_loss=0.2986, pruned_loss=0.0797, over 4273424.53 frames. 
], batch size: 159, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:44:57,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1691112.0, ans=0.1 2023-06-24 07:45:06,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1691112.0, ans=0.125 2023-06-24 07:45:48,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1691232.0, ans=0.1 2023-06-24 07:46:00,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1691292.0, ans=0.0 2023-06-24 07:46:09,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1691352.0, ans=0.0 2023-06-24 07:46:16,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1691352.0, ans=0.025 2023-06-24 07:46:29,127 INFO [train.py:996] (2/4) Epoch 10, batch 7450, loss[loss=0.2385, simple_loss=0.2989, pruned_loss=0.08911, over 21524.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2965, pruned_loss=0.07787, over 4271968.73 frames. ], batch size: 263, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:47:15,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1691532.0, ans=0.0 2023-06-24 07:47:23,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1691532.0, ans=0.125 2023-06-24 07:47:32,869 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.371e+02 6.014e+02 8.284e+02 1.438e+03 2.557e+03, threshold=1.657e+03, percent-clipped=4.0 2023-06-24 07:48:04,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=10.0 2023-06-24 07:48:15,228 INFO [train.py:996] (2/4) Epoch 10, batch 7500, loss[loss=0.238, simple_loss=0.3237, pruned_loss=0.07611, over 21433.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3013, pruned_loss=0.07929, over 4269845.48 frames. ], batch size: 131, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:48:38,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1691772.0, ans=0.2 2023-06-24 07:48:52,108 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-24 07:49:36,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1691952.0, ans=0.125 2023-06-24 07:49:56,497 INFO [train.py:996] (2/4) Epoch 10, batch 7550, loss[loss=0.1916, simple_loss=0.2852, pruned_loss=0.04897, over 21626.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.309, pruned_loss=0.07809, over 4276889.08 frames. 
], batch size: 230, lr: 2.98e-03, grad_scale: 16.0 2023-06-24 07:50:06,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1692012.0, ans=0.0 2023-06-24 07:50:40,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1692132.0, ans=0.125 2023-06-24 07:50:52,204 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.375e+02 7.111e+02 1.164e+03 1.789e+03 2.953e+03, threshold=2.328e+03, percent-clipped=32.0 2023-06-24 07:51:00,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1692192.0, ans=0.0 2023-06-24 07:51:18,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1692252.0, ans=0.125 2023-06-24 07:51:34,167 INFO [train.py:996] (2/4) Epoch 10, batch 7600, loss[loss=0.1879, simple_loss=0.258, pruned_loss=0.05892, over 16653.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3076, pruned_loss=0.07688, over 4280925.87 frames. ], batch size: 60, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:51:50,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1692312.0, ans=0.1 2023-06-24 07:52:10,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.83 vs. limit=15.0 2023-06-24 07:52:52,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1692552.0, ans=0.125 2023-06-24 07:53:07,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1692612.0, ans=0.125 2023-06-24 07:53:09,042 INFO [train.py:996] (2/4) Epoch 10, batch 7650, loss[loss=0.235, simple_loss=0.2978, pruned_loss=0.08608, over 21829.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3061, pruned_loss=0.07781, over 4283555.52 frames. ], batch size: 247, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:53:25,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-24 07:53:36,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1692672.0, ans=0.125 2023-06-24 07:53:50,106 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.68 vs. limit=10.0 2023-06-24 07:54:11,425 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.476e+02 5.649e+02 6.864e+02 9.584e+02 1.979e+03, threshold=1.373e+03, percent-clipped=0.0 2023-06-24 07:54:20,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1692792.0, ans=0.0 2023-06-24 07:54:23,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1692792.0, ans=0.125 2023-06-24 07:54:29,410 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.60 vs. limit=5.0 2023-06-24 07:54:58,720 INFO [train.py:996] (2/4) Epoch 10, batch 7700, loss[loss=0.2661, simple_loss=0.3368, pruned_loss=0.09768, over 21794.00 frames. 
], tot_loss[loss=0.2349, simple_loss=0.3084, pruned_loss=0.08067, over 4284773.67 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 07:55:06,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1692912.0, ans=0.0 2023-06-24 07:55:11,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1692912.0, ans=0.125 2023-06-24 07:55:35,160 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:55:41,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1693032.0, ans=0.125 2023-06-24 07:56:18,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1693152.0, ans=0.125 2023-06-24 07:56:40,744 INFO [train.py:996] (2/4) Epoch 10, batch 7750, loss[loss=0.1613, simple_loss=0.224, pruned_loss=0.04933, over 17208.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3124, pruned_loss=0.08056, over 4284033.59 frames. ], batch size: 62, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 07:56:43,000 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 07:57:42,754 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.643e+02 8.175e+02 1.275e+03 1.822e+03 5.282e+03, threshold=2.550e+03, percent-clipped=41.0 2023-06-24 07:58:21,722 INFO [train.py:996] (2/4) Epoch 10, batch 7800, loss[loss=0.2048, simple_loss=0.2723, pruned_loss=0.06871, over 21473.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.315, pruned_loss=0.08113, over 4281850.01 frames. ], batch size: 195, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 07:59:20,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-24 08:00:00,766 INFO [train.py:996] (2/4) Epoch 10, batch 7850, loss[loss=0.2194, simple_loss=0.2815, pruned_loss=0.07869, over 21591.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3083, pruned_loss=0.08025, over 4281369.29 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:00:28,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1693872.0, ans=0.2 2023-06-24 08:00:41,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=15.0 2023-06-24 08:00:43,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1693932.0, ans=0.2 2023-06-24 08:01:02,399 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.400e+02 8.758e+02 1.300e+03 4.376e+03, threshold=1.752e+03, percent-clipped=3.0 2023-06-24 08:01:17,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1693992.0, ans=0.04949747468305833 2023-06-24 08:01:41,732 INFO [train.py:996] (2/4) Epoch 10, batch 7900, loss[loss=0.297, simple_loss=0.3865, pruned_loss=0.1038, over 21445.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3057, pruned_loss=0.08057, over 4276818.04 frames. 
], batch size: 507, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:01:54,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1694112.0, ans=0.1 2023-06-24 08:02:18,434 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.54 vs. limit=15.0 2023-06-24 08:03:03,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1694292.0, ans=0.2 2023-06-24 08:03:28,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1694412.0, ans=0.0 2023-06-24 08:03:29,352 INFO [train.py:996] (2/4) Epoch 10, batch 7950, loss[loss=0.2545, simple_loss=0.3267, pruned_loss=0.09121, over 21822.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3121, pruned_loss=0.08064, over 4274315.89 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 8.0 2023-06-24 08:04:36,159 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.376e+02 7.152e+02 9.880e+02 1.480e+03 4.841e+03, threshold=1.976e+03, percent-clipped=16.0 2023-06-24 08:04:43,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1694592.0, ans=0.125 2023-06-24 08:04:58,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1694652.0, ans=0.1 2023-06-24 08:05:16,251 INFO [train.py:996] (2/4) Epoch 10, batch 8000, loss[loss=0.2304, simple_loss=0.3224, pruned_loss=0.06918, over 21867.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3169, pruned_loss=0.08255, over 4264878.68 frames. ], batch size: 316, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:05:33,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1694712.0, ans=0.125 2023-06-24 08:05:34,094 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.49 vs. limit=15.0 2023-06-24 08:05:45,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1694772.0, ans=10.0 2023-06-24 08:06:00,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1694832.0, ans=0.2 2023-06-24 08:06:20,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-24 08:06:30,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1694892.0, ans=0.5 2023-06-24 08:07:06,331 INFO [train.py:996] (2/4) Epoch 10, batch 8050, loss[loss=0.1604, simple_loss=0.1973, pruned_loss=0.06172, over 16284.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3208, pruned_loss=0.08301, over 4260656.66 frames. 
], batch size: 60, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:07:09,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1695012.0, ans=0.0 2023-06-24 08:08:07,470 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.312e+02 7.120e+02 8.807e+02 1.513e+03 2.630e+03, threshold=1.761e+03, percent-clipped=8.0 2023-06-24 08:08:46,515 INFO [train.py:996] (2/4) Epoch 10, batch 8100, loss[loss=0.2392, simple_loss=0.3079, pruned_loss=0.08521, over 21834.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3201, pruned_loss=0.08393, over 4268864.38 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:10:05,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1695492.0, ans=0.125 2023-06-24 08:10:36,952 INFO [train.py:996] (2/4) Epoch 10, batch 8150, loss[loss=0.2985, simple_loss=0.4, pruned_loss=0.09848, over 21703.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3294, pruned_loss=0.08485, over 4270526.19 frames. ], batch size: 414, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:11:03,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1695672.0, ans=0.1 2023-06-24 08:11:43,629 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.039e+02 7.483e+02 1.136e+03 1.755e+03 5.961e+03, threshold=2.271e+03, percent-clipped=24.0 2023-06-24 08:12:17,681 INFO [train.py:996] (2/4) Epoch 10, batch 8200, loss[loss=0.2291, simple_loss=0.2781, pruned_loss=0.09002, over 21436.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3198, pruned_loss=0.08184, over 4268051.69 frames. ], batch size: 160, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:13:22,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1696092.0, ans=0.125 2023-06-24 08:13:24,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.77 vs. limit=15.0 2023-06-24 08:13:45,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.98 vs. limit=15.0 2023-06-24 08:13:57,375 INFO [train.py:996] (2/4) Epoch 10, batch 8250, loss[loss=0.23, simple_loss=0.3279, pruned_loss=0.06601, over 21807.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3145, pruned_loss=0.08067, over 4261995.62 frames. ], batch size: 316, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:13:59,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-06-24 08:14:20,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.09 vs. 
limit=15.0 2023-06-24 08:15:04,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.899e+02 6.467e+02 8.535e+02 1.267e+03 3.280e+03, threshold=1.707e+03, percent-clipped=4.0 2023-06-24 08:15:05,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1696392.0, ans=0.2 2023-06-24 08:15:31,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1696452.0, ans=0.1 2023-06-24 08:15:38,162 INFO [train.py:996] (2/4) Epoch 10, batch 8300, loss[loss=0.2048, simple_loss=0.2813, pruned_loss=0.06412, over 21335.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3141, pruned_loss=0.07813, over 4264099.82 frames. ], batch size: 176, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:15:51,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1696512.0, ans=10.0 2023-06-24 08:16:31,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1696632.0, ans=0.2 2023-06-24 08:17:07,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1696752.0, ans=0.1 2023-06-24 08:17:18,875 INFO [train.py:996] (2/4) Epoch 10, batch 8350, loss[loss=0.2476, simple_loss=0.3265, pruned_loss=0.08437, over 20744.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3129, pruned_loss=0.076, over 4262277.44 frames. ], batch size: 607, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:18:09,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1696932.0, ans=0.0 2023-06-24 08:18:29,880 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 6.071e+02 7.237e+02 1.103e+03 3.221e+03, threshold=1.447e+03, percent-clipped=5.0 2023-06-24 08:18:37,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1696992.0, ans=0.0 2023-06-24 08:18:41,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1697052.0, ans=0.2 2023-06-24 08:18:46,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1697052.0, ans=0.1 2023-06-24 08:18:51,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1697052.0, ans=0.125 2023-06-24 08:18:59,492 INFO [train.py:996] (2/4) Epoch 10, batch 8400, loss[loss=0.1819, simple_loss=0.2651, pruned_loss=0.04933, over 21160.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3088, pruned_loss=0.07349, over 4265218.70 frames. ], batch size: 176, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:19:09,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-24 08:19:12,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1697112.0, ans=0.0 2023-06-24 08:20:39,132 INFO [train.py:996] (2/4) Epoch 10, batch 8450, loss[loss=0.2033, simple_loss=0.2777, pruned_loss=0.0644, over 21858.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3075, pruned_loss=0.07245, over 4268667.49 frames. 
], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:20:48,591 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.17 vs. limit=12.0 2023-06-24 08:21:29,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1697532.0, ans=0.125 2023-06-24 08:21:51,745 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.782e+02 9.053e+02 1.298e+03 1.951e+03 3.847e+03, threshold=2.596e+03, percent-clipped=39.0 2023-06-24 08:22:13,265 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=22.5 2023-06-24 08:22:23,918 INFO [train.py:996] (2/4) Epoch 10, batch 8500, loss[loss=0.2745, simple_loss=0.3202, pruned_loss=0.1144, over 21385.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3051, pruned_loss=0.07462, over 4272533.86 frames. ], batch size: 508, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:22:25,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1697712.0, ans=0.95 2023-06-24 08:22:48,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1697772.0, ans=0.125 2023-06-24 08:24:03,932 INFO [train.py:996] (2/4) Epoch 10, batch 8550, loss[loss=0.2479, simple_loss=0.3248, pruned_loss=0.0855, over 21447.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3103, pruned_loss=0.07741, over 4272775.91 frames. ], batch size: 194, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:24:23,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1698072.0, ans=0.035 2023-06-24 08:24:51,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1698132.0, ans=0.125 2023-06-24 08:25:09,874 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.596e+02 7.231e+02 1.146e+03 1.740e+03 4.216e+03, threshold=2.291e+03, percent-clipped=13.0 2023-06-24 08:25:23,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-24 08:25:48,536 INFO [train.py:996] (2/4) Epoch 10, batch 8600, loss[loss=0.2492, simple_loss=0.3266, pruned_loss=0.08589, over 21687.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3168, pruned_loss=0.07998, over 4266834.03 frames. ], batch size: 351, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:26:25,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1698372.0, ans=0.1 2023-06-24 08:26:31,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.33 vs. limit=15.0 2023-06-24 08:27:00,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1698492.0, ans=0.2 2023-06-24 08:27:28,160 INFO [train.py:996] (2/4) Epoch 10, batch 8650, loss[loss=0.1649, simple_loss=0.2628, pruned_loss=0.0335, over 21652.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3218, pruned_loss=0.08048, over 4277533.42 frames. 
], batch size: 230, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:27:39,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1698612.0, ans=0.125 2023-06-24 08:28:28,423 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.818e+02 6.800e+02 1.041e+03 1.467e+03 2.492e+03, threshold=2.082e+03, percent-clipped=1.0 2023-06-24 08:28:52,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.77 vs. limit=12.0 2023-06-24 08:28:57,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1698852.0, ans=0.125 2023-06-24 08:28:59,766 INFO [train.py:996] (2/4) Epoch 10, batch 8700, loss[loss=0.2151, simple_loss=0.2676, pruned_loss=0.08134, over 20130.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3129, pruned_loss=0.0773, over 4270219.41 frames. ], batch size: 704, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:29:00,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1698912.0, ans=0.125 2023-06-24 08:30:20,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1699152.0, ans=0.2 2023-06-24 08:30:24,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.48 vs. limit=15.0 2023-06-24 08:30:46,986 INFO [train.py:996] (2/4) Epoch 10, batch 8750, loss[loss=0.2256, simple_loss=0.2916, pruned_loss=0.07984, over 21772.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3084, pruned_loss=0.07857, over 4273538.48 frames. ], batch size: 298, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:31:05,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1699212.0, ans=0.0 2023-06-24 08:31:07,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699272.0, ans=0.1 2023-06-24 08:31:29,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1699332.0, ans=0.0 2023-06-24 08:31:34,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1699332.0, ans=0.95 2023-06-24 08:31:50,103 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.900e+02 7.163e+02 1.016e+03 1.538e+03 3.044e+03, threshold=2.032e+03, percent-clipped=7.0 2023-06-24 08:32:32,734 INFO [train.py:996] (2/4) Epoch 10, batch 8800, loss[loss=0.2503, simple_loss=0.3523, pruned_loss=0.07413, over 19844.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3156, pruned_loss=0.08159, over 4274781.18 frames. 
], batch size: 702, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:32:46,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699512.0, ans=0.1 2023-06-24 08:33:48,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699692.0, ans=0.1 2023-06-24 08:33:52,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1699752.0, ans=0.1 2023-06-24 08:34:04,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1699752.0, ans=0.125 2023-06-24 08:34:07,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1699752.0, ans=0.2 2023-06-24 08:34:18,595 INFO [train.py:996] (2/4) Epoch 10, batch 8850, loss[loss=0.2769, simple_loss=0.3633, pruned_loss=0.09523, over 21340.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3224, pruned_loss=0.08362, over 4273272.26 frames. ], batch size: 159, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:34:28,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1699812.0, ans=0.125 2023-06-24 08:34:30,190 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.72 vs. limit=12.0 2023-06-24 08:34:46,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1699872.0, ans=0.125 2023-06-24 08:35:17,532 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.300e+02 6.165e+02 8.058e+02 1.044e+03 1.938e+03, threshold=1.612e+03, percent-clipped=0.0 2023-06-24 08:35:38,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1700052.0, ans=0.0 2023-06-24 08:35:41,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1700052.0, ans=0.07 2023-06-24 08:35:59,441 INFO [train.py:996] (2/4) Epoch 10, batch 8900, loss[loss=0.2528, simple_loss=0.3032, pruned_loss=0.1013, over 21268.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3157, pruned_loss=0.08257, over 4275424.35 frames. ], batch size: 471, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:36:03,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1700112.0, ans=0.2 2023-06-24 08:36:20,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1700172.0, ans=0.1 2023-06-24 08:36:48,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1700232.0, ans=0.0 2023-06-24 08:37:43,122 INFO [train.py:996] (2/4) Epoch 10, batch 8950, loss[loss=0.2779, simple_loss=0.3874, pruned_loss=0.08421, over 19814.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3188, pruned_loss=0.0812, over 4271838.01 frames. 
], batch size: 702, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:37:53,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1700412.0, ans=0.1 2023-06-24 08:38:32,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1700532.0, ans=10.0 2023-06-24 08:38:56,967 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.364e+02 1.062e+03 1.597e+03 2.316e+03 4.236e+03, threshold=3.193e+03, percent-clipped=50.0 2023-06-24 08:39:22,925 INFO [train.py:996] (2/4) Epoch 10, batch 9000, loss[loss=0.2014, simple_loss=0.2699, pruned_loss=0.0664, over 21670.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3129, pruned_loss=0.08063, over 4276229.96 frames. ], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:39:22,926 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 08:39:36,749 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.4138, 1.7023, 2.0959, 1.6494, 1.1860, 2.0193, 2.1658, 1.3235], device='cuda:2') 2023-06-24 08:39:39,592 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2679, simple_loss=0.3599, pruned_loss=0.08793, over 1796401.00 frames. 2023-06-24 08:39:39,593 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 08:39:46,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1700712.0, ans=0.0 2023-06-24 08:40:48,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1700892.0, ans=0.125 2023-06-24 08:40:51,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.09 vs. limit=15.0 2023-06-24 08:41:06,760 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1700952.0, ans=0.09899494936611666 2023-06-24 08:41:18,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1700952.0, ans=0.0 2023-06-24 08:41:21,493 INFO [train.py:996] (2/4) Epoch 10, batch 9050, loss[loss=0.2306, simple_loss=0.312, pruned_loss=0.0746, over 21676.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3065, pruned_loss=0.07723, over 4278936.84 frames. ], batch size: 351, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:41:56,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1701072.0, ans=0.125 2023-06-24 08:42:37,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.440e+02 8.915e+02 1.381e+03 2.151e+03 3.467e+03, threshold=2.763e+03, percent-clipped=3.0 2023-06-24 08:43:08,678 INFO [train.py:996] (2/4) Epoch 10, batch 9100, loss[loss=0.2347, simple_loss=0.3224, pruned_loss=0.07351, over 21455.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3119, pruned_loss=0.07924, over 4279575.77 frames. ], batch size: 131, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:43:58,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.54 vs. 
limit=22.5 2023-06-24 08:44:13,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1701492.0, ans=0.1 2023-06-24 08:44:28,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1701552.0, ans=0.2 2023-06-24 08:44:49,835 INFO [train.py:996] (2/4) Epoch 10, batch 9150, loss[loss=0.259, simple_loss=0.355, pruned_loss=0.08148, over 21651.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3171, pruned_loss=0.07734, over 4279531.47 frames. ], batch size: 389, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:45:59,083 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.914e+02 7.155e+02 1.028e+03 1.622e+03 3.048e+03, threshold=2.056e+03, percent-clipped=2.0 2023-06-24 08:46:11,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1701852.0, ans=0.125 2023-06-24 08:46:40,275 INFO [train.py:996] (2/4) Epoch 10, batch 9200, loss[loss=0.2775, simple_loss=0.3582, pruned_loss=0.09839, over 21472.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3185, pruned_loss=0.07636, over 4272415.74 frames. ], batch size: 507, lr: 2.97e-03, grad_scale: 32.0 2023-06-24 08:46:53,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1701912.0, ans=15.0 2023-06-24 08:47:26,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1702032.0, ans=0.0 2023-06-24 08:48:03,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1702152.0, ans=0.2 2023-06-24 08:48:20,289 INFO [train.py:996] (2/4) Epoch 10, batch 9250, loss[loss=0.2239, simple_loss=0.2836, pruned_loss=0.08213, over 21191.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3197, pruned_loss=0.07955, over 4280658.00 frames. ], batch size: 143, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:48:21,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.93 vs. limit=22.5 2023-06-24 08:48:54,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1702272.0, ans=0.0 2023-06-24 08:49:01,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1702332.0, ans=0.1 2023-06-24 08:49:21,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.372e+02 7.455e+02 9.438e+02 1.547e+03 2.905e+03, threshold=1.888e+03, percent-clipped=9.0 2023-06-24 08:49:22,497 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.12 vs. limit=15.0 2023-06-24 08:49:28,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702452.0, ans=0.1 2023-06-24 08:50:06,165 INFO [train.py:996] (2/4) Epoch 10, batch 9300, loss[loss=0.24, simple_loss=0.3012, pruned_loss=0.08937, over 21242.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3147, pruned_loss=0.07999, over 4281343.01 frames. 
], batch size: 159, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:50:22,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-24 08:50:23,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-24 08:51:02,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten.whitening_limit, batch_count=1702692.0, ans=22.5 2023-06-24 08:51:27,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1702752.0, ans=0.1 2023-06-24 08:51:30,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1702752.0, ans=0.125 2023-06-24 08:51:43,754 INFO [train.py:996] (2/4) Epoch 10, batch 9350, loss[loss=0.3107, simple_loss=0.3756, pruned_loss=0.1229, over 21793.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3216, pruned_loss=0.08047, over 4285052.28 frames. ], batch size: 441, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:51:59,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1702872.0, ans=0.0 2023-06-24 08:52:04,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.55 vs. limit=15.0 2023-06-24 08:52:41,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1702992.0, ans=0.125 2023-06-24 08:53:01,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.560e+02 6.681e+02 9.424e+02 1.664e+03 4.543e+03, threshold=1.885e+03, percent-clipped=14.0 2023-06-24 08:53:19,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1703052.0, ans=0.125 2023-06-24 08:53:25,338 INFO [train.py:996] (2/4) Epoch 10, batch 9400, loss[loss=0.2279, simple_loss=0.2834, pruned_loss=0.08626, over 21709.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3218, pruned_loss=0.08058, over 4282727.60 frames. ], batch size: 112, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:53:39,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1703172.0, ans=0.125 2023-06-24 08:53:47,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1703172.0, ans=0.125 2023-06-24 08:55:04,511 INFO [train.py:996] (2/4) Epoch 10, batch 9450, loss[loss=0.1852, simple_loss=0.2512, pruned_loss=0.05964, over 21669.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3138, pruned_loss=0.08016, over 4273738.76 frames. 
], batch size: 282, lr: 2.97e-03, grad_scale: 16.0 2023-06-24 08:55:15,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1703412.0, ans=0.125 2023-06-24 08:55:33,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1703472.0, ans=0.0 2023-06-24 08:55:56,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1703532.0, ans=0.125 2023-06-24 08:56:12,428 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=12.72 vs. limit=15.0 2023-06-24 08:56:19,383 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.658e+02 7.388e+02 1.013e+03 1.627e+03 3.415e+03, threshold=2.026e+03, percent-clipped=14.0 2023-06-24 08:56:28,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1703652.0, ans=0.0 2023-06-24 08:56:35,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1703652.0, ans=0.125 2023-06-24 08:56:37,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1703652.0, ans=0.0 2023-06-24 08:56:42,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-24 08:56:43,442 INFO [train.py:996] (2/4) Epoch 10, batch 9500, loss[loss=0.1859, simple_loss=0.2748, pruned_loss=0.04849, over 21803.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3061, pruned_loss=0.07821, over 4270528.98 frames. ], batch size: 316, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 08:57:04,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1703772.0, ans=0.07 2023-06-24 08:57:48,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-24 08:57:55,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1703892.0, ans=0.2 2023-06-24 08:58:13,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1703952.0, ans=0.125 2023-06-24 08:58:20,278 INFO [train.py:996] (2/4) Epoch 10, batch 9550, loss[loss=0.2467, simple_loss=0.3323, pruned_loss=0.08057, over 21298.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.309, pruned_loss=0.07936, over 4271284.92 frames. 
], batch size: 548, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 08:58:29,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1704012.0, ans=0.2 2023-06-24 08:58:45,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1704072.0, ans=0.1 2023-06-24 08:59:23,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1704192.0, ans=0.1 2023-06-24 08:59:33,963 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.417e+02 6.895e+02 1.023e+03 1.419e+03 2.349e+03, threshold=2.046e+03, percent-clipped=3.0 2023-06-24 08:59:51,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1704252.0, ans=0.0 2023-06-24 08:59:57,568 INFO [train.py:996] (2/4) Epoch 10, batch 9600, loss[loss=0.2128, simple_loss=0.2897, pruned_loss=0.06794, over 21946.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3109, pruned_loss=0.0802, over 4278000.65 frames. ], batch size: 316, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 08:59:57,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1704312.0, ans=0.0 2023-06-24 09:01:17,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1704492.0, ans=6.0 2023-06-24 09:01:29,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.84 vs. limit=22.5 2023-06-24 09:01:37,634 INFO [train.py:996] (2/4) Epoch 10, batch 9650, loss[loss=0.2651, simple_loss=0.3363, pruned_loss=0.09692, over 21789.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3116, pruned_loss=0.08064, over 4281229.57 frames. ], batch size: 441, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:02:03,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1704672.0, ans=0.0 2023-06-24 09:02:06,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1704672.0, ans=0.125 2023-06-24 09:02:18,917 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.15 vs. limit=22.5 2023-06-24 09:02:37,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1704732.0, ans=0.125 2023-06-24 09:02:45,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1704792.0, ans=0.125 2023-06-24 09:02:50,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1704792.0, ans=0.125 2023-06-24 09:02:55,063 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.754e+02 6.745e+02 1.012e+03 1.363e+03 3.649e+03, threshold=2.025e+03, percent-clipped=11.0 2023-06-24 09:03:17,695 INFO [train.py:996] (2/4) Epoch 10, batch 9700, loss[loss=0.2508, simple_loss=0.3227, pruned_loss=0.08942, over 21825.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3177, pruned_loss=0.08326, over 4287162.57 frames. 
], batch size: 371, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:03:18,888 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2023-06-24 09:03:21,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1704912.0, ans=0.0 2023-06-24 09:04:20,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1705032.0, ans=0.04949747468305833 2023-06-24 09:04:34,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1705092.0, ans=0.125 2023-06-24 09:04:47,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1705152.0, ans=0.125 2023-06-24 09:04:50,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1705152.0, ans=0.2 2023-06-24 09:04:55,675 INFO [train.py:996] (2/4) Epoch 10, batch 9750, loss[loss=0.2012, simple_loss=0.2617, pruned_loss=0.07034, over 21518.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3116, pruned_loss=0.08108, over 4273857.15 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:05:58,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1705332.0, ans=0.0 2023-06-24 09:06:10,517 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.138e+02 7.196e+02 1.029e+03 1.714e+03 4.123e+03, threshold=2.059e+03, percent-clipped=13.0 2023-06-24 09:06:28,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1705452.0, ans=0.125 2023-06-24 09:06:30,451 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=22.5 2023-06-24 09:06:32,715 INFO [train.py:996] (2/4) Epoch 10, batch 9800, loss[loss=0.217, simple_loss=0.2994, pruned_loss=0.06732, over 21804.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3084, pruned_loss=0.08063, over 4269187.79 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:06:40,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1705512.0, ans=0.0 2023-06-24 09:07:45,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1705692.0, ans=0.5 2023-06-24 09:07:46,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-24 09:07:47,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1705692.0, ans=0.125 2023-06-24 09:07:59,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1705752.0, ans=0.0 2023-06-24 09:08:09,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.78 vs. limit=22.5 2023-06-24 09:08:10,613 INFO [train.py:996] (2/4) Epoch 10, batch 9850, loss[loss=0.2158, simple_loss=0.2847, pruned_loss=0.07348, over 21841.00 frames. 
], tot_loss[loss=0.2337, simple_loss=0.3065, pruned_loss=0.08043, over 4272627.47 frames. ], batch size: 107, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:08:11,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1705812.0, ans=0.1 2023-06-24 09:08:31,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1705872.0, ans=0.1 2023-06-24 09:09:11,582 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.58 vs. limit=22.5 2023-06-24 09:09:19,109 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:09:26,564 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 7.595e+02 1.021e+03 1.343e+03 2.731e+03, threshold=2.043e+03, percent-clipped=9.0 2023-06-24 09:09:36,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1706052.0, ans=0.04949747468305833 2023-06-24 09:09:49,324 INFO [train.py:996] (2/4) Epoch 10, batch 9900, loss[loss=0.2326, simple_loss=0.3322, pruned_loss=0.06646, over 19969.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3032, pruned_loss=0.08027, over 4269343.41 frames. ], batch size: 703, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:09:50,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-24 09:10:46,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1706232.0, ans=0.1 2023-06-24 09:11:27,325 INFO [train.py:996] (2/4) Epoch 10, batch 9950, loss[loss=0.2253, simple_loss=0.2752, pruned_loss=0.08772, over 21541.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3037, pruned_loss=0.08185, over 4266594.51 frames. ], batch size: 230, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:11:38,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1706412.0, ans=0.125 2023-06-24 09:11:41,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1706412.0, ans=0.0 2023-06-24 09:12:27,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1706532.0, ans=0.0 2023-06-24 09:12:44,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1706592.0, ans=22.5 2023-06-24 09:12:44,772 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 7.241e+02 1.069e+03 1.520e+03 2.876e+03, threshold=2.138e+03, percent-clipped=9.0 2023-06-24 09:12:55,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1706652.0, ans=0.05 2023-06-24 09:13:12,889 INFO [train.py:996] (2/4) Epoch 10, batch 10000, loss[loss=0.2059, simple_loss=0.2656, pruned_loss=0.07309, over 19961.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.2995, pruned_loss=0.08043, over 4266898.25 frames. 
], batch size: 703, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:13:31,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1706712.0, ans=0.2 2023-06-24 09:14:08,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1706832.0, ans=0.125 2023-06-24 09:14:18,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1706892.0, ans=0.09899494936611666 2023-06-24 09:14:24,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1706892.0, ans=0.2 2023-06-24 09:14:45,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1706952.0, ans=0.125 2023-06-24 09:14:49,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.26 vs. limit=15.0 2023-06-24 09:14:54,903 INFO [train.py:996] (2/4) Epoch 10, batch 10050, loss[loss=0.279, simple_loss=0.3925, pruned_loss=0.08275, over 19889.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3025, pruned_loss=0.08094, over 4268579.82 frames. ], batch size: 703, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:15:27,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1707072.0, ans=0.125 2023-06-24 09:16:11,745 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.316e+02 6.781e+02 9.769e+02 1.554e+03 3.220e+03, threshold=1.954e+03, percent-clipped=12.0 2023-06-24 09:16:30,117 INFO [train.py:996] (2/4) Epoch 10, batch 10100, loss[loss=0.2114, simple_loss=0.2853, pruned_loss=0.06875, over 21705.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.2993, pruned_loss=0.07872, over 4273249.05 frames. ], batch size: 247, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:17:05,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1707372.0, ans=0.125 2023-06-24 09:17:07,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1707372.0, ans=15.0 2023-06-24 09:17:25,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1707432.0, ans=0.125 2023-06-24 09:17:34,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.13 vs. limit=22.5 2023-06-24 09:17:47,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1707492.0, ans=0.07 2023-06-24 09:17:56,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1707552.0, ans=0.035 2023-06-24 09:18:13,786 INFO [train.py:996] (2/4) Epoch 10, batch 10150, loss[loss=0.2132, simple_loss=0.2771, pruned_loss=0.07462, over 21830.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3056, pruned_loss=0.08168, over 4275289.23 frames. 
], batch size: 107, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:18:20,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1707612.0, ans=0.125 2023-06-24 09:18:36,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1707672.0, ans=0.125 2023-06-24 09:18:50,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-24 09:18:54,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1707732.0, ans=0.0 2023-06-24 09:19:05,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1707792.0, ans=0.125 2023-06-24 09:19:25,304 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.627e+02 7.106e+02 9.650e+02 1.431e+03 2.478e+03, threshold=1.930e+03, percent-clipped=8.0 2023-06-24 09:19:53,995 INFO [train.py:996] (2/4) Epoch 10, batch 10200, loss[loss=0.1915, simple_loss=0.276, pruned_loss=0.05356, over 20778.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.307, pruned_loss=0.07994, over 4266394.44 frames. ], batch size: 607, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:20:12,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1707912.0, ans=0.125 2023-06-24 09:20:21,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1707972.0, ans=0.05 2023-06-24 09:20:41,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1708032.0, ans=0.0 2023-06-24 09:21:02,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.56 vs. limit=22.5 2023-06-24 09:21:29,512 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.98 vs. limit=15.0 2023-06-24 09:21:33,127 INFO [train.py:996] (2/4) Epoch 10, batch 10250, loss[loss=0.2, simple_loss=0.2963, pruned_loss=0.05189, over 21451.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3036, pruned_loss=0.07523, over 4264467.24 frames. ], batch size: 471, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:22:04,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1708272.0, ans=0.2 2023-06-24 09:22:45,997 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.201e+02 5.691e+02 8.251e+02 1.364e+03 2.412e+03, threshold=1.650e+03, percent-clipped=9.0 2023-06-24 09:23:21,830 INFO [train.py:996] (2/4) Epoch 10, batch 10300, loss[loss=0.1826, simple_loss=0.2678, pruned_loss=0.04863, over 20767.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3048, pruned_loss=0.07583, over 4260693.08 frames. 
], batch size: 607, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:23:34,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1708512.0, ans=0.2 2023-06-24 09:23:40,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1708572.0, ans=0.1 2023-06-24 09:24:12,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1708632.0, ans=0.125 2023-06-24 09:24:20,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1708692.0, ans=0.0 2023-06-24 09:24:38,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1708692.0, ans=0.1 2023-06-24 09:24:46,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1708752.0, ans=0.125 2023-06-24 09:24:48,366 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.78 vs. limit=22.5 2023-06-24 09:25:03,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.14 vs. limit=15.0 2023-06-24 09:25:04,351 INFO [train.py:996] (2/4) Epoch 10, batch 10350, loss[loss=0.1827, simple_loss=0.2449, pruned_loss=0.06025, over 21205.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3063, pruned_loss=0.07598, over 4255141.89 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:26:04,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1708992.0, ans=0.2 2023-06-24 09:26:21,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.286e+02 6.885e+02 1.073e+03 1.600e+03 3.112e+03, threshold=2.146e+03, percent-clipped=24.0 2023-06-24 09:26:30,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1709052.0, ans=0.125 2023-06-24 09:26:41,137 INFO [train.py:996] (2/4) Epoch 10, batch 10400, loss[loss=0.1626, simple_loss=0.2185, pruned_loss=0.0534, over 21264.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.303, pruned_loss=0.07515, over 4259429.17 frames. ], batch size: 159, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:27:03,070 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.93 vs. limit=15.0 2023-06-24 09:27:08,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1709172.0, ans=0.0 2023-06-24 09:27:40,282 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-24 09:28:17,467 INFO [train.py:996] (2/4) Epoch 10, batch 10450, loss[loss=0.2484, simple_loss=0.3403, pruned_loss=0.07829, over 20676.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3071, pruned_loss=0.07716, over 4265095.89 frames. 
], batch size: 607, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:28:36,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1709412.0, ans=0.125 2023-06-24 09:29:15,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1709532.0, ans=0.0 2023-06-24 09:29:37,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.221e+02 7.690e+02 1.222e+03 1.916e+03 3.478e+03, threshold=2.445e+03, percent-clipped=16.0 2023-06-24 09:29:55,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1709712.0, ans=0.125 2023-06-24 09:29:56,543 INFO [train.py:996] (2/4) Epoch 10, batch 10500, loss[loss=0.2594, simple_loss=0.3177, pruned_loss=0.1005, over 21833.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3075, pruned_loss=0.07576, over 4255915.14 frames. ], batch size: 98, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:30:25,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.94 vs. limit=15.0 2023-06-24 09:30:47,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1709832.0, ans=0.2 2023-06-24 09:30:47,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1709832.0, ans=0.2 2023-06-24 09:30:58,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1709832.0, ans=0.0 2023-06-24 09:31:29,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1709952.0, ans=0.125 2023-06-24 09:31:34,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1710012.0, ans=0.125 2023-06-24 09:31:35,601 INFO [train.py:996] (2/4) Epoch 10, batch 10550, loss[loss=0.2233, simple_loss=0.2883, pruned_loss=0.07916, over 21618.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.3025, pruned_loss=0.07542, over 4256313.62 frames. ], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:31:35,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1710012.0, ans=0.0 2023-06-24 09:31:45,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1710012.0, ans=0.125 2023-06-24 09:31:52,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=12.0 2023-06-24 09:32:47,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1710192.0, ans=0.0 2023-06-24 09:32:55,104 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.399e+02 7.494e+02 1.006e+03 1.488e+03 3.263e+03, threshold=2.013e+03, percent-clipped=2.0 2023-06-24 09:33:15,658 INFO [train.py:996] (2/4) Epoch 10, batch 10600, loss[loss=0.198, simple_loss=0.2835, pruned_loss=0.05627, over 21668.00 frames. ], tot_loss[loss=0.223, simple_loss=0.297, pruned_loss=0.07448, over 4258290.60 frames. 
], batch size: 298, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:34:25,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1710492.0, ans=0.09899494936611666 2023-06-24 09:34:59,838 INFO [train.py:996] (2/4) Epoch 10, batch 10650, loss[loss=0.1827, simple_loss=0.2415, pruned_loss=0.06191, over 21928.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3029, pruned_loss=0.07436, over 4261284.85 frames. ], batch size: 98, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:35:57,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1710732.0, ans=0.125 2023-06-24 09:36:16,430 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.325e+02 7.408e+02 1.241e+03 1.885e+03 3.956e+03, threshold=2.481e+03, percent-clipped=17.0 2023-06-24 09:36:21,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1710852.0, ans=0.125 2023-06-24 09:36:41,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1710852.0, ans=0.0 2023-06-24 09:36:44,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1710912.0, ans=0.0 2023-06-24 09:36:45,444 INFO [train.py:996] (2/4) Epoch 10, batch 10700, loss[loss=0.2543, simple_loss=0.3257, pruned_loss=0.09149, over 21654.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.3013, pruned_loss=0.07385, over 4251402.82 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:36:45,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1710912.0, ans=0.0 2023-06-24 09:37:23,719 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=15.0 2023-06-24 09:38:33,049 INFO [train.py:996] (2/4) Epoch 10, batch 10750, loss[loss=0.2675, simple_loss=0.3609, pruned_loss=0.08707, over 21655.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3127, pruned_loss=0.07865, over 4260216.57 frames. ], batch size: 414, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:39:04,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1711272.0, ans=0.125 2023-06-24 09:39:22,683 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-24 09:39:28,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1711392.0, ans=0.125 2023-06-24 09:39:50,480 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.281e+02 7.385e+02 1.038e+03 1.565e+03 3.899e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-24 09:40:08,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1711452.0, ans=0.125 2023-06-24 09:40:20,300 INFO [train.py:996] (2/4) Epoch 10, batch 10800, loss[loss=0.2372, simple_loss=0.3177, pruned_loss=0.07838, over 21725.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3164, pruned_loss=0.07947, over 4256795.10 frames. 
], batch size: 298, lr: 2.96e-03, grad_scale: 32.0 2023-06-24 09:40:40,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.62 vs. limit=10.0 2023-06-24 09:40:51,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1711632.0, ans=0.2 2023-06-24 09:41:21,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1711692.0, ans=0.125 2023-06-24 09:41:35,904 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-24 09:41:59,444 INFO [train.py:996] (2/4) Epoch 10, batch 10850, loss[loss=0.2442, simple_loss=0.3133, pruned_loss=0.08753, over 21584.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3152, pruned_loss=0.08001, over 4260455.69 frames. ], batch size: 441, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:42:09,588 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:42:56,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1711992.0, ans=0.125 2023-06-24 09:43:18,082 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.679e+02 6.562e+02 9.748e+02 1.395e+03 3.143e+03, threshold=1.950e+03, percent-clipped=4.0 2023-06-24 09:43:38,790 INFO [train.py:996] (2/4) Epoch 10, batch 10900, loss[loss=0.2133, simple_loss=0.3079, pruned_loss=0.05932, over 21736.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3086, pruned_loss=0.07769, over 4262401.38 frames. ], batch size: 351, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:43:43,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-06-24 09:43:45,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1712112.0, ans=0.1 2023-06-24 09:44:08,186 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.762e-03 2023-06-24 09:45:18,155 INFO [train.py:996] (2/4) Epoch 10, batch 10950, loss[loss=0.2123, simple_loss=0.2763, pruned_loss=0.07414, over 21496.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3042, pruned_loss=0.07579, over 4264724.20 frames. 
], batch size: 132, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:45:22,261 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 09:45:42,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1712472.0, ans=0.125 2023-06-24 09:45:54,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1712532.0, ans=0.0 2023-06-24 09:45:56,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1712532.0, ans=0.125 2023-06-24 09:46:04,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1712532.0, ans=0.1 2023-06-24 09:46:35,542 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.061e+02 6.641e+02 1.100e+03 1.576e+03 3.666e+03, threshold=2.199e+03, percent-clipped=18.0 2023-06-24 09:46:56,691 INFO [train.py:996] (2/4) Epoch 10, batch 11000, loss[loss=0.26, simple_loss=0.3254, pruned_loss=0.09732, over 21936.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3031, pruned_loss=0.07721, over 4272858.19 frames. ], batch size: 415, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:47:58,000 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=22.5 2023-06-24 09:47:59,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=12.0 2023-06-24 09:48:04,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1712892.0, ans=0.1 2023-06-24 09:48:31,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1712952.0, ans=0.0 2023-06-24 09:48:35,809 INFO [train.py:996] (2/4) Epoch 10, batch 11050, loss[loss=0.18, simple_loss=0.231, pruned_loss=0.06449, over 20075.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3011, pruned_loss=0.07828, over 4275766.80 frames. ], batch size: 704, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:48:37,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1713012.0, ans=0.2 2023-06-24 09:48:58,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1713072.0, ans=0.125 2023-06-24 09:49:00,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1713072.0, ans=0.0 2023-06-24 09:49:52,344 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.455e+02 6.589e+02 8.856e+02 1.146e+03 2.849e+03, threshold=1.771e+03, percent-clipped=5.0 2023-06-24 09:49:53,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-24 09:50:13,202 INFO [train.py:996] (2/4) Epoch 10, batch 11100, loss[loss=0.1875, simple_loss=0.2684, pruned_loss=0.05328, over 21500.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3, pruned_loss=0.07835, over 4274496.20 frames. 
], batch size: 230, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:50:49,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1713432.0, ans=0.125 2023-06-24 09:51:17,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=15.0 2023-06-24 09:51:31,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1713552.0, ans=0.0 2023-06-24 09:51:46,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1713552.0, ans=0.1 2023-06-24 09:51:54,340 INFO [train.py:996] (2/4) Epoch 10, batch 11150, loss[loss=0.3053, simple_loss=0.3675, pruned_loss=0.1215, over 21405.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2977, pruned_loss=0.07789, over 4266387.69 frames. ], batch size: 507, lr: 2.96e-03, grad_scale: 8.0 2023-06-24 09:53:11,658 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.257e+02 6.370e+02 9.682e+02 1.600e+03 2.878e+03, threshold=1.936e+03, percent-clipped=17.0 2023-06-24 09:53:24,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1713852.0, ans=0.125 2023-06-24 09:53:29,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1713852.0, ans=0.0 2023-06-24 09:53:33,206 INFO [train.py:996] (2/4) Epoch 10, batch 11200, loss[loss=0.1988, simple_loss=0.2641, pruned_loss=0.06679, over 21404.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2963, pruned_loss=0.0767, over 4262965.74 frames. ], batch size: 194, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:53:34,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.35 vs. limit=5.0 2023-06-24 09:53:39,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1713912.0, ans=0.0 2023-06-24 09:54:00,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1713972.0, ans=0.0 2023-06-24 09:54:08,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-24 09:54:45,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1714092.0, ans=0.0 2023-06-24 09:54:55,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1714152.0, ans=0.125 2023-06-24 09:55:12,169 INFO [train.py:996] (2/4) Epoch 10, batch 11250, loss[loss=0.217, simple_loss=0.2979, pruned_loss=0.06807, over 21825.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2956, pruned_loss=0.07689, over 4257795.10 frames. ], batch size: 316, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:55:26,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1714272.0, ans=0.0 2023-06-24 09:55:40,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.24 vs. 
limit=12.0 2023-06-24 09:55:41,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1714272.0, ans=0.07 2023-06-24 09:56:00,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1714332.0, ans=0.0 2023-06-24 09:56:12,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1714392.0, ans=0.1 2023-06-24 09:56:26,626 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.287e+02 6.030e+02 7.779e+02 1.066e+03 3.071e+03, threshold=1.556e+03, percent-clipped=6.0 2023-06-24 09:56:38,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1714452.0, ans=0.125 2023-06-24 09:56:47,850 INFO [train.py:996] (2/4) Epoch 10, batch 11300, loss[loss=0.2268, simple_loss=0.3244, pruned_loss=0.06459, over 19969.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.2964, pruned_loss=0.07594, over 4265808.20 frames. ], batch size: 702, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:57:10,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1714572.0, ans=0.125 2023-06-24 09:57:28,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1714632.0, ans=0.1 2023-06-24 09:58:28,620 INFO [train.py:996] (2/4) Epoch 10, batch 11350, loss[loss=0.2454, simple_loss=0.3327, pruned_loss=0.07902, over 21284.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2972, pruned_loss=0.0753, over 4269309.42 frames. ], batch size: 548, lr: 2.96e-03, grad_scale: 16.0 2023-06-24 09:58:32,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1714812.0, ans=0.1 2023-06-24 09:58:32,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1714812.0, ans=0.125 2023-06-24 09:58:35,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1714812.0, ans=0.2 2023-06-24 09:58:47,859 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.77 vs. limit=15.0 2023-06-24 09:59:04,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1714932.0, ans=0.1 2023-06-24 09:59:49,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1714992.0, ans=0.04949747468305833 2023-06-24 09:59:51,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1714992.0, ans=0.1 2023-06-24 09:59:53,964 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.336e+02 5.946e+02 8.078e+02 1.222e+03 2.329e+03, threshold=1.616e+03, percent-clipped=17.0 2023-06-24 10:00:10,825 INFO [train.py:996] (2/4) Epoch 10, batch 11400, loss[loss=0.2401, simple_loss=0.3279, pruned_loss=0.07616, over 21693.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3032, pruned_loss=0.0782, over 4276350.04 frames. 
], batch size: 332, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:00:20,920 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=2.94 vs. limit=6.0 2023-06-24 10:00:33,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1715172.0, ans=0.0 2023-06-24 10:00:38,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1715172.0, ans=0.1 2023-06-24 10:01:09,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1715232.0, ans=0.0 2023-06-24 10:01:11,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1715232.0, ans=0.125 2023-06-24 10:01:26,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1715292.0, ans=0.125 2023-06-24 10:01:49,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-24 10:01:51,720 INFO [train.py:996] (2/4) Epoch 10, batch 11450, loss[loss=0.2373, simple_loss=0.3095, pruned_loss=0.08253, over 21785.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3058, pruned_loss=0.07744, over 4275416.27 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:02:11,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1715472.0, ans=0.1 2023-06-24 10:02:29,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1715472.0, ans=0.0 2023-06-24 10:02:46,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1715532.0, ans=0.2 2023-06-24 10:02:46,720 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-24 10:02:59,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1715592.0, ans=0.0 2023-06-24 10:03:16,702 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.172e+02 6.928e+02 8.683e+02 1.191e+03 2.555e+03, threshold=1.737e+03, percent-clipped=7.0 2023-06-24 10:03:33,679 INFO [train.py:996] (2/4) Epoch 10, batch 11500, loss[loss=0.1928, simple_loss=0.2876, pruned_loss=0.04899, over 21720.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3089, pruned_loss=0.0788, over 4277755.55 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:03:46,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.82 vs. limit=15.0 2023-06-24 10:05:05,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1715952.0, ans=0.125 2023-06-24 10:05:16,079 INFO [train.py:996] (2/4) Epoch 10, batch 11550, loss[loss=0.2455, simple_loss=0.3508, pruned_loss=0.07011, over 21781.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3131, pruned_loss=0.07857, over 4271731.15 frames. 
], batch size: 332, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:05:48,429 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-24 10:06:15,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1716132.0, ans=0.125 2023-06-24 10:06:36,596 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.588e+02 7.008e+02 1.217e+03 2.057e+03 3.971e+03, threshold=2.435e+03, percent-clipped=35.0 2023-06-24 10:07:01,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1716312.0, ans=0.025 2023-06-24 10:07:02,453 INFO [train.py:996] (2/4) Epoch 10, batch 11600, loss[loss=0.2502, simple_loss=0.3481, pruned_loss=0.0762, over 21711.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3278, pruned_loss=0.08108, over 4274315.11 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:07:02,832 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:07:12,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1716312.0, ans=0.5 2023-06-24 10:07:35,006 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:07:39,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1716432.0, ans=0.125 2023-06-24 10:07:45,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1716432.0, ans=0.1 2023-06-24 10:07:52,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1716432.0, ans=0.2 2023-06-24 10:08:33,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1716552.0, ans=0.0 2023-06-24 10:08:37,727 INFO [train.py:996] (2/4) Epoch 10, batch 11650, loss[loss=0.2697, simple_loss=0.3409, pruned_loss=0.09927, over 21516.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3347, pruned_loss=0.08224, over 4269077.83 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:08:58,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1716672.0, ans=0.125 2023-06-24 10:09:27,916 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:09:46,340 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.73 vs. limit=10.0 2023-06-24 10:09:50,608 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:09:51,848 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.503e+02 6.606e+02 1.019e+03 1.524e+03 3.241e+03, threshold=2.038e+03, percent-clipped=8.0 2023-06-24 10:10:16,854 INFO [train.py:996] (2/4) Epoch 10, batch 11700, loss[loss=0.1955, simple_loss=0.2621, pruned_loss=0.06445, over 21483.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3264, pruned_loss=0.08161, over 4276492.48 frames. 
], batch size: 195, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:10:33,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1716912.0, ans=0.125 2023-06-24 10:10:44,368 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1716972.0, ans=0.125 2023-06-24 10:11:16,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.62 vs. limit=15.0 2023-06-24 10:11:55,171 INFO [train.py:996] (2/4) Epoch 10, batch 11750, loss[loss=0.2798, simple_loss=0.3452, pruned_loss=0.1072, over 21452.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3176, pruned_loss=0.08166, over 4278523.61 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:12:19,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1717272.0, ans=0.125 2023-06-24 10:13:04,682 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.43 vs. limit=22.5 2023-06-24 10:13:15,595 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.135e+02 6.168e+02 8.880e+02 1.254e+03 3.045e+03, threshold=1.776e+03, percent-clipped=3.0 2023-06-24 10:13:27,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1717452.0, ans=0.1 2023-06-24 10:13:34,939 INFO [train.py:996] (2/4) Epoch 10, batch 11800, loss[loss=0.2391, simple_loss=0.3177, pruned_loss=0.08029, over 21538.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3176, pruned_loss=0.08255, over 4280088.46 frames. ], batch size: 131, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:13:39,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.54 vs. limit=10.0 2023-06-24 10:15:18,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1717812.0, ans=0.125 2023-06-24 10:15:19,729 INFO [train.py:996] (2/4) Epoch 10, batch 11850, loss[loss=0.259, simple_loss=0.3469, pruned_loss=0.08555, over 21561.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3191, pruned_loss=0.08181, over 4280958.83 frames. ], batch size: 471, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:15:33,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1717812.0, ans=0.2 2023-06-24 10:15:47,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-24 10:16:09,314 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:16:41,801 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.280e+02 6.048e+02 1.109e+03 1.865e+03 3.176e+03, threshold=2.218e+03, percent-clipped=24.0 2023-06-24 10:17:01,675 INFO [train.py:996] (2/4) Epoch 10, batch 11900, loss[loss=0.1867, simple_loss=0.2597, pruned_loss=0.05682, over 21738.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3214, pruned_loss=0.07985, over 4273099.91 frames. 
], batch size: 124, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:17:47,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1718232.0, ans=0.125 2023-06-24 10:18:13,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1718292.0, ans=0.2 2023-06-24 10:18:42,197 INFO [train.py:996] (2/4) Epoch 10, batch 11950, loss[loss=0.1978, simple_loss=0.2916, pruned_loss=0.05199, over 21665.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3206, pruned_loss=0.07636, over 4271707.31 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:18:48,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1718412.0, ans=0.125 2023-06-24 10:20:07,038 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.251e+02 6.812e+02 1.209e+03 1.807e+03 3.979e+03, threshold=2.418e+03, percent-clipped=18.0 2023-06-24 10:20:21,812 INFO [train.py:996] (2/4) Epoch 10, batch 12000, loss[loss=0.2321, simple_loss=0.2854, pruned_loss=0.08946, over 21796.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3138, pruned_loss=0.07454, over 4251941.26 frames. ], batch size: 124, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:20:21,813 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 10:20:37,788 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2579, simple_loss=0.3537, pruned_loss=0.08105, over 1796401.00 frames. 2023-06-24 10:20:37,789 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 10:20:42,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1718712.0, ans=0.025 2023-06-24 10:20:44,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1718712.0, ans=0.2 2023-06-24 10:21:13,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1718772.0, ans=0.035 2023-06-24 10:21:30,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1718832.0, ans=0.0 2023-06-24 10:21:48,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1718892.0, ans=0.125 2023-06-24 10:22:11,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. limit=15.0 2023-06-24 10:22:17,049 INFO [train.py:996] (2/4) Epoch 10, batch 12050, loss[loss=0.2351, simple_loss=0.2946, pruned_loss=0.08778, over 21813.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3111, pruned_loss=0.07727, over 4254282.63 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:22:17,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1719012.0, ans=0.125 2023-06-24 10:22:31,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.05 vs. 
limit=15.0 2023-06-24 10:23:39,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1719192.0, ans=0.125 2023-06-24 10:23:44,268 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 7.636e+02 1.060e+03 1.403e+03 2.653e+03, threshold=2.120e+03, percent-clipped=2.0 2023-06-24 10:24:03,744 INFO [train.py:996] (2/4) Epoch 10, batch 12100, loss[loss=0.1923, simple_loss=0.2838, pruned_loss=0.05043, over 19898.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3156, pruned_loss=0.08113, over 4266554.04 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:25:53,227 INFO [train.py:996] (2/4) Epoch 10, batch 12150, loss[loss=0.2147, simple_loss=0.2898, pruned_loss=0.06984, over 21219.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3186, pruned_loss=0.08086, over 4274275.22 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 32.0 2023-06-24 10:25:53,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1719612.0, ans=0.125 2023-06-24 10:25:55,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1719612.0, ans=0.025 2023-06-24 10:26:36,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1719732.0, ans=0.2 2023-06-24 10:27:22,051 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.407e+02 7.316e+02 1.017e+03 1.333e+03 3.987e+03, threshold=2.033e+03, percent-clipped=9.0 2023-06-24 10:27:34,126 INFO [train.py:996] (2/4) Epoch 10, batch 12200, loss[loss=0.253, simple_loss=0.382, pruned_loss=0.06194, over 19687.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3168, pruned_loss=0.08007, over 4274833.00 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:28:51,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1720152.0, ans=0.025 2023-06-24 10:28:53,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1720152.0, ans=0.125 2023-06-24 10:29:06,480 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.23 vs. limit=22.5 2023-06-24 10:29:07,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1720152.0, ans=0.1 2023-06-24 10:29:13,849 INFO [train.py:996] (2/4) Epoch 10, batch 12250, loss[loss=0.2661, simple_loss=0.3925, pruned_loss=0.06983, over 19761.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3093, pruned_loss=0.0768, over 4279163.04 frames. ], batch size: 702, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:29:38,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.25 vs. 
limit=15.0 2023-06-24 10:29:47,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1720272.0, ans=0.0 2023-06-24 10:30:37,180 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.979e+02 6.057e+02 9.242e+02 1.403e+03 3.346e+03, threshold=1.848e+03, percent-clipped=10.0 2023-06-24 10:30:39,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1720452.0, ans=0.125 2023-06-24 10:30:52,740 INFO [train.py:996] (2/4) Epoch 10, batch 12300, loss[loss=0.1691, simple_loss=0.2522, pruned_loss=0.04298, over 21441.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3017, pruned_loss=0.07145, over 4283901.05 frames. ], batch size: 194, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:31:03,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1720512.0, ans=0.125 2023-06-24 10:31:04,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1720512.0, ans=0.0 2023-06-24 10:32:38,235 INFO [train.py:996] (2/4) Epoch 10, batch 12350, loss[loss=0.253, simple_loss=0.3173, pruned_loss=0.09429, over 19988.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3056, pruned_loss=0.07285, over 4278265.92 frames. ], batch size: 703, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:33:22,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1720932.0, ans=0.125 2023-06-24 10:33:22,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1720932.0, ans=0.0 2023-06-24 10:33:27,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1720932.0, ans=0.125 2023-06-24 10:33:56,480 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.324e+02 6.849e+02 9.475e+02 1.484e+03 4.503e+03, threshold=1.895e+03, percent-clipped=12.0 2023-06-24 10:34:17,405 INFO [train.py:996] (2/4) Epoch 10, batch 12400, loss[loss=0.2452, simple_loss=0.3078, pruned_loss=0.09129, over 21842.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3079, pruned_loss=0.07493, over 4278222.98 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:34:41,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1721172.0, ans=0.5 2023-06-24 10:34:46,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=12.0 2023-06-24 10:34:59,887 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.54 vs. limit=6.0 2023-06-24 10:35:00,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1721232.0, ans=0.125 2023-06-24 10:35:05,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1721232.0, ans=0.1 2023-06-24 10:35:56,473 INFO [train.py:996] (2/4) Epoch 10, batch 12450, loss[loss=0.3092, simple_loss=0.3784, pruned_loss=0.12, over 21750.00 frames. 
], tot_loss[loss=0.2337, simple_loss=0.3113, pruned_loss=0.07811, over 4275142.59 frames. ], batch size: 118, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:36:14,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1721412.0, ans=0.0 2023-06-24 10:36:32,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1721532.0, ans=0.1 2023-06-24 10:36:34,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1721532.0, ans=0.1 2023-06-24 10:37:26,122 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.427e+02 8.062e+02 1.038e+03 1.545e+03 2.621e+03, threshold=2.076e+03, percent-clipped=10.0 2023-06-24 10:37:41,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1721712.0, ans=0.125 2023-06-24 10:37:42,439 INFO [train.py:996] (2/4) Epoch 10, batch 12500, loss[loss=0.2673, simple_loss=0.3663, pruned_loss=0.08419, over 21822.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3215, pruned_loss=0.08205, over 4278978.29 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:38:26,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1721832.0, ans=0.0 2023-06-24 10:38:37,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1721832.0, ans=0.125 2023-06-24 10:38:39,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-06-24 10:38:50,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1721892.0, ans=0.07 2023-06-24 10:39:19,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.83 vs. limit=22.5 2023-06-24 10:39:23,792 INFO [train.py:996] (2/4) Epoch 10, batch 12550, loss[loss=0.2265, simple_loss=0.3056, pruned_loss=0.07371, over 21785.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3271, pruned_loss=0.08348, over 4283377.66 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:39:56,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-24 10:40:13,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1722132.0, ans=0.125 2023-06-24 10:40:48,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.945e+02 6.642e+02 8.862e+02 1.444e+03 2.963e+03, threshold=1.772e+03, percent-clipped=6.0 2023-06-24 10:40:49,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1722252.0, ans=0.05 2023-06-24 10:40:55,425 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:40:58,109 INFO [train.py:996] (2/4) Epoch 10, batch 12600, loss[loss=0.2145, simple_loss=0.3004, pruned_loss=0.06435, over 20805.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.324, pruned_loss=0.0819, over 4273445.51 frames. 
], batch size: 608, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:41:17,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1722372.0, ans=0.125 2023-06-24 10:41:24,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1722372.0, ans=0.125 2023-06-24 10:41:47,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-24 10:41:58,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1722492.0, ans=0.0 2023-06-24 10:42:22,381 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1722552.0, ans=0.0 2023-06-24 10:42:32,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1722552.0, ans=0.125 2023-06-24 10:42:36,259 INFO [train.py:996] (2/4) Epoch 10, batch 12650, loss[loss=0.2057, simple_loss=0.2776, pruned_loss=0.06685, over 21821.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3176, pruned_loss=0.07856, over 4277248.68 frames. ], batch size: 298, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:42:43,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1722612.0, ans=0.0 2023-06-24 10:43:23,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1722732.0, ans=0.125 2023-06-24 10:43:58,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1722852.0, ans=0.1 2023-06-24 10:44:06,394 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.090e+02 7.127e+02 1.026e+03 1.420e+03 2.601e+03, threshold=2.052e+03, percent-clipped=16.0 2023-06-24 10:44:16,309 INFO [train.py:996] (2/4) Epoch 10, batch 12700, loss[loss=0.2397, simple_loss=0.3124, pruned_loss=0.08344, over 21506.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3173, pruned_loss=0.08051, over 4279058.21 frames. ], batch size: 211, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:44:45,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1722972.0, ans=0.0 2023-06-24 10:45:23,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1723092.0, ans=0.07 2023-06-24 10:45:39,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1723152.0, ans=10.0 2023-06-24 10:45:51,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1723152.0, ans=0.1 2023-06-24 10:45:52,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1723152.0, ans=0.125 2023-06-24 10:45:54,644 INFO [train.py:996] (2/4) Epoch 10, batch 12750, loss[loss=0.2338, simple_loss=0.3143, pruned_loss=0.07661, over 21802.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3181, pruned_loss=0.08097, over 4280194.74 frames. 
], batch size: 282, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:45:58,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1723212.0, ans=10.0 2023-06-24 10:46:35,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1723332.0, ans=0.125 2023-06-24 10:47:18,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 6.526e+02 8.799e+02 1.342e+03 3.585e+03, threshold=1.760e+03, percent-clipped=6.0 2023-06-24 10:47:33,305 INFO [train.py:996] (2/4) Epoch 10, batch 12800, loss[loss=0.266, simple_loss=0.363, pruned_loss=0.08447, over 19889.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.318, pruned_loss=0.08169, over 4279827.09 frames. ], batch size: 703, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:48:07,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1723572.0, ans=0.125 2023-06-24 10:48:46,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1723692.0, ans=0.125 2023-06-24 10:49:03,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1723752.0, ans=0.125 2023-06-24 10:49:18,842 INFO [train.py:996] (2/4) Epoch 10, batch 12850, loss[loss=0.2413, simple_loss=0.3139, pruned_loss=0.08438, over 20141.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3194, pruned_loss=0.08303, over 4280420.78 frames. ], batch size: 704, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:49:21,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1723812.0, ans=0.125 2023-06-24 10:49:21,527 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=7.63 vs. limit=12.0 2023-06-24 10:49:37,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.77 vs. limit=15.0 2023-06-24 10:50:48,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.383e+02 5.816e+02 7.846e+02 1.216e+03 2.443e+03, threshold=1.569e+03, percent-clipped=11.0 2023-06-24 10:51:02,580 INFO [train.py:996] (2/4) Epoch 10, batch 12900, loss[loss=0.2129, simple_loss=0.2831, pruned_loss=0.07133, over 21266.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3175, pruned_loss=0.08006, over 4276765.21 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 10:52:42,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1724412.0, ans=0.2 2023-06-24 10:52:43,481 INFO [train.py:996] (2/4) Epoch 10, batch 12950, loss[loss=0.2388, simple_loss=0.3183, pruned_loss=0.07961, over 21718.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3161, pruned_loss=0.07858, over 4274413.63 frames. ], batch size: 332, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:53:01,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.63 vs. limit=15.0 2023-06-24 10:53:01,608 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.76 vs. 
limit=22.5 2023-06-24 10:53:24,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1724532.0, ans=0.125 2023-06-24 10:53:34,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1724532.0, ans=0.025 2023-06-24 10:54:02,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1724592.0, ans=0.2 2023-06-24 10:54:02,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1724592.0, ans=0.0 2023-06-24 10:54:15,808 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.837e+02 8.466e+02 1.346e+03 1.826e+03 3.659e+03, threshold=2.691e+03, percent-clipped=37.0 2023-06-24 10:54:23,654 INFO [train.py:996] (2/4) Epoch 10, batch 13000, loss[loss=0.1937, simple_loss=0.2872, pruned_loss=0.05014, over 21619.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3148, pruned_loss=0.0786, over 4270542.53 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:54:29,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1724712.0, ans=0.07 2023-06-24 10:54:31,519 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:54:47,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1724772.0, ans=0.125 2023-06-24 10:54:48,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1724772.0, ans=0.0 2023-06-24 10:55:25,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1724892.0, ans=0.125 2023-06-24 10:55:28,389 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.31 vs. limit=15.0 2023-06-24 10:55:30,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1724892.0, ans=0.125 2023-06-24 10:56:01,963 INFO [train.py:996] (2/4) Epoch 10, batch 13050, loss[loss=0.2128, simple_loss=0.2822, pruned_loss=0.07169, over 21807.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3106, pruned_loss=0.07646, over 4269829.96 frames. ], batch size: 247, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:56:11,298 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0 2023-06-24 10:56:21,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1725072.0, ans=0.0 2023-06-24 10:56:36,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1725072.0, ans=0.125 2023-06-24 10:57:32,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1725252.0, ans=0.125 2023-06-24 10:57:33,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.137e+02 5.824e+02 8.081e+02 1.133e+03 2.445e+03, threshold=1.616e+03, percent-clipped=0.0 2023-06-24 10:57:41,719 INFO [train.py:996] (2/4) Epoch 10, batch 13100, loss[loss=0.2388, simple_loss=0.3137, pruned_loss=0.08194, over 21328.00 frames. 
], tot_loss[loss=0.232, simple_loss=0.3114, pruned_loss=0.07636, over 4281455.47 frames. ], batch size: 176, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:58:08,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1725372.0, ans=0.0 2023-06-24 10:58:51,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1725492.0, ans=0.035 2023-06-24 10:59:03,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1725492.0, ans=0.125 2023-06-24 10:59:04,758 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 10:59:26,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1725612.0, ans=0.2 2023-06-24 10:59:27,989 INFO [train.py:996] (2/4) Epoch 10, batch 13150, loss[loss=0.2629, simple_loss=0.3225, pruned_loss=0.1016, over 21848.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3146, pruned_loss=0.07907, over 4282333.26 frames. ], batch size: 124, lr: 2.95e-03, grad_scale: 8.0 2023-06-24 10:59:28,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1725612.0, ans=0.125 2023-06-24 10:59:30,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1725612.0, ans=0.125 2023-06-24 10:59:33,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1725612.0, ans=0.125 2023-06-24 10:59:35,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1725612.0, ans=0.0 2023-06-24 10:59:49,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1725672.0, ans=0.125 2023-06-24 10:59:55,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1725672.0, ans=0.0 2023-06-24 11:00:30,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1725792.0, ans=0.2 2023-06-24 11:00:35,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1725792.0, ans=0.0 2023-06-24 11:00:36,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1725792.0, ans=0.125 2023-06-24 11:00:50,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1725852.0, ans=0.1 2023-06-24 11:00:56,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1725852.0, ans=0.125 2023-06-24 11:00:59,518 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.653e+02 8.375e+02 1.328e+03 1.823e+03 3.736e+03, threshold=2.655e+03, percent-clipped=31.0 2023-06-24 11:01:06,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1725912.0, ans=0.2 2023-06-24 11:01:07,596 INFO [train.py:996] (2/4) Epoch 10, batch 13200, loss[loss=0.2603, simple_loss=0.3304, pruned_loss=0.09516, over 21578.00 frames. 
], tot_loss[loss=0.2353, simple_loss=0.3131, pruned_loss=0.07878, over 4278879.75 frames. ], batch size: 389, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:01:54,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1726032.0, ans=0.0 2023-06-24 11:02:52,527 INFO [train.py:996] (2/4) Epoch 10, batch 13250, loss[loss=0.2728, simple_loss=0.3382, pruned_loss=0.1037, over 21697.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3133, pruned_loss=0.0809, over 4286915.74 frames. ], batch size: 441, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:03:09,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=15.0 2023-06-24 11:03:26,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1726272.0, ans=0.125 2023-06-24 11:03:57,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1726392.0, ans=0.125 2023-06-24 11:04:01,002 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.01 vs. limit=15.0 2023-06-24 11:04:20,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1726452.0, ans=0.5 2023-06-24 11:04:24,290 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.098e+02 9.382e+02 1.293e+03 1.907e+03 4.949e+03, threshold=2.585e+03, percent-clipped=10.0 2023-06-24 11:04:32,041 INFO [train.py:996] (2/4) Epoch 10, batch 13300, loss[loss=0.2453, simple_loss=0.3332, pruned_loss=0.07871, over 21798.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3169, pruned_loss=0.08154, over 4286498.69 frames. ], batch size: 282, lr: 2.95e-03, grad_scale: 16.0 2023-06-24 11:05:00,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1726572.0, ans=10.0 2023-06-24 11:05:39,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1726692.0, ans=0.125 2023-06-24 11:05:46,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1726692.0, ans=0.1 2023-06-24 11:06:03,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1726752.0, ans=0.125 2023-06-24 11:06:14,231 INFO [train.py:996] (2/4) Epoch 10, batch 13350, loss[loss=0.2668, simple_loss=0.3431, pruned_loss=0.09524, over 21790.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3211, pruned_loss=0.08334, over 4278693.47 frames. 
], batch size: 124, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:06:25,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1726812.0, ans=0.125 2023-06-24 11:06:41,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1726872.0, ans=0.125 2023-06-24 11:06:51,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1726932.0, ans=0.2 2023-06-24 11:06:58,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1726932.0, ans=0.1 2023-06-24 11:07:02,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1726932.0, ans=0.125 2023-06-24 11:07:39,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.530e+02 6.336e+02 8.525e+02 1.254e+03 2.418e+03, threshold=1.705e+03, percent-clipped=0.0 2023-06-24 11:07:49,737 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:07:50,818 INFO [train.py:996] (2/4) Epoch 10, batch 13400, loss[loss=0.2183, simple_loss=0.2939, pruned_loss=0.07131, over 21913.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3226, pruned_loss=0.08413, over 4268233.71 frames. ], batch size: 316, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:08:12,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1727172.0, ans=0.0 2023-06-24 11:08:28,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1727232.0, ans=0.0 2023-06-24 11:09:17,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1727352.0, ans=0.125 2023-06-24 11:09:27,801 INFO [train.py:996] (2/4) Epoch 10, batch 13450, loss[loss=0.3065, simple_loss=0.364, pruned_loss=0.1246, over 21456.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3244, pruned_loss=0.0858, over 4266258.89 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:09:34,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1727412.0, ans=0.125 2023-06-24 11:09:37,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1727412.0, ans=0.125 2023-06-24 11:09:42,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1727472.0, ans=0.125 2023-06-24 11:09:49,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1727472.0, ans=0.0 2023-06-24 11:10:25,846 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.79 vs. 
limit=22.5 2023-06-24 11:10:59,926 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.972e+02 7.908e+02 1.192e+03 1.823e+03 3.915e+03, threshold=2.384e+03, percent-clipped=24.0 2023-06-24 11:11:00,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1727652.0, ans=0.1 2023-06-24 11:11:00,398 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:11:01,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1727652.0, ans=0.125 2023-06-24 11:11:06,201 INFO [train.py:996] (2/4) Epoch 10, batch 13500, loss[loss=0.1715, simple_loss=0.218, pruned_loss=0.06247, over 21830.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3146, pruned_loss=0.08205, over 4257143.62 frames. ], batch size: 102, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:11:24,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1727712.0, ans=0.0 2023-06-24 11:11:38,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1727772.0, ans=0.2 2023-06-24 11:11:54,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1727832.0, ans=0.0 2023-06-24 11:12:18,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1727892.0, ans=0.0 2023-06-24 11:12:31,041 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-24 11:12:50,289 INFO [train.py:996] (2/4) Epoch 10, batch 13550, loss[loss=0.3898, simple_loss=0.4523, pruned_loss=0.1637, over 21503.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3183, pruned_loss=0.08244, over 4264011.07 frames. ], batch size: 507, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:13:14,983 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:14:14,891 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.411e+02 7.999e+02 1.250e+03 1.835e+03 3.854e+03, threshold=2.499e+03, percent-clipped=11.0 2023-06-24 11:14:21,338 INFO [train.py:996] (2/4) Epoch 10, batch 13600, loss[loss=0.2297, simple_loss=0.3104, pruned_loss=0.07449, over 21877.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.318, pruned_loss=0.08276, over 4271407.54 frames. ], batch size: 371, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:14:31,678 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.98 vs. limit=10.0 2023-06-24 11:16:02,445 INFO [train.py:996] (2/4) Epoch 10, batch 13650, loss[loss=0.2033, simple_loss=0.2683, pruned_loss=0.06915, over 21483.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3129, pruned_loss=0.07998, over 4276046.43 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:16:18,618 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.01 vs. 
limit=15.0 2023-06-24 11:16:34,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1728672.0, ans=0.0 2023-06-24 11:16:46,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.15 vs. limit=15.0 2023-06-24 11:16:55,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1728732.0, ans=0.2 2023-06-24 11:17:10,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1728792.0, ans=0.0 2023-06-24 11:17:20,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1728852.0, ans=0.2 2023-06-24 11:17:29,427 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.339e+02 6.847e+02 1.012e+03 1.771e+03 3.769e+03, threshold=2.024e+03, percent-clipped=10.0 2023-06-24 11:17:38,779 INFO [train.py:996] (2/4) Epoch 10, batch 13700, loss[loss=0.2806, simple_loss=0.3526, pruned_loss=0.1043, over 21630.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.309, pruned_loss=0.08008, over 4263566.69 frames. ], batch size: 441, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:18:11,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1728972.0, ans=0.1 2023-06-24 11:19:16,499 INFO [train.py:996] (2/4) Epoch 10, batch 13750, loss[loss=0.2091, simple_loss=0.2852, pruned_loss=0.06645, over 21584.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3046, pruned_loss=0.07871, over 4254482.73 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:19:57,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1729272.0, ans=0.2 2023-06-24 11:20:56,600 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.866e+02 6.962e+02 1.196e+03 1.876e+03 4.514e+03, threshold=2.392e+03, percent-clipped=21.0 2023-06-24 11:21:05,657 INFO [train.py:996] (2/4) Epoch 10, batch 13800, loss[loss=0.2288, simple_loss=0.3293, pruned_loss=0.06411, over 21579.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3096, pruned_loss=0.07688, over 4255674.50 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:21:13,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1729512.0, ans=0.0 2023-06-24 11:21:14,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.17 vs. limit=15.0 2023-06-24 11:21:28,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1729572.0, ans=0.0 2023-06-24 11:21:37,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.53 vs. limit=10.0 2023-06-24 11:22:06,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-24 11:22:21,844 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.48 vs. 
limit=6.0 2023-06-24 11:22:49,297 INFO [train.py:996] (2/4) Epoch 10, batch 13850, loss[loss=0.25, simple_loss=0.3259, pruned_loss=0.08706, over 21628.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3178, pruned_loss=0.07868, over 4256742.90 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:22:58,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1729812.0, ans=0.125 2023-06-24 11:23:08,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1729872.0, ans=0.1 2023-06-24 11:23:47,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1729992.0, ans=0.125 2023-06-24 11:23:47,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1729992.0, ans=0.1 2023-06-24 11:24:05,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1730052.0, ans=0.0 2023-06-24 11:24:21,350 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 7.694e+02 1.020e+03 1.467e+03 3.637e+03, threshold=2.040e+03, percent-clipped=4.0 2023-06-24 11:24:25,889 INFO [train.py:996] (2/4) Epoch 10, batch 13900, loss[loss=0.2248, simple_loss=0.2981, pruned_loss=0.07571, over 21633.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3208, pruned_loss=0.08176, over 4263457.56 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:24:34,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=13.93 vs. limit=15.0 2023-06-24 11:24:43,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1730172.0, ans=0.1 2023-06-24 11:25:09,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1730232.0, ans=0.0 2023-06-24 11:26:02,520 INFO [train.py:996] (2/4) Epoch 10, batch 13950, loss[loss=0.287, simple_loss=0.4055, pruned_loss=0.08423, over 19902.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3211, pruned_loss=0.08366, over 4271740.86 frames. ], batch size: 702, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:26:14,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.23 vs. limit=15.0 2023-06-24 11:26:51,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1730532.0, ans=0.1 2023-06-24 11:27:16,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1730652.0, ans=0.0 2023-06-24 11:27:24,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1730652.0, ans=0.125 2023-06-24 11:27:33,031 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.002e+02 6.836e+02 9.588e+02 1.525e+03 4.378e+03, threshold=1.918e+03, percent-clipped=10.0 2023-06-24 11:27:36,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1730712.0, ans=0.0 2023-06-24 11:27:37,557 INFO [train.py:996] (2/4) Epoch 10, batch 14000, loss[loss=0.2399, simple_loss=0.3336, pruned_loss=0.0731, over 21578.00 frames. 
], tot_loss[loss=0.2417, simple_loss=0.32, pruned_loss=0.08172, over 4272626.88 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:28:07,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1730832.0, ans=0.1 2023-06-24 11:28:14,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1730832.0, ans=0.125 2023-06-24 11:28:40,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1730892.0, ans=0.125 2023-06-24 11:28:48,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1730892.0, ans=0.0 2023-06-24 11:29:11,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1731012.0, ans=0.1 2023-06-24 11:29:12,985 INFO [train.py:996] (2/4) Epoch 10, batch 14050, loss[loss=0.1953, simple_loss=0.2611, pruned_loss=0.06473, over 21611.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3152, pruned_loss=0.07842, over 4273580.96 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:29:33,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1731072.0, ans=0.025 2023-06-24 11:29:35,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0 2023-06-24 11:29:36,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1731072.0, ans=0.2 2023-06-24 11:29:42,957 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.91 vs. limit=10.0 2023-06-24 11:29:59,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1731132.0, ans=0.2 2023-06-24 11:30:45,124 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.778e+02 8.037e+02 1.202e+03 1.798e+03 5.374e+03, threshold=2.404e+03, percent-clipped=21.0 2023-06-24 11:30:47,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1731312.0, ans=0.0 2023-06-24 11:30:48,132 INFO [train.py:996] (2/4) Epoch 10, batch 14100, loss[loss=0.2401, simple_loss=0.3132, pruned_loss=0.08346, over 21697.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3111, pruned_loss=0.07872, over 4268880.80 frames. ], batch size: 332, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:31:13,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1731372.0, ans=0.125 2023-06-24 11:31:14,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1731372.0, ans=0.0 2023-06-24 11:31:18,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1731432.0, ans=0.0 2023-06-24 11:31:18,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.48 vs. 
limit=15.0 2023-06-24 11:31:59,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1731492.0, ans=0.125 2023-06-24 11:32:23,928 INFO [train.py:996] (2/4) Epoch 10, batch 14150, loss[loss=0.2531, simple_loss=0.3285, pruned_loss=0.08886, over 21886.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3136, pruned_loss=0.07926, over 4267121.98 frames. ], batch size: 98, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:33:09,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1731732.0, ans=0.2 2023-06-24 11:33:22,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1731792.0, ans=0.0 2023-06-24 11:33:50,831 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.922e+02 6.112e+02 7.492e+02 9.193e+02 1.976e+03, threshold=1.498e+03, percent-clipped=0.0 2023-06-24 11:33:58,955 INFO [train.py:996] (2/4) Epoch 10, batch 14200, loss[loss=0.1864, simple_loss=0.2855, pruned_loss=0.04365, over 21662.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3113, pruned_loss=0.07835, over 4263111.46 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:34:02,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1731912.0, ans=0.125 2023-06-24 11:34:48,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1732092.0, ans=0.2 2023-06-24 11:35:29,035 INFO [train.py:996] (2/4) Epoch 10, batch 14250, loss[loss=0.1797, simple_loss=0.2711, pruned_loss=0.04414, over 21738.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3058, pruned_loss=0.07781, over 4260596.75 frames. ], batch size: 351, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:36:00,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1732272.0, ans=0.1 2023-06-24 11:36:37,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1732392.0, ans=0.125 2023-06-24 11:37:05,368 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.032e+02 5.862e+02 7.753e+02 1.347e+03 3.974e+03, threshold=1.551e+03, percent-clipped=20.0 2023-06-24 11:37:08,444 INFO [train.py:996] (2/4) Epoch 10, batch 14300, loss[loss=0.201, simple_loss=0.2818, pruned_loss=0.06008, over 17938.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3083, pruned_loss=0.07764, over 4255700.20 frames. 
], batch size: 70, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:37:44,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1732632.0, ans=0.0 2023-06-24 11:37:49,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1732632.0, ans=0.125 2023-06-24 11:38:19,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.whiten.whitening_limit, batch_count=1732692.0, ans=12.0 2023-06-24 11:38:37,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1732752.0, ans=0.125 2023-06-24 11:38:39,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-24 11:38:44,739 INFO [train.py:996] (2/4) Epoch 10, batch 14350, loss[loss=0.208, simple_loss=0.2912, pruned_loss=0.06244, over 21946.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3084, pruned_loss=0.07633, over 4245910.58 frames. ], batch size: 316, lr: 2.94e-03, grad_scale: 8.0 2023-06-24 11:39:47,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1732992.0, ans=0.0 2023-06-24 11:39:48,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1732992.0, ans=0.125 2023-06-24 11:40:08,795 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-24 11:40:09,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1733052.0, ans=0.125 2023-06-24 11:40:17,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.258e+02 7.078e+02 1.011e+03 1.349e+03 3.463e+03, threshold=2.022e+03, percent-clipped=22.0 2023-06-24 11:40:25,257 INFO [train.py:996] (2/4) Epoch 10, batch 14400, loss[loss=0.2269, simple_loss=0.289, pruned_loss=0.08238, over 21243.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.307, pruned_loss=0.07745, over 4251880.81 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:40:32,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1733112.0, ans=0.125 2023-06-24 11:40:37,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1733112.0, ans=0.125 2023-06-24 11:40:58,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1733232.0, ans=0.1 2023-06-24 11:41:24,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1733292.0, ans=0.0 2023-06-24 11:41:36,606 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:41:54,331 INFO [train.py:996] (2/4) Epoch 10, batch 14450, loss[loss=0.2074, simple_loss=0.2699, pruned_loss=0.07241, over 21646.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3025, pruned_loss=0.07761, over 4251505.45 frames. 
], batch size: 298, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:43:23,366 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.538e+02 6.420e+02 9.253e+02 1.365e+03 3.104e+03, threshold=1.851e+03, percent-clipped=3.0 2023-06-24 11:43:23,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1733652.0, ans=0.125 2023-06-24 11:43:26,353 INFO [train.py:996] (2/4) Epoch 10, batch 14500, loss[loss=0.2687, simple_loss=0.3287, pruned_loss=0.1043, over 21297.00 frames. ], tot_loss[loss=0.227, simple_loss=0.2995, pruned_loss=0.07722, over 4256578.91 frames. ], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:44:09,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1733832.0, ans=0.0 2023-06-24 11:44:28,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1733892.0, ans=0.0 2023-06-24 11:44:39,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1733892.0, ans=0.0 2023-06-24 11:45:08,968 INFO [train.py:996] (2/4) Epoch 10, batch 14550, loss[loss=0.2614, simple_loss=0.3364, pruned_loss=0.09317, over 21595.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.304, pruned_loss=0.07851, over 4266434.13 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:45:11,069 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 11:45:15,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1734012.0, ans=0.125 2023-06-24 11:45:28,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1734072.0, ans=0.0 2023-06-24 11:46:06,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1734192.0, ans=0.125 2023-06-24 11:46:42,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.617e+02 6.748e+02 9.842e+02 1.367e+03 3.226e+03, threshold=1.968e+03, percent-clipped=9.0 2023-06-24 11:46:45,768 INFO [train.py:996] (2/4) Epoch 10, batch 14600, loss[loss=0.2526, simple_loss=0.3409, pruned_loss=0.08212, over 21650.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.314, pruned_loss=0.08275, over 4268449.53 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:47:05,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1734372.0, ans=0.0 2023-06-24 11:47:05,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1734372.0, ans=0.125 2023-06-24 11:47:12,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=15.0 2023-06-24 11:47:42,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. 
limit=10.0 2023-06-24 11:47:54,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1734492.0, ans=0.125 2023-06-24 11:48:02,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1734552.0, ans=0.125 2023-06-24 11:48:21,383 INFO [train.py:996] (2/4) Epoch 10, batch 14650, loss[loss=0.2244, simple_loss=0.308, pruned_loss=0.07045, over 21669.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3154, pruned_loss=0.08189, over 4275744.20 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:48:23,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1734612.0, ans=0.0 2023-06-24 11:48:33,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1734612.0, ans=22.5 2023-06-24 11:49:09,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-24 11:49:09,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.92 vs. limit=12.0 2023-06-24 11:49:21,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1734792.0, ans=0.0 2023-06-24 11:49:54,753 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.330e+02 6.829e+02 9.850e+02 1.571e+03 3.523e+03, threshold=1.970e+03, percent-clipped=13.0 2023-06-24 11:49:57,774 INFO [train.py:996] (2/4) Epoch 10, batch 14700, loss[loss=0.2333, simple_loss=0.33, pruned_loss=0.06824, over 21781.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3098, pruned_loss=0.07674, over 4267033.25 frames. ], batch size: 351, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:50:03,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1734912.0, ans=0.125 2023-06-24 11:50:09,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1734912.0, ans=0.04949747468305833 2023-06-24 11:50:29,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1734972.0, ans=0.2 2023-06-24 11:51:18,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.33 vs. limit=15.0 2023-06-24 11:51:36,590 INFO [train.py:996] (2/4) Epoch 10, batch 14750, loss[loss=0.2849, simple_loss=0.3668, pruned_loss=0.1015, over 21562.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3164, pruned_loss=0.08049, over 4269719.18 frames. 
], batch size: 414, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 11:52:05,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1735272.0, ans=0.0 2023-06-24 11:52:51,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1735392.0, ans=0.07 2023-06-24 11:52:53,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735392.0, ans=0.1 2023-06-24 11:53:06,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1735452.0, ans=6.0 2023-06-24 11:53:10,539 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.743e+02 7.762e+02 1.072e+03 1.702e+03 3.196e+03, threshold=2.144e+03, percent-clipped=17.0 2023-06-24 11:53:13,744 INFO [train.py:996] (2/4) Epoch 10, batch 14800, loss[loss=0.2512, simple_loss=0.3245, pruned_loss=0.08896, over 21448.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3253, pruned_loss=0.08492, over 4264357.50 frames. ], batch size: 389, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:53:32,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.43 vs. limit=15.0 2023-06-24 11:53:38,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.11 vs. limit=15.0 2023-06-24 11:54:17,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1735632.0, ans=0.025 2023-06-24 11:54:32,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1735692.0, ans=0.0 2023-06-24 11:54:39,650 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.75 vs. limit=8.0 2023-06-24 11:54:58,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735752.0, ans=0.1 2023-06-24 11:55:02,597 INFO [train.py:996] (2/4) Epoch 10, batch 14850, loss[loss=0.2167, simple_loss=0.2795, pruned_loss=0.07694, over 21676.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3204, pruned_loss=0.08425, over 4262308.37 frames. ], batch size: 299, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:55:33,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1735872.0, ans=0.0 2023-06-24 11:55:53,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1735932.0, ans=0.1 2023-06-24 11:55:55,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1735932.0, ans=0.2 2023-06-24 11:56:14,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1735992.0, ans=0.0 2023-06-24 11:56:37,527 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.296e+02 7.240e+02 1.054e+03 1.566e+03 3.588e+03, threshold=2.108e+03, percent-clipped=9.0 2023-06-24 11:56:40,614 INFO [train.py:996] (2/4) Epoch 10, batch 14900, loss[loss=0.2436, simple_loss=0.3192, pruned_loss=0.08395, over 21506.00 frames. 
], tot_loss[loss=0.2468, simple_loss=0.3223, pruned_loss=0.08565, over 4268056.67 frames. ], batch size: 194, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:57:25,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1736232.0, ans=0.125 2023-06-24 11:57:38,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1736232.0, ans=0.1 2023-06-24 11:57:44,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1736232.0, ans=0.035 2023-06-24 11:57:53,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1736292.0, ans=0.125 2023-06-24 11:57:55,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1736292.0, ans=0.125 2023-06-24 11:57:55,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1736292.0, ans=0.025 2023-06-24 11:58:25,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1736352.0, ans=0.125 2023-06-24 11:58:28,352 INFO [train.py:996] (2/4) Epoch 10, batch 14950, loss[loss=0.2309, simple_loss=0.3124, pruned_loss=0.07466, over 21617.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3224, pruned_loss=0.08458, over 4264642.32 frames. ], batch size: 263, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 11:58:34,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736412.0, ans=0.1 2023-06-24 11:58:51,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1736472.0, ans=0.1 2023-06-24 11:58:55,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1736472.0, ans=0.0 2023-06-24 11:59:29,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1736592.0, ans=0.0 2023-06-24 11:59:30,061 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.19 vs. limit=15.0 2023-06-24 12:00:04,934 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.178e+02 7.042e+02 9.483e+02 1.424e+03 2.881e+03, threshold=1.897e+03, percent-clipped=9.0 2023-06-24 12:00:06,677 INFO [train.py:996] (2/4) Epoch 10, batch 15000, loss[loss=0.3021, simple_loss=0.3616, pruned_loss=0.1213, over 21613.00 frames. ], tot_loss[loss=0.2489, simple_loss=0.3248, pruned_loss=0.08652, over 4266365.24 frames. ], batch size: 508, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:00:06,678 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 12:00:21,973 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.4.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([2.1184, 3.5989, 3.5769, 2.0904], device='cuda:2') 2023-06-24 12:00:22,754 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2522, simple_loss=0.3488, pruned_loss=0.07776, over 1796401.00 frames. 
2023-06-24 12:00:22,755 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 12:00:34,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1736712.0, ans=0.125 2023-06-24 12:00:54,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1736772.0, ans=0.0 2023-06-24 12:01:02,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1736832.0, ans=0.035 2023-06-24 12:01:11,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-24 12:02:00,956 INFO [train.py:996] (2/4) Epoch 10, batch 15050, loss[loss=0.2459, simple_loss=0.3272, pruned_loss=0.08226, over 21608.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3268, pruned_loss=0.0874, over 4259871.55 frames. ], batch size: 230, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:02:36,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1737072.0, ans=0.2 2023-06-24 12:02:46,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1737132.0, ans=0.025 2023-06-24 12:03:15,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1737192.0, ans=10.0 2023-06-24 12:03:35,009 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-24 12:03:36,993 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 8.670e+02 1.430e+03 2.221e+03 3.965e+03, threshold=2.861e+03, percent-clipped=33.0 2023-06-24 12:03:38,474 INFO [train.py:996] (2/4) Epoch 10, batch 15100, loss[loss=0.2437, simple_loss=0.3196, pruned_loss=0.08387, over 21835.00 frames. ], tot_loss[loss=0.252, simple_loss=0.3304, pruned_loss=0.08681, over 4257134.18 frames. ], batch size: 282, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:03:42,254 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1737312.0, ans=0.125 2023-06-24 12:04:06,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.95 vs. limit=15.0 2023-06-24 12:04:13,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1737372.0, ans=0.125 2023-06-24 12:04:29,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.80 vs. limit=15.0 2023-06-24 12:04:46,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1737492.0, ans=0.125 2023-06-24 12:04:54,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1737492.0, ans=0.1 2023-06-24 12:05:15,748 INFO [train.py:996] (2/4) Epoch 10, batch 15150, loss[loss=0.2907, simple_loss=0.3331, pruned_loss=0.1241, over 21214.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3268, pruned_loss=0.08734, over 4254297.31 frames. 
], batch size: 471, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:05:32,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.min_positive, batch_count=1737612.0, ans=0.05 2023-06-24 12:05:56,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1737672.0, ans=0.2 2023-06-24 12:06:11,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1737732.0, ans=0.2 2023-06-24 12:06:42,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.62 vs. limit=15.0 2023-06-24 12:06:55,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.279e+02 6.808e+02 1.091e+03 1.698e+03 5.270e+03, threshold=2.181e+03, percent-clipped=2.0 2023-06-24 12:07:02,090 INFO [train.py:996] (2/4) Epoch 10, batch 15200, loss[loss=0.2154, simple_loss=0.2772, pruned_loss=0.07678, over 21209.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3181, pruned_loss=0.08383, over 4258549.64 frames. ], batch size: 176, lr: 2.94e-03, grad_scale: 32.0 2023-06-24 12:07:21,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1737972.0, ans=0.0 2023-06-24 12:08:11,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1738152.0, ans=0.0 2023-06-24 12:08:23,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.38 vs. limit=6.0 2023-06-24 12:08:32,600 INFO [train.py:996] (2/4) Epoch 10, batch 15250, loss[loss=0.2278, simple_loss=0.2881, pruned_loss=0.08373, over 21744.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3122, pruned_loss=0.0832, over 4266404.52 frames. ], batch size: 112, lr: 2.94e-03, grad_scale: 16.0 2023-06-24 12:09:01,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1738272.0, ans=0.125 2023-06-24 12:09:10,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1738272.0, ans=0.1 2023-06-24 12:09:43,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1738392.0, ans=0.07 2023-06-24 12:09:58,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1738452.0, ans=0.2 2023-06-24 12:10:17,222 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.809e+02 9.550e+02 1.644e+03 2.423e+03 4.460e+03, threshold=3.287e+03, percent-clipped=35.0 2023-06-24 12:10:17,243 INFO [train.py:996] (2/4) Epoch 10, batch 15300, loss[loss=0.2664, simple_loss=0.3339, pruned_loss=0.09941, over 21735.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3147, pruned_loss=0.0861, over 4274671.41 frames. 
], batch size: 298, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:10:33,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1738572.0, ans=0.1 2023-06-24 12:10:55,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1738572.0, ans=0.1 2023-06-24 12:11:08,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1738632.0, ans=0.125 2023-06-24 12:11:31,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1738752.0, ans=0.2 2023-06-24 12:11:54,379 INFO [train.py:996] (2/4) Epoch 10, batch 15350, loss[loss=0.2847, simple_loss=0.3651, pruned_loss=0.1022, over 21448.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3205, pruned_loss=0.08819, over 4271206.89 frames. ], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:11:59,894 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-24 12:12:06,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1738812.0, ans=0.125 2023-06-24 12:12:57,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.97 vs. limit=15.0 2023-06-24 12:13:09,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1739052.0, ans=0.125 2023-06-24 12:13:24,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 7.880e+02 1.088e+03 1.633e+03 3.514e+03, threshold=2.175e+03, percent-clipped=1.0 2023-06-24 12:13:24,439 INFO [train.py:996] (2/4) Epoch 10, batch 15400, loss[loss=0.2265, simple_loss=0.3049, pruned_loss=0.0741, over 21796.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3224, pruned_loss=0.0861, over 4266904.63 frames. ], batch size: 247, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:13:38,990 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-24 12:15:05,375 INFO [train.py:996] (2/4) Epoch 10, batch 15450, loss[loss=0.2023, simple_loss=0.2828, pruned_loss=0.0609, over 15765.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3201, pruned_loss=0.08358, over 4256244.40 frames. ], batch size: 60, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:15:09,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1739412.0, ans=0.2 2023-06-24 12:15:12,162 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:15:51,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.16 vs. limit=22.5 2023-06-24 12:16:43,395 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 7.285e+02 1.027e+03 1.632e+03 3.153e+03, threshold=2.054e+03, percent-clipped=10.0 2023-06-24 12:16:43,416 INFO [train.py:996] (2/4) Epoch 10, batch 15500, loss[loss=0.2779, simple_loss=0.351, pruned_loss=0.1024, over 21591.00 frames. 
], tot_loss[loss=0.2458, simple_loss=0.3234, pruned_loss=0.08411, over 4258533.79 frames. ], batch size: 389, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:17:08,148 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:17:25,012 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.43 vs. limit=22.5 2023-06-24 12:18:15,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1739952.0, ans=0.0 2023-06-24 12:18:21,656 INFO [train.py:996] (2/4) Epoch 10, batch 15550, loss[loss=0.2007, simple_loss=0.2713, pruned_loss=0.06506, over 21763.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3186, pruned_loss=0.08088, over 4262625.43 frames. ], batch size: 124, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:18:48,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1740072.0, ans=0.2 2023-06-24 12:19:48,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1740252.0, ans=0.1 2023-06-24 12:19:55,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.86 vs. limit=6.0 2023-06-24 12:19:58,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.243e+02 5.866e+02 9.218e+02 1.616e+03 3.082e+03, threshold=1.844e+03, percent-clipped=8.0 2023-06-24 12:19:58,938 INFO [train.py:996] (2/4) Epoch 10, batch 15600, loss[loss=0.1957, simple_loss=0.2549, pruned_loss=0.06824, over 20761.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3112, pruned_loss=0.07904, over 4263716.72 frames. ], batch size: 609, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:20:36,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1740372.0, ans=0.0 2023-06-24 12:21:30,696 INFO [train.py:996] (2/4) Epoch 10, batch 15650, loss[loss=0.2262, simple_loss=0.2902, pruned_loss=0.08106, over 21657.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3093, pruned_loss=0.07865, over 4259057.68 frames. ], batch size: 316, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:21:34,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1740612.0, ans=0.125 2023-06-24 12:21:34,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1740612.0, ans=0.1 2023-06-24 12:22:07,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.78 vs. limit=10.0 2023-06-24 12:22:20,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1740732.0, ans=15.0 2023-06-24 12:22:47,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1740792.0, ans=0.125 2023-06-24 12:23:07,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 7.365e+02 1.096e+03 1.379e+03 2.536e+03, threshold=2.192e+03, percent-clipped=6.0 2023-06-24 12:23:07,117 INFO [train.py:996] (2/4) Epoch 10, batch 15700, loss[loss=0.2441, simple_loss=0.3105, pruned_loss=0.08884, over 21293.00 frames. 
], tot_loss[loss=0.2305, simple_loss=0.3054, pruned_loss=0.07781, over 4267517.58 frames. ], batch size: 471, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:23:07,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff2.min_abs, batch_count=1740912.0, ans=0.1 2023-06-24 12:23:41,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1740972.0, ans=0.1 2023-06-24 12:23:43,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1740972.0, ans=0.0 2023-06-24 12:23:52,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-24 12:24:43,538 INFO [train.py:996] (2/4) Epoch 10, batch 15750, loss[loss=0.2176, simple_loss=0.2847, pruned_loss=0.0752, over 21382.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3006, pruned_loss=0.07771, over 4268721.91 frames. ], batch size: 131, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:25:07,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1741272.0, ans=0.5 2023-06-24 12:25:16,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1741272.0, ans=0.125 2023-06-24 12:25:32,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1741332.0, ans=0.1 2023-06-24 12:25:39,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1741332.0, ans=0.125 2023-06-24 12:25:53,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1741392.0, ans=0.125 2023-06-24 12:26:13,772 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.443e+02 6.609e+02 9.142e+02 1.184e+03 2.398e+03, threshold=1.828e+03, percent-clipped=2.0 2023-06-24 12:26:13,793 INFO [train.py:996] (2/4) Epoch 10, batch 15800, loss[loss=0.2474, simple_loss=0.3118, pruned_loss=0.09154, over 21501.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2961, pruned_loss=0.07784, over 4270511.49 frames. ], batch size: 389, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:26:14,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1741512.0, ans=0.0 2023-06-24 12:26:36,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1741512.0, ans=0.1 2023-06-24 12:26:46,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1741572.0, ans=0.0 2023-06-24 12:26:55,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1741632.0, ans=0.1 2023-06-24 12:26:57,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1741632.0, ans=0.0 2023-06-24 12:27:32,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.55 vs. 
limit=15.0 2023-06-24 12:27:43,901 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 12:27:49,619 INFO [train.py:996] (2/4) Epoch 10, batch 15850, loss[loss=0.2635, simple_loss=0.334, pruned_loss=0.09653, over 21681.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3002, pruned_loss=0.08074, over 4265123.23 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:28:22,028 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.67 vs. limit=15.0 2023-06-24 12:28:31,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-24 12:28:34,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.82 vs. limit=12.0 2023-06-24 12:29:11,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1742052.0, ans=0.125 2023-06-24 12:29:14,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1742052.0, ans=0.0 2023-06-24 12:29:26,894 INFO [train.py:996] (2/4) Epoch 10, batch 15900, loss[loss=0.2316, simple_loss=0.2911, pruned_loss=0.08601, over 21802.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3003, pruned_loss=0.08131, over 4249361.34 frames. ], batch size: 352, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:29:28,405 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.526e+02 8.317e+02 1.237e+03 1.605e+03 4.098e+03, threshold=2.474e+03, percent-clipped=15.0 2023-06-24 12:29:32,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1742112.0, ans=0.1 2023-06-24 12:29:42,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1742112.0, ans=0.125 2023-06-24 12:29:57,280 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-24 12:30:07,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1742232.0, ans=0.125 2023-06-24 12:30:48,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.63 vs. limit=22.5 2023-06-24 12:31:01,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1742352.0, ans=0.1 2023-06-24 12:31:03,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1742412.0, ans=0.1 2023-06-24 12:31:04,684 INFO [train.py:996] (2/4) Epoch 10, batch 15950, loss[loss=0.2207, simple_loss=0.3503, pruned_loss=0.0455, over 19761.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3023, pruned_loss=0.07893, over 4238993.03 frames. 
], batch size: 702, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:31:11,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1742412.0, ans=0.0 2023-06-24 12:31:39,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1742472.0, ans=0.125 2023-06-24 12:31:59,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1742532.0, ans=0.1 2023-06-24 12:32:12,752 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.75 vs. limit=15.0 2023-06-24 12:32:43,155 INFO [train.py:996] (2/4) Epoch 10, batch 16000, loss[loss=0.1937, simple_loss=0.2721, pruned_loss=0.05762, over 21132.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3035, pruned_loss=0.07634, over 4254715.88 frames. ], batch size: 143, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:32:44,652 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 6.384e+02 8.996e+02 1.327e+03 2.604e+03, threshold=1.799e+03, percent-clipped=2.0 2023-06-24 12:32:59,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1742772.0, ans=0.125 2023-06-24 12:33:37,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1742832.0, ans=0.2 2023-06-24 12:34:15,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1742952.0, ans=0.125 2023-06-24 12:34:20,837 INFO [train.py:996] (2/4) Epoch 10, batch 16050, loss[loss=0.2225, simple_loss=0.2989, pruned_loss=0.07304, over 21352.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3044, pruned_loss=0.07445, over 4254264.73 frames. ], batch size: 144, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:35:10,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1743192.0, ans=0.125 2023-06-24 12:35:31,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.77 vs. limit=15.0 2023-06-24 12:35:51,303 INFO [train.py:996] (2/4) Epoch 10, batch 16100, loss[loss=0.1872, simple_loss=0.2448, pruned_loss=0.06483, over 20724.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3068, pruned_loss=0.0751, over 4262461.18 frames. ], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:35:54,430 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 6.034e+02 8.242e+02 1.333e+03 2.832e+03, threshold=1.648e+03, percent-clipped=8.0 2023-06-24 12:36:10,372 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-24 12:37:26,902 INFO [train.py:996] (2/4) Epoch 10, batch 16150, loss[loss=0.2452, simple_loss=0.3189, pruned_loss=0.08576, over 21934.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3073, pruned_loss=0.07858, over 4264904.32 frames. 
], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:37:27,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1743612.0, ans=0.1 2023-06-24 12:38:08,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-24 12:38:25,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1743792.0, ans=0.2 2023-06-24 12:39:05,193 INFO [train.py:996] (2/4) Epoch 10, batch 16200, loss[loss=0.3298, simple_loss=0.3836, pruned_loss=0.1379, over 21352.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.311, pruned_loss=0.08014, over 4271514.95 frames. ], batch size: 507, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:39:08,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.546e+02 7.202e+02 1.055e+03 1.408e+03 3.192e+03, threshold=2.110e+03, percent-clipped=15.0 2023-06-24 12:39:10,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1743912.0, ans=0.07 2023-06-24 12:40:19,868 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.91 vs. limit=22.5 2023-06-24 12:40:21,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1744152.0, ans=0.125 2023-06-24 12:40:28,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1744152.0, ans=0.125 2023-06-24 12:40:31,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1744152.0, ans=0.1 2023-06-24 12:40:37,332 INFO [train.py:996] (2/4) Epoch 10, batch 16250, loss[loss=0.2762, simple_loss=0.3446, pruned_loss=0.1039, over 21337.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3122, pruned_loss=0.08034, over 4275488.87 frames. ], batch size: 549, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:40:39,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1744212.0, ans=0.0 2023-06-24 12:41:06,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1744272.0, ans=0.125 2023-06-24 12:41:37,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1744392.0, ans=0.125 2023-06-24 12:41:39,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1744392.0, ans=15.0 2023-06-24 12:42:18,478 INFO [train.py:996] (2/4) Epoch 10, batch 16300, loss[loss=0.2281, simple_loss=0.325, pruned_loss=0.06562, over 21430.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3075, pruned_loss=0.07613, over 4259566.35 frames. 
], batch size: 471, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:42:27,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.249e+02 6.511e+02 9.054e+02 1.473e+03 4.161e+03, threshold=1.811e+03, percent-clipped=10.0 2023-06-24 12:42:53,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1744632.0, ans=0.0 2023-06-24 12:43:19,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1744692.0, ans=0.07 2023-06-24 12:43:51,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1744752.0, ans=0.125 2023-06-24 12:44:01,329 INFO [train.py:996] (2/4) Epoch 10, batch 16350, loss[loss=0.2851, simple_loss=0.358, pruned_loss=0.1061, over 21950.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3078, pruned_loss=0.07723, over 4272124.40 frames. ], batch size: 372, lr: 2.93e-03, grad_scale: 8.0 2023-06-24 12:44:09,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1744812.0, ans=0.1 2023-06-24 12:45:15,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1744992.0, ans=0.125 2023-06-24 12:45:38,319 INFO [train.py:996] (2/4) Epoch 10, batch 16400, loss[loss=0.2084, simple_loss=0.2773, pruned_loss=0.06981, over 21525.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3117, pruned_loss=0.07833, over 4271052.99 frames. ], batch size: 211, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:45:42,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.368e+02 7.745e+02 1.144e+03 1.661e+03 2.943e+03, threshold=2.288e+03, percent-clipped=22.0 2023-06-24 12:45:51,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1745112.0, ans=0.125 2023-06-24 12:47:16,239 INFO [train.py:996] (2/4) Epoch 10, batch 16450, loss[loss=0.2355, simple_loss=0.3011, pruned_loss=0.08492, over 20008.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3115, pruned_loss=0.07972, over 4286566.09 frames. ], batch size: 702, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:47:22,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1745412.0, ans=0.0 2023-06-24 12:48:49,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.60 vs. limit=6.0 2023-06-24 12:48:53,690 INFO [train.py:996] (2/4) Epoch 10, batch 16500, loss[loss=0.2241, simple_loss=0.3026, pruned_loss=0.07275, over 21779.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3096, pruned_loss=0.07946, over 4280945.81 frames. ], batch size: 332, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:48:58,419 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.573e+02 7.722e+02 1.056e+03 1.682e+03 4.861e+03, threshold=2.112e+03, percent-clipped=4.0 2023-06-24 12:49:04,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. 
limit=10.0 2023-06-24 12:49:28,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1745832.0, ans=0.025 2023-06-24 12:49:57,311 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.78 vs. limit=15.0 2023-06-24 12:50:28,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1745952.0, ans=0.1 2023-06-24 12:50:31,381 INFO [train.py:996] (2/4) Epoch 10, batch 16550, loss[loss=0.2197, simple_loss=0.2957, pruned_loss=0.07181, over 21779.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3051, pruned_loss=0.07708, over 4275843.06 frames. ], batch size: 247, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:50:38,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1746012.0, ans=0.125 2023-06-24 12:51:13,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1746072.0, ans=0.2 2023-06-24 12:51:32,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1746132.0, ans=0.2 2023-06-24 12:52:02,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1746252.0, ans=0.2 2023-06-24 12:52:02,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1746252.0, ans=0.035 2023-06-24 12:52:16,506 INFO [train.py:996] (2/4) Epoch 10, batch 16600, loss[loss=0.2936, simple_loss=0.3968, pruned_loss=0.09515, over 21662.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3134, pruned_loss=0.08027, over 4275274.86 frames. ], batch size: 441, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:52:21,401 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.341e+02 8.131e+02 1.242e+03 1.757e+03 3.477e+03, threshold=2.484e+03, percent-clipped=12.0 2023-06-24 12:52:23,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1746312.0, ans=0.125 2023-06-24 12:53:28,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1746492.0, ans=0.07 2023-06-24 12:54:00,370 INFO [train.py:996] (2/4) Epoch 10, batch 16650, loss[loss=0.2655, simple_loss=0.3416, pruned_loss=0.0947, over 21710.00 frames. ], tot_loss[loss=0.2434, simple_loss=0.3219, pruned_loss=0.08244, over 4280223.96 frames. ], batch size: 332, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:54:18,812 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.72 vs. limit=15.0 2023-06-24 12:54:41,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1746732.0, ans=0.09899494936611666 2023-06-24 12:55:44,441 INFO [train.py:996] (2/4) Epoch 10, batch 16700, loss[loss=0.2448, simple_loss=0.3371, pruned_loss=0.07629, over 21046.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3237, pruned_loss=0.08325, over 4277853.67 frames. 
], batch size: 607, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:55:49,381 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.755e+02 7.026e+02 1.004e+03 1.401e+03 2.239e+03, threshold=2.007e+03, percent-clipped=0.0 2023-06-24 12:55:50,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1746912.0, ans=0.125 2023-06-24 12:56:50,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1747092.0, ans=0.0 2023-06-24 12:57:26,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1747152.0, ans=0.1 2023-06-24 12:57:30,002 INFO [train.py:996] (2/4) Epoch 10, batch 16750, loss[loss=0.2808, simple_loss=0.3531, pruned_loss=0.1042, over 21800.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3258, pruned_loss=0.08516, over 4273565.57 frames. ], batch size: 124, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 12:57:59,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1747272.0, ans=0.0 2023-06-24 12:58:26,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1747332.0, ans=0.0 2023-06-24 12:59:08,083 INFO [train.py:996] (2/4) Epoch 10, batch 16800, loss[loss=0.2147, simple_loss=0.2934, pruned_loss=0.06798, over 21851.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3287, pruned_loss=0.08489, over 4266434.00 frames. ], batch size: 332, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 12:59:12,598 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.402e+02 6.959e+02 1.077e+03 1.675e+03 3.931e+03, threshold=2.154e+03, percent-clipped=17.0 2023-06-24 12:59:33,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1747572.0, ans=0.04949747468305833 2023-06-24 12:59:48,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1747572.0, ans=0.125 2023-06-24 12:59:49,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1747572.0, ans=0.125 2023-06-24 13:00:01,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.00 vs. limit=15.0 2023-06-24 13:00:07,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-24 13:00:27,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1747752.0, ans=0.07 2023-06-24 13:00:43,537 INFO [train.py:996] (2/4) Epoch 10, batch 16850, loss[loss=0.2573, simple_loss=0.3177, pruned_loss=0.09851, over 21562.00 frames. ], tot_loss[loss=0.248, simple_loss=0.3251, pruned_loss=0.0855, over 4273906.01 frames. 
], batch size: 548, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:00:48,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1747812.0, ans=0.0 2023-06-24 13:00:51,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1747812.0, ans=0.2 2023-06-24 13:01:00,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1747812.0, ans=0.125 2023-06-24 13:01:49,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1747992.0, ans=0.2 2023-06-24 13:01:50,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_ff2.min_abs, batch_count=1747992.0, ans=0.1 2023-06-24 13:01:53,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1747992.0, ans=0.125 2023-06-24 13:02:11,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1748052.0, ans=0.125 2023-06-24 13:02:11,119 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:02:19,538 INFO [train.py:996] (2/4) Epoch 10, batch 16900, loss[loss=0.2203, simple_loss=0.2831, pruned_loss=0.0788, over 21243.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3201, pruned_loss=0.08351, over 4272438.08 frames. ], batch size: 143, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:02:30,795 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.364e+02 7.003e+02 1.142e+03 1.621e+03 3.220e+03, threshold=2.284e+03, percent-clipped=11.0 2023-06-24 13:02:32,743 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:02:40,756 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:03:01,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1748172.0, ans=0.1 2023-06-24 13:03:08,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1748232.0, ans=0.0 2023-06-24 13:03:56,465 INFO [train.py:996] (2/4) Epoch 10, batch 16950, loss[loss=0.2145, simple_loss=0.2868, pruned_loss=0.07111, over 21667.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3151, pruned_loss=0.08241, over 4267417.59 frames. 
], batch size: 263, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:04:18,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1748472.0, ans=0.2 2023-06-24 13:04:20,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1748472.0, ans=0.0 2023-06-24 13:04:49,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1748532.0, ans=0.2 2023-06-24 13:05:09,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1748592.0, ans=0.125 2023-06-24 13:05:22,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1748652.0, ans=0.1 2023-06-24 13:05:32,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1748712.0, ans=0.125 2023-06-24 13:05:33,930 INFO [train.py:996] (2/4) Epoch 10, batch 17000, loss[loss=0.2491, simple_loss=0.3084, pruned_loss=0.09492, over 21581.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3117, pruned_loss=0.08254, over 4277254.43 frames. ], batch size: 212, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:05:44,492 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.290e+02 6.732e+02 9.381e+02 1.306e+03 2.679e+03, threshold=1.876e+03, percent-clipped=4.0 2023-06-24 13:05:59,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1748772.0, ans=0.1 2023-06-24 13:06:20,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1748832.0, ans=0.0 2023-06-24 13:06:35,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1748832.0, ans=0.0 2023-06-24 13:06:40,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1748892.0, ans=0.0 2023-06-24 13:07:08,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1748952.0, ans=0.035 2023-06-24 13:07:15,577 INFO [train.py:996] (2/4) Epoch 10, batch 17050, loss[loss=0.2374, simple_loss=0.324, pruned_loss=0.07536, over 21813.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3199, pruned_loss=0.08458, over 4281064.37 frames. ], batch size: 351, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:07:20,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1749012.0, ans=0.0 2023-06-24 13:07:46,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1749072.0, ans=0.125 2023-06-24 13:08:01,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.84 vs. limit=15.0 2023-06-24 13:08:16,748 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=15.0 2023-06-24 13:08:32,998 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:08:46,012 INFO [train.py:996] (2/4) Epoch 10, batch 17100, loss[loss=0.239, simple_loss=0.3061, pruned_loss=0.08598, over 21457.00 frames. 
], tot_loss[loss=0.2446, simple_loss=0.319, pruned_loss=0.08515, over 4281703.24 frames. ], batch size: 131, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:08:56,760 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.820e+02 8.115e+02 1.148e+03 1.810e+03 4.142e+03, threshold=2.296e+03, percent-clipped=21.0 2023-06-24 13:08:58,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1749312.0, ans=0.0 2023-06-24 13:09:59,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1749492.0, ans=0.125 2023-06-24 13:10:26,340 INFO [train.py:996] (2/4) Epoch 10, batch 17150, loss[loss=0.2012, simple_loss=0.2786, pruned_loss=0.06187, over 21814.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3149, pruned_loss=0.08483, over 4278396.35 frames. ], batch size: 118, lr: 2.93e-03, grad_scale: 16.0 2023-06-24 13:10:27,398 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.57 vs. limit=5.0 2023-06-24 13:11:51,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=6.73 vs. limit=15.0 2023-06-24 13:11:53,172 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.06 vs. limit=15.0 2023-06-24 13:12:07,582 INFO [train.py:996] (2/4) Epoch 10, batch 17200, loss[loss=0.2493, simple_loss=0.3188, pruned_loss=0.08986, over 21774.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3152, pruned_loss=0.08523, over 4281138.46 frames. ], batch size: 247, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 13:12:18,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.490e+02 5.928e+02 7.581e+02 1.081e+03 2.493e+03, threshold=1.516e+03, percent-clipped=2.0 2023-06-24 13:12:44,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1749972.0, ans=0.125 2023-06-24 13:13:52,377 INFO [train.py:996] (2/4) Epoch 10, batch 17250, loss[loss=0.2417, simple_loss=0.3264, pruned_loss=0.07847, over 21596.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3177, pruned_loss=0.0869, over 4283311.26 frames. ], batch size: 389, lr: 2.93e-03, grad_scale: 32.0 2023-06-24 13:13:57,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1750212.0, ans=0.1 2023-06-24 13:14:27,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-24 13:14:33,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1750332.0, ans=0.0 2023-06-24 13:15:34,680 INFO [train.py:996] (2/4) Epoch 10, batch 17300, loss[loss=0.2852, simple_loss=0.3498, pruned_loss=0.1103, over 21315.00 frames. ], tot_loss[loss=0.251, simple_loss=0.3237, pruned_loss=0.08911, over 4280870.66 frames. 
], batch size: 549, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:15:35,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1750512.0, ans=0.0 2023-06-24 13:15:41,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1750512.0, ans=0.125 2023-06-24 13:15:42,400 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.635e+02 9.609e+02 1.379e+03 2.737e+03, threshold=1.922e+03, percent-clipped=17.0 2023-06-24 13:17:04,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1750752.0, ans=0.125 2023-06-24 13:17:11,816 INFO [train.py:996] (2/4) Epoch 10, batch 17350, loss[loss=0.2041, simple_loss=0.2938, pruned_loss=0.05723, over 21741.00 frames. ], tot_loss[loss=0.2512, simple_loss=0.3251, pruned_loss=0.08865, over 4277089.82 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:17:41,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1750872.0, ans=0.0 2023-06-24 13:17:46,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1750932.0, ans=0.1 2023-06-24 13:17:54,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1750932.0, ans=0.1 2023-06-24 13:17:55,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1750932.0, ans=0.2 2023-06-24 13:18:31,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1750992.0, ans=0.125 2023-06-24 13:18:36,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1751052.0, ans=0.125 2023-06-24 13:18:39,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1751052.0, ans=0.2 2023-06-24 13:18:49,842 INFO [train.py:996] (2/4) Epoch 10, batch 17400, loss[loss=0.2726, simple_loss=0.3595, pruned_loss=0.09281, over 21454.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3215, pruned_loss=0.08438, over 4278921.21 frames. ], batch size: 471, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:18:53,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1751112.0, ans=0.125 2023-06-24 13:18:57,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.089e+02 6.083e+02 9.753e+02 1.322e+03 2.899e+03, threshold=1.951e+03, percent-clipped=8.0 2023-06-24 13:19:28,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1751232.0, ans=0.125 2023-06-24 13:19:58,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.24 vs. 
limit=15.0 2023-06-24 13:20:03,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1751292.0, ans=0.125 2023-06-24 13:20:11,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1751352.0, ans=0.125 2023-06-24 13:20:20,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1751352.0, ans=0.125 2023-06-24 13:20:26,378 INFO [train.py:996] (2/4) Epoch 10, batch 17450, loss[loss=0.2173, simple_loss=0.3126, pruned_loss=0.06096, over 21872.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3183, pruned_loss=0.08176, over 4278026.66 frames. ], batch size: 373, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:20:45,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-24 13:20:47,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.81 vs. limit=15.0 2023-06-24 13:21:20,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1751532.0, ans=0.125 2023-06-24 13:21:44,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=15.0 2023-06-24 13:22:02,696 INFO [train.py:996] (2/4) Epoch 10, batch 17500, loss[loss=0.2406, simple_loss=0.3141, pruned_loss=0.08359, over 21853.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.314, pruned_loss=0.08022, over 4281854.65 frames. ], batch size: 124, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:22:16,402 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.845e+02 5.907e+02 8.163e+02 1.225e+03 3.069e+03, threshold=1.633e+03, percent-clipped=7.0 2023-06-24 13:22:36,576 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:23:03,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1751832.0, ans=0.0 2023-06-24 13:23:39,105 INFO [train.py:996] (2/4) Epoch 10, batch 17550, loss[loss=0.2233, simple_loss=0.3131, pruned_loss=0.06672, over 21698.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3135, pruned_loss=0.07905, over 4269731.13 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:24:02,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752072.0, ans=0.1 2023-06-24 13:24:08,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1752072.0, ans=0.2 2023-06-24 13:24:38,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1752192.0, ans=0.125 2023-06-24 13:25:10,556 INFO [train.py:996] (2/4) Epoch 10, batch 17600, loss[loss=0.264, simple_loss=0.3226, pruned_loss=0.1027, over 21617.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3164, pruned_loss=0.08018, over 4267177.88 frames. 
], batch size: 263, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:25:24,365 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.544e+02 6.468e+02 8.117e+02 1.176e+03 4.887e+03, threshold=1.623e+03, percent-clipped=13.0 2023-06-24 13:25:50,172 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-24 13:26:03,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1752432.0, ans=0.125 2023-06-24 13:26:14,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.79 vs. limit=15.0 2023-06-24 13:26:28,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1752492.0, ans=0.0 2023-06-24 13:26:51,771 INFO [train.py:996] (2/4) Epoch 10, batch 17650, loss[loss=0.2542, simple_loss=0.3322, pruned_loss=0.08812, over 21278.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3144, pruned_loss=0.07956, over 4261529.69 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:27:03,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1752612.0, ans=0.0 2023-06-24 13:27:20,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1752672.0, ans=0.0 2023-06-24 13:27:43,006 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.22 vs. limit=22.5 2023-06-24 13:28:21,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1752852.0, ans=0.1 2023-06-24 13:28:33,511 INFO [train.py:996] (2/4) Epoch 10, batch 17700, loss[loss=0.2535, simple_loss=0.3642, pruned_loss=0.07143, over 20794.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3114, pruned_loss=0.07748, over 4257040.46 frames. ], batch size: 608, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:28:48,055 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.630e+02 6.330e+02 1.013e+03 1.605e+03 3.260e+03, threshold=2.027e+03, percent-clipped=24.0 2023-06-24 13:30:16,871 INFO [train.py:996] (2/4) Epoch 10, batch 17750, loss[loss=0.1943, simple_loss=0.3029, pruned_loss=0.04289, over 20651.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3196, pruned_loss=0.08164, over 4260958.36 frames. ], batch size: 608, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:30:31,924 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:30:32,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. 
limit=10.0 2023-06-24 13:31:17,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1753392.0, ans=0.2 2023-06-24 13:31:42,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1753452.0, ans=0.0 2023-06-24 13:31:48,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1753452.0, ans=0.125 2023-06-24 13:31:51,453 INFO [train.py:996] (2/4) Epoch 10, batch 17800, loss[loss=0.234, simple_loss=0.3052, pruned_loss=0.0814, over 21587.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3185, pruned_loss=0.08077, over 4264109.91 frames. ], batch size: 230, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:31:51,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1753512.0, ans=0.0 2023-06-24 13:32:07,439 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.377e+02 6.218e+02 8.448e+02 1.392e+03 2.915e+03, threshold=1.690e+03, percent-clipped=12.0 2023-06-24 13:33:29,914 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:33:34,293 INFO [train.py:996] (2/4) Epoch 10, batch 17850, loss[loss=0.2541, simple_loss=0.3258, pruned_loss=0.09116, over 21839.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3193, pruned_loss=0.08152, over 4271897.36 frames. ], batch size: 124, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:33:34,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1753812.0, ans=0.0 2023-06-24 13:33:47,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1753812.0, ans=0.1 2023-06-24 13:34:41,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1753992.0, ans=0.0 2023-06-24 13:35:06,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1754052.0, ans=0.125 2023-06-24 13:35:12,039 INFO [train.py:996] (2/4) Epoch 10, batch 17900, loss[loss=0.2977, simple_loss=0.3878, pruned_loss=0.1038, over 21696.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3241, pruned_loss=0.08253, over 4266938.90 frames. ], batch size: 441, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:35:13,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1754112.0, ans=0.125 2023-06-24 13:35:16,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.00 vs. 
limit=15.0 2023-06-24 13:35:23,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.491e+02 6.124e+02 9.329e+02 1.248e+03 3.216e+03, threshold=1.866e+03, percent-clipped=9.0 2023-06-24 13:35:30,531 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:35:31,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1754172.0, ans=0.0 2023-06-24 13:35:56,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1754232.0, ans=0.0 2023-06-24 13:35:56,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1754232.0, ans=0.125 2023-06-24 13:36:46,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1754352.0, ans=0.125 2023-06-24 13:36:52,228 INFO [train.py:996] (2/4) Epoch 10, batch 17950, loss[loss=0.1954, simple_loss=0.2814, pruned_loss=0.05475, over 21360.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3237, pruned_loss=0.07931, over 4264539.36 frames. ], batch size: 211, lr: 2.92e-03, grad_scale: 8.0 2023-06-24 13:38:11,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1754652.0, ans=0.125 2023-06-24 13:38:23,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=1754652.0, ans=0.05 2023-06-24 13:38:28,324 INFO [train.py:996] (2/4) Epoch 10, batch 18000, loss[loss=0.223, simple_loss=0.2761, pruned_loss=0.08496, over 21516.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3164, pruned_loss=0.07799, over 4262732.32 frames. ], batch size: 212, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:38:28,324 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 13:38:47,233 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2575, simple_loss=0.3533, pruned_loss=0.08085, over 1796401.00 frames. 2023-06-24 13:38:47,233 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 13:39:02,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.082e+02 8.313e+02 1.378e+03 2.030e+03 3.547e+03, threshold=2.755e+03, percent-clipped=28.0 2023-06-24 13:39:39,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1754832.0, ans=0.2 2023-06-24 13:39:42,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1754892.0, ans=0.2 2023-06-24 13:39:54,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1754892.0, ans=0.125 2023-06-24 13:39:58,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1754952.0, ans=0.0 2023-06-24 13:40:01,418 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.88 vs. 
limit=10.0 2023-06-24 13:40:03,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1754952.0, ans=0.2 2023-06-24 13:40:19,181 INFO [train.py:996] (2/4) Epoch 10, batch 18050, loss[loss=0.2156, simple_loss=0.2917, pruned_loss=0.06974, over 21490.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3101, pruned_loss=0.07707, over 4254929.37 frames. ], batch size: 389, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:40:26,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1755012.0, ans=0.125 2023-06-24 13:40:45,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.12 vs. limit=22.5 2023-06-24 13:40:50,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1755072.0, ans=0.125 2023-06-24 13:41:09,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755132.0, ans=0.1 2023-06-24 13:42:03,169 INFO [train.py:996] (2/4) Epoch 10, batch 18100, loss[loss=0.2708, simple_loss=0.3323, pruned_loss=0.1047, over 21820.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.314, pruned_loss=0.07978, over 4264908.88 frames. ], batch size: 102, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:42:18,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.03 vs. limit=15.0 2023-06-24 13:42:19,035 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.442e+02 6.227e+02 8.455e+02 1.236e+03 2.629e+03, threshold=1.691e+03, percent-clipped=0.0 2023-06-24 13:42:25,891 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:42:42,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-24 13:42:52,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1755432.0, ans=0.125 2023-06-24 13:42:58,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1755492.0, ans=0.0 2023-06-24 13:43:06,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1755492.0, ans=0.05 2023-06-24 13:43:44,538 INFO [train.py:996] (2/4) Epoch 10, batch 18150, loss[loss=0.2329, simple_loss=0.2902, pruned_loss=0.08776, over 21250.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3152, pruned_loss=0.07863, over 4261569.81 frames. 
], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:43:49,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1755612.0, ans=0.125 2023-06-24 13:43:53,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1755612.0, ans=0.04949747468305833 2023-06-24 13:44:20,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1755732.0, ans=0.125 2023-06-24 13:44:25,937 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.25 vs. limit=22.5 2023-06-24 13:44:26,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1755732.0, ans=0.125 2023-06-24 13:44:37,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1755792.0, ans=0.1 2023-06-24 13:44:37,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1755792.0, ans=0.1 2023-06-24 13:44:58,071 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.23 vs. limit=15.0 2023-06-24 13:45:10,521 INFO [train.py:996] (2/4) Epoch 10, batch 18200, loss[loss=0.2141, simple_loss=0.2837, pruned_loss=0.07224, over 21298.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3095, pruned_loss=0.0792, over 4264392.76 frames. ], batch size: 551, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:45:17,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-24 13:45:29,184 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:45:30,450 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.409e+02 6.816e+02 9.910e+02 1.570e+03 3.771e+03, threshold=1.982e+03, percent-clipped=24.0 2023-06-24 13:45:47,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1755972.0, ans=0.125 2023-06-24 13:46:06,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1756032.0, ans=0.0 2023-06-24 13:46:26,567 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:46:41,620 INFO [train.py:996] (2/4) Epoch 10, batch 18250, loss[loss=0.1887, simple_loss=0.2566, pruned_loss=0.06043, over 21821.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3011, pruned_loss=0.07654, over 4269004.72 frames. ], batch size: 102, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:47:14,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1756272.0, ans=0.0 2023-06-24 13:47:15,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.09 vs. limit=12.0 2023-06-24 13:47:27,517 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.09 vs. 
limit=6.0 2023-06-24 13:48:12,410 INFO [train.py:996] (2/4) Epoch 10, batch 18300, loss[loss=0.24, simple_loss=0.3441, pruned_loss=0.06794, over 21806.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.301, pruned_loss=0.07604, over 4276912.45 frames. ], batch size: 351, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:48:20,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1756512.0, ans=0.0 2023-06-24 13:48:23,312 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.671e+02 6.079e+02 7.788e+02 1.352e+03 4.344e+03, threshold=1.558e+03, percent-clipped=12.0 2023-06-24 13:49:05,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1756632.0, ans=0.125 2023-06-24 13:49:49,370 INFO [train.py:996] (2/4) Epoch 10, batch 18350, loss[loss=0.2047, simple_loss=0.2742, pruned_loss=0.06758, over 21191.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3065, pruned_loss=0.07598, over 4264183.93 frames. ], batch size: 176, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:49:51,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1756812.0, ans=0.2 2023-06-24 13:50:22,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-24 13:51:15,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1757052.0, ans=0.125 2023-06-24 13:51:27,720 INFO [train.py:996] (2/4) Epoch 10, batch 18400, loss[loss=0.2208, simple_loss=0.2828, pruned_loss=0.07936, over 20219.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.303, pruned_loss=0.07535, over 4260088.97 frames. ], batch size: 703, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 13:51:28,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.35 vs. limit=12.0 2023-06-24 13:51:43,852 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.931e+02 6.378e+02 8.856e+02 1.210e+03 2.743e+03, threshold=1.771e+03, percent-clipped=10.0 2023-06-24 13:52:59,693 INFO [train.py:996] (2/4) Epoch 10, batch 18450, loss[loss=0.1953, simple_loss=0.2798, pruned_loss=0.05541, over 21663.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2982, pruned_loss=0.07178, over 4260640.60 frames. 
], batch size: 247, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 13:53:17,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1757412.0, ans=0.125 2023-06-24 13:53:42,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1757532.0, ans=0.1 2023-06-24 13:54:04,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1757592.0, ans=0.5 2023-06-24 13:54:22,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1757652.0, ans=0.2 2023-06-24 13:54:23,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1757652.0, ans=0.125 2023-06-24 13:54:35,861 INFO [train.py:996] (2/4) Epoch 10, batch 18500, loss[loss=0.2258, simple_loss=0.3117, pruned_loss=0.06998, over 21390.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2926, pruned_loss=0.07055, over 4259846.05 frames. ], batch size: 471, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:54:56,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1757712.0, ans=0.0 2023-06-24 13:54:57,474 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.156e+02 5.768e+02 9.363e+02 1.391e+03 2.603e+03, threshold=1.873e+03, percent-clipped=9.0 2023-06-24 13:55:13,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1757772.0, ans=0.0 2023-06-24 13:56:12,629 INFO [train.py:996] (2/4) Epoch 10, batch 18550, loss[loss=0.2222, simple_loss=0.2793, pruned_loss=0.08253, over 21313.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.2912, pruned_loss=0.07056, over 4251886.32 frames. ], batch size: 177, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:56:20,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1758012.0, ans=0.125 2023-06-24 13:56:47,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1758072.0, ans=0.1 2023-06-24 13:57:16,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1758192.0, ans=0.0 2023-06-24 13:57:17,000 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 13:57:49,279 INFO [train.py:996] (2/4) Epoch 10, batch 18600, loss[loss=0.2049, simple_loss=0.2743, pruned_loss=0.0678, over 21206.00 frames. ], tot_loss[loss=0.2153, simple_loss=0.2893, pruned_loss=0.07071, over 4250675.63 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:58:12,238 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.175e+02 6.384e+02 9.662e+02 1.486e+03 4.666e+03, threshold=1.932e+03, percent-clipped=18.0 2023-06-24 13:58:17,174 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1758372.0, ans=0.125 2023-06-24 13:58:20,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.49 vs. 
limit=12.0 2023-06-24 13:58:44,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_na.min_abs, batch_count=1758432.0, ans=0.02 2023-06-24 13:59:25,722 INFO [train.py:996] (2/4) Epoch 10, batch 18650, loss[loss=0.223, simple_loss=0.2796, pruned_loss=0.08321, over 21218.00 frames. ], tot_loss[loss=0.217, simple_loss=0.2908, pruned_loss=0.07158, over 4249596.90 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 13:59:27,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1758612.0, ans=0.125 2023-06-24 13:59:29,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1758612.0, ans=0.1 2023-06-24 14:00:55,938 INFO [train.py:996] (2/4) Epoch 10, batch 18700, loss[loss=0.2372, simple_loss=0.3047, pruned_loss=0.08483, over 21373.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2888, pruned_loss=0.07269, over 4249596.58 frames. ], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:01:12,549 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.606e+02 6.882e+02 9.841e+02 1.662e+03 3.485e+03, threshold=1.968e+03, percent-clipped=16.0 2023-06-24 14:01:44,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1759032.0, ans=0.125 2023-06-24 14:02:00,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1759092.0, ans=0.125 2023-06-24 14:02:03,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1759092.0, ans=10.0 2023-06-24 14:02:28,405 INFO [train.py:996] (2/4) Epoch 10, batch 18750, loss[loss=0.2212, simple_loss=0.2784, pruned_loss=0.08201, over 21610.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2922, pruned_loss=0.076, over 4259169.13 frames. ], batch size: 212, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:03:10,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1759332.0, ans=0.125 2023-06-24 14:03:38,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-24 14:03:48,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1759452.0, ans=0.2 2023-06-24 14:04:00,662 INFO [train.py:996] (2/4) Epoch 10, batch 18800, loss[loss=0.161, simple_loss=0.2365, pruned_loss=0.04272, over 21732.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2989, pruned_loss=0.07686, over 4269575.50 frames. ], batch size: 112, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:04:16,691 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1759512.0, ans=0.125 2023-06-24 14:04:22,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.381e+02 6.431e+02 9.700e+02 1.560e+03 3.348e+03, threshold=1.940e+03, percent-clipped=15.0 2023-06-24 14:05:13,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1759692.0, ans=0.125 2023-06-24 14:05:31,410 INFO [train.py:996] (2/4) Epoch 10, batch 18850, loss[loss=0.2305, simple_loss=0.3034, pruned_loss=0.0788, over 21835.00 frames. 
], tot_loss[loss=0.2197, simple_loss=0.295, pruned_loss=0.07222, over 4271606.86 frames. ], batch size: 102, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:05:36,721 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1759812.0, ans=0.1 2023-06-24 14:06:04,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1759872.0, ans=0.125 2023-06-24 14:06:12,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1759872.0, ans=0.1 2023-06-24 14:06:49,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1759992.0, ans=0.125 2023-06-24 14:06:50,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1760052.0, ans=0.04949747468305833 2023-06-24 14:06:57,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1760052.0, ans=0.125 2023-06-24 14:06:58,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1760052.0, ans=0.0 2023-06-24 14:07:07,595 INFO [train.py:996] (2/4) Epoch 10, batch 18900, loss[loss=0.251, simple_loss=0.3057, pruned_loss=0.09814, over 21518.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2917, pruned_loss=0.07327, over 4262405.82 frames. ], batch size: 473, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:07:09,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1760112.0, ans=0.125 2023-06-24 14:07:24,191 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.033e+02 6.602e+02 8.408e+02 1.056e+03 2.556e+03, threshold=1.682e+03, percent-clipped=3.0 2023-06-24 14:08:44,284 INFO [train.py:996] (2/4) Epoch 10, batch 18950, loss[loss=0.2415, simple_loss=0.3053, pruned_loss=0.08882, over 21684.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2934, pruned_loss=0.07562, over 4258016.63 frames. ], batch size: 263, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:08:58,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1760412.0, ans=0.07 2023-06-24 14:10:02,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1760592.0, ans=0.0 2023-06-24 14:10:19,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1760652.0, ans=0.125 2023-06-24 14:10:26,137 INFO [train.py:996] (2/4) Epoch 10, batch 19000, loss[loss=0.263, simple_loss=0.3335, pruned_loss=0.09625, over 21261.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3033, pruned_loss=0.07704, over 4255119.32 frames. 
], batch size: 159, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:10:43,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1760712.0, ans=0.2 2023-06-24 14:10:49,349 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.433e+02 8.812e+02 1.283e+03 1.934e+03 4.893e+03, threshold=2.566e+03, percent-clipped=32.0 2023-06-24 14:11:21,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1760832.0, ans=0.2 2023-06-24 14:11:21,854 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:11:27,369 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.26 vs. limit=22.5 2023-06-24 14:11:38,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1760892.0, ans=0.0 2023-06-24 14:11:40,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1760952.0, ans=0.0 2023-06-24 14:11:55,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.91 vs. limit=15.0 2023-06-24 14:12:02,049 INFO [train.py:996] (2/4) Epoch 10, batch 19050, loss[loss=0.2527, simple_loss=0.3273, pruned_loss=0.08902, over 21845.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3089, pruned_loss=0.08103, over 4266869.02 frames. ], batch size: 351, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:13:01,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1761192.0, ans=0.1 2023-06-24 14:13:06,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1761192.0, ans=0.125 2023-06-24 14:13:15,048 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:13:42,736 INFO [train.py:996] (2/4) Epoch 10, batch 19100, loss[loss=0.2041, simple_loss=0.2645, pruned_loss=0.07183, over 21618.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3077, pruned_loss=0.08103, over 4272493.13 frames. ], batch size: 247, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:14:01,621 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.691e+02 6.507e+02 8.205e+02 1.177e+03 2.298e+03, threshold=1.641e+03, percent-clipped=0.0 2023-06-24 14:14:11,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1761372.0, ans=0.125 2023-06-24 14:14:19,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1761432.0, ans=0.125 2023-06-24 14:14:43,872 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.51 vs. limit=15.0 2023-06-24 14:15:13,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1761552.0, ans=0.0 2023-06-24 14:15:25,865 INFO [train.py:996] (2/4) Epoch 10, batch 19150, loss[loss=0.2301, simple_loss=0.2922, pruned_loss=0.08395, over 21165.00 frames. 
], tot_loss[loss=0.2354, simple_loss=0.3082, pruned_loss=0.08135, over 4276620.40 frames. ], batch size: 608, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:15:28,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1761612.0, ans=0.125 2023-06-24 14:16:04,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1761732.0, ans=15.0 2023-06-24 14:16:32,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1761792.0, ans=0.0 2023-06-24 14:17:06,218 INFO [train.py:996] (2/4) Epoch 10, batch 19200, loss[loss=0.2281, simple_loss=0.3302, pruned_loss=0.06302, over 21117.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3192, pruned_loss=0.08211, over 4280533.96 frames. ], batch size: 143, lr: 2.92e-03, grad_scale: 32.0 2023-06-24 14:17:21,999 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.091e+02 1.026e+03 1.602e+03 3.229e+03, threshold=2.053e+03, percent-clipped=24.0 2023-06-24 14:17:48,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.36 vs. limit=15.0 2023-06-24 14:17:52,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1762032.0, ans=0.125 2023-06-24 14:18:06,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1762092.0, ans=0.04949747468305833 2023-06-24 14:18:07,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1762092.0, ans=0.0 2023-06-24 14:18:14,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1762092.0, ans=0.125 2023-06-24 14:18:29,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1762152.0, ans=0.0 2023-06-24 14:18:36,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.37 vs. limit=15.0 2023-06-24 14:18:44,948 INFO [train.py:996] (2/4) Epoch 10, batch 19250, loss[loss=0.2345, simple_loss=0.3138, pruned_loss=0.07762, over 21615.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3172, pruned_loss=0.07694, over 4282865.35 frames. ], batch size: 471, lr: 2.92e-03, grad_scale: 16.0 2023-06-24 14:19:24,311 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.78 vs. limit=12.0 2023-06-24 14:20:01,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1762392.0, ans=0.125 2023-06-24 14:20:13,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1762452.0, ans=0.1 2023-06-24 14:20:18,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1762452.0, ans=0.0 2023-06-24 14:20:20,862 INFO [train.py:996] (2/4) Epoch 10, batch 19300, loss[loss=0.1866, simple_loss=0.3001, pruned_loss=0.03652, over 21277.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3138, pruned_loss=0.07637, over 4285110.08 frames. 
], batch size: 548, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:20:26,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1762512.0, ans=0.1 2023-06-24 14:20:36,655 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.942e+02 6.247e+02 8.864e+02 1.177e+03 3.202e+03, threshold=1.773e+03, percent-clipped=6.0 2023-06-24 14:21:22,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1762692.0, ans=0.125 2023-06-24 14:21:52,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1762752.0, ans=0.125 2023-06-24 14:22:00,311 INFO [train.py:996] (2/4) Epoch 10, batch 19350, loss[loss=0.1796, simple_loss=0.2672, pruned_loss=0.04606, over 21524.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3097, pruned_loss=0.07409, over 4274418.06 frames. ], batch size: 212, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:23:18,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1762992.0, ans=0.125 2023-06-24 14:23:35,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.49 vs. limit=15.0 2023-06-24 14:23:36,211 INFO [train.py:996] (2/4) Epoch 10, batch 19400, loss[loss=0.1973, simple_loss=0.2731, pruned_loss=0.06071, over 21277.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.309, pruned_loss=0.07312, over 4280114.90 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:23:38,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1763112.0, ans=0.125 2023-06-24 14:23:58,113 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.962e+02 7.006e+02 1.136e+03 1.736e+03 4.231e+03, threshold=2.271e+03, percent-clipped=24.0 2023-06-24 14:23:59,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.74 vs. limit=22.5 2023-06-24 14:24:11,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1763172.0, ans=0.1 2023-06-24 14:25:11,446 INFO [train.py:996] (2/4) Epoch 10, batch 19450, loss[loss=0.2131, simple_loss=0.2699, pruned_loss=0.07818, over 21418.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3058, pruned_loss=0.07441, over 4282143.76 frames. 
], batch size: 195, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:25:11,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1763412.0, ans=0.0 2023-06-24 14:25:13,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1763412.0, ans=0.2 2023-06-24 14:25:16,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1763412.0, ans=0.0 2023-06-24 14:25:27,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1763472.0, ans=0.2 2023-06-24 14:25:35,828 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:26:12,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.79 vs. limit=15.0 2023-06-24 14:26:27,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1763592.0, ans=0.125 2023-06-24 14:26:48,807 INFO [train.py:996] (2/4) Epoch 10, batch 19500, loss[loss=0.2403, simple_loss=0.3191, pruned_loss=0.08072, over 21623.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3021, pruned_loss=0.0753, over 4271643.12 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:27:11,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.173e+02 6.380e+02 1.047e+03 1.511e+03 3.799e+03, threshold=2.095e+03, percent-clipped=7.0 2023-06-24 14:27:27,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1763772.0, ans=0.2 2023-06-24 14:27:58,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1763892.0, ans=0.07 2023-06-24 14:28:04,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1763892.0, ans=0.125 2023-06-24 14:28:19,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1763952.0, ans=0.1 2023-06-24 14:28:25,359 INFO [train.py:996] (2/4) Epoch 10, batch 19550, loss[loss=0.1245, simple_loss=0.1696, pruned_loss=0.03975, over 17105.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2963, pruned_loss=0.07271, over 4270809.64 frames. ], batch size: 62, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 14:29:26,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.25 vs. limit=15.0 2023-06-24 14:30:01,057 INFO [train.py:996] (2/4) Epoch 10, batch 19600, loss[loss=0.256, simple_loss=0.3214, pruned_loss=0.09533, over 21911.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3, pruned_loss=0.07464, over 4276312.26 frames. ], batch size: 283, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:30:05,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.07 vs. 
limit=15.0 2023-06-24 14:30:06,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1764312.0, ans=0.2 2023-06-24 14:30:28,054 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.111e+02 6.531e+02 1.025e+03 1.412e+03 3.718e+03, threshold=2.049e+03, percent-clipped=12.0 2023-06-24 14:31:18,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1764492.0, ans=0.0 2023-06-24 14:31:19,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1764492.0, ans=0.0 2023-06-24 14:31:23,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.95 vs. limit=12.0 2023-06-24 14:31:37,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1764612.0, ans=0.0 2023-06-24 14:31:38,432 INFO [train.py:996] (2/4) Epoch 10, batch 19650, loss[loss=0.2355, simple_loss=0.3042, pruned_loss=0.08345, over 21349.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3041, pruned_loss=0.07874, over 4282425.68 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:32:13,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1764672.0, ans=0.1 2023-06-24 14:33:15,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1764852.0, ans=0.0 2023-06-24 14:33:27,992 INFO [train.py:996] (2/4) Epoch 10, batch 19700, loss[loss=0.2093, simple_loss=0.2832, pruned_loss=0.0677, over 21433.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3062, pruned_loss=0.07826, over 4274728.08 frames. ], batch size: 195, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:33:30,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1764912.0, ans=0.125 2023-06-24 14:33:54,877 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 9.383e+02 1.272e+03 2.018e+03 4.455e+03, threshold=2.544e+03, percent-clipped=24.0 2023-06-24 14:34:30,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-24 14:35:08,166 INFO [train.py:996] (2/4) Epoch 10, batch 19750, loss[loss=0.213, simple_loss=0.2976, pruned_loss=0.06424, over 21305.00 frames. ], tot_loss[loss=0.237, simple_loss=0.315, pruned_loss=0.07947, over 4261076.52 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:36:49,173 INFO [train.py:996] (2/4) Epoch 10, batch 19800, loss[loss=0.2297, simple_loss=0.3105, pruned_loss=0.07446, over 21782.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3149, pruned_loss=0.08036, over 4258930.76 frames. 
], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:37:06,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1765512.0, ans=0.0 2023-06-24 14:37:11,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.026e+02 9.250e+02 1.586e+03 2.402e+03 4.902e+03, threshold=3.172e+03, percent-clipped=21.0 2023-06-24 14:37:25,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1765632.0, ans=0.1 2023-06-24 14:37:38,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1765632.0, ans=0.125 2023-06-24 14:38:27,738 INFO [train.py:996] (2/4) Epoch 10, batch 19850, loss[loss=0.2396, simple_loss=0.3393, pruned_loss=0.06995, over 21681.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3099, pruned_loss=0.07677, over 4260652.63 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:38:31,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1765812.0, ans=0.125 2023-06-24 14:38:56,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1765872.0, ans=0.125 2023-06-24 14:38:57,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1765932.0, ans=0.125 2023-06-24 14:38:57,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1765932.0, ans=0.125 2023-06-24 14:38:59,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1765932.0, ans=0.125 2023-06-24 14:40:03,674 INFO [train.py:996] (2/4) Epoch 10, batch 19900, loss[loss=0.2079, simple_loss=0.2896, pruned_loss=0.06311, over 21630.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3088, pruned_loss=0.07416, over 4268824.35 frames. ], batch size: 247, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:40:13,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1766112.0, ans=0.2 2023-06-24 14:40:20,752 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.278e+02 6.106e+02 8.584e+02 1.585e+03 3.373e+03, threshold=1.717e+03, percent-clipped=1.0 2023-06-24 14:40:26,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1766172.0, ans=0.125 2023-06-24 14:40:35,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1766232.0, ans=0.04949747468305833 2023-06-24 14:41:36,297 INFO [train.py:996] (2/4) Epoch 10, batch 19950, loss[loss=0.2069, simple_loss=0.2732, pruned_loss=0.07032, over 21753.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.302, pruned_loss=0.07335, over 4274133.41 frames. ], batch size: 118, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:41:38,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1766412.0, ans=0.0 2023-06-24 14:42:31,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.53 vs. 
limit=15.0 2023-06-24 14:42:38,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1766592.0, ans=0.125 2023-06-24 14:42:46,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1766592.0, ans=0.015 2023-06-24 14:42:53,312 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.84 vs. limit=10.0 2023-06-24 14:43:12,263 INFO [train.py:996] (2/4) Epoch 10, batch 20000, loss[loss=0.2272, simple_loss=0.3095, pruned_loss=0.07246, over 21821.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3027, pruned_loss=0.07375, over 4273327.00 frames. ], batch size: 371, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:43:29,108 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.593e+02 7.271e+02 1.092e+03 1.631e+03 3.154e+03, threshold=2.184e+03, percent-clipped=18.0 2023-06-24 14:44:16,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1766892.0, ans=0.0 2023-06-24 14:44:32,118 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:44:47,355 INFO [train.py:996] (2/4) Epoch 10, batch 20050, loss[loss=0.2494, simple_loss=0.3187, pruned_loss=0.09005, over 21286.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.305, pruned_loss=0.07629, over 4283438.01 frames. ], batch size: 176, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:45:05,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1767072.0, ans=0.0 2023-06-24 14:45:11,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1767072.0, ans=0.125 2023-06-24 14:45:32,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1767132.0, ans=0.0 2023-06-24 14:45:50,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1767192.0, ans=0.125 2023-06-24 14:46:00,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1767192.0, ans=0.2 2023-06-24 14:46:26,756 INFO [train.py:996] (2/4) Epoch 10, batch 20100, loss[loss=0.2584, simple_loss=0.3295, pruned_loss=0.09368, over 21489.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3078, pruned_loss=0.07877, over 4289632.57 frames. ], batch size: 548, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:46:39,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1767312.0, ans=0.0 2023-06-24 14:46:42,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.75 vs. 
limit=5.0 2023-06-24 14:46:43,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1767372.0, ans=0.0 2023-06-24 14:46:51,351 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.584e+02 6.086e+02 7.806e+02 1.176e+03 2.985e+03, threshold=1.561e+03, percent-clipped=5.0 2023-06-24 14:47:11,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1767432.0, ans=0.0 2023-06-24 14:47:59,998 INFO [train.py:996] (2/4) Epoch 10, batch 20150, loss[loss=0.2712, simple_loss=0.3492, pruned_loss=0.09655, over 21547.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3171, pruned_loss=0.0825, over 4294513.76 frames. ], batch size: 414, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:48:30,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1767672.0, ans=0.0 2023-06-24 14:48:34,259 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:49:03,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1767732.0, ans=0.125 2023-06-24 14:49:18,304 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:49:32,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1767852.0, ans=0.125 2023-06-24 14:49:51,305 INFO [train.py:996] (2/4) Epoch 10, batch 20200, loss[loss=0.2612, simple_loss=0.3628, pruned_loss=0.07973, over 20769.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3213, pruned_loss=0.08397, over 4290127.27 frames. ], batch size: 607, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:50:10,611 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.888e+02 8.305e+02 1.166e+03 1.860e+03 3.941e+03, threshold=2.331e+03, percent-clipped=33.0 2023-06-24 14:50:35,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-24 14:50:35,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1768032.0, ans=0.1 2023-06-24 14:50:42,166 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 14:50:50,872 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.36 vs. limit=15.0 2023-06-24 14:51:04,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1768152.0, ans=0.5 2023-06-24 14:51:18,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1768152.0, ans=0.125 2023-06-24 14:51:29,550 INFO [train.py:996] (2/4) Epoch 10, batch 20250, loss[loss=0.1885, simple_loss=0.3126, pruned_loss=0.03218, over 19839.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3216, pruned_loss=0.08221, over 4285725.82 frames. 
], batch size: 703, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:52:37,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1768392.0, ans=0.1 2023-06-24 14:52:41,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1768452.0, ans=0.125 2023-06-24 14:53:05,406 INFO [train.py:996] (2/4) Epoch 10, batch 20300, loss[loss=0.2395, simple_loss=0.323, pruned_loss=0.07799, over 21762.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3198, pruned_loss=0.07972, over 4281429.93 frames. ], batch size: 298, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:53:10,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1768512.0, ans=0.0 2023-06-24 14:53:27,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1768572.0, ans=0.125 2023-06-24 14:53:28,204 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.352e+02 6.165e+02 8.569e+02 1.423e+03 2.886e+03, threshold=1.714e+03, percent-clipped=5.0 2023-06-24 14:53:33,043 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1768572.0, ans=0.125 2023-06-24 14:54:07,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.80 vs. limit=15.0 2023-06-24 14:54:17,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1768752.0, ans=0.1 2023-06-24 14:54:40,792 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=22.5 2023-06-24 14:54:41,363 INFO [train.py:996] (2/4) Epoch 10, batch 20350, loss[loss=0.2483, simple_loss=0.3238, pruned_loss=0.08645, over 21439.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3212, pruned_loss=0.081, over 4269953.02 frames. ], batch size: 131, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 14:54:56,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1768872.0, ans=0.125 2023-06-24 14:56:18,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1769112.0, ans=0.125 2023-06-24 14:56:19,293 INFO [train.py:996] (2/4) Epoch 10, batch 20400, loss[loss=0.2478, simple_loss=0.3274, pruned_loss=0.08412, over 21787.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3227, pruned_loss=0.08323, over 4265577.68 frames. 
], batch size: 298, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:56:42,152 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.538e+02 7.746e+02 1.148e+03 1.668e+03 3.679e+03, threshold=2.297e+03, percent-clipped=22.0 2023-06-24 14:57:11,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1769232.0, ans=0.125 2023-06-24 14:57:23,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1769292.0, ans=0.2 2023-06-24 14:57:49,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1769352.0, ans=0.035 2023-06-24 14:57:55,668 INFO [train.py:996] (2/4) Epoch 10, batch 20450, loss[loss=0.2206, simple_loss=0.2872, pruned_loss=0.07703, over 21715.00 frames. ], tot_loss[loss=0.2481, simple_loss=0.3251, pruned_loss=0.08555, over 4260032.64 frames. ], batch size: 230, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 14:58:04,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.38 vs. limit=22.5 2023-06-24 14:59:19,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1769652.0, ans=0.0 2023-06-24 14:59:25,907 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-24 14:59:32,110 INFO [train.py:996] (2/4) Epoch 10, batch 20500, loss[loss=0.2407, simple_loss=0.301, pruned_loss=0.09021, over 21411.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3223, pruned_loss=0.08574, over 4259916.13 frames. ], batch size: 194, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:00:01,818 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.514e+02 6.913e+02 8.863e+02 1.328e+03 2.262e+03, threshold=1.773e+03, percent-clipped=0.0 2023-06-24 15:00:15,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1769832.0, ans=0.125 2023-06-24 15:00:16,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1769832.0, ans=0.125 2023-06-24 15:00:21,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1769832.0, ans=0.2 2023-06-24 15:00:41,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1769892.0, ans=0.2 2023-06-24 15:01:09,470 INFO [train.py:996] (2/4) Epoch 10, batch 20550, loss[loss=0.197, simple_loss=0.2501, pruned_loss=0.07198, over 20238.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3133, pruned_loss=0.08391, over 4250767.17 frames. 
], batch size: 703, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:01:45,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1770072.0, ans=0.125 2023-06-24 15:01:47,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1770072.0, ans=0.2 2023-06-24 15:02:05,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1770192.0, ans=0.125 2023-06-24 15:02:08,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1770192.0, ans=0.2 2023-06-24 15:02:08,892 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.67 vs. limit=12.0 2023-06-24 15:02:46,090 INFO [train.py:996] (2/4) Epoch 10, batch 20600, loss[loss=0.2535, simple_loss=0.3195, pruned_loss=0.09378, over 21881.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3166, pruned_loss=0.08386, over 4242203.68 frames. ], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:03:06,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1770372.0, ans=0.125 2023-06-24 15:03:15,184 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.661e+02 6.737e+02 1.120e+03 2.042e+03 4.837e+03, threshold=2.240e+03, percent-clipped=29.0 2023-06-24 15:04:17,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.76 vs. limit=22.5 2023-06-24 15:04:21,168 INFO [train.py:996] (2/4) Epoch 10, batch 20650, loss[loss=0.2204, simple_loss=0.2814, pruned_loss=0.07973, over 21867.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3115, pruned_loss=0.08343, over 4243920.96 frames. ], batch size: 98, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:04:28,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-24 15:05:02,360 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-24 15:05:58,595 INFO [train.py:996] (2/4) Epoch 10, batch 20700, loss[loss=0.21, simple_loss=0.2901, pruned_loss=0.06498, over 21789.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3047, pruned_loss=0.07986, over 4252826.78 frames. ], batch size: 282, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:06:23,997 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.424e+02 6.389e+02 9.253e+02 1.399e+03 2.647e+03, threshold=1.851e+03, percent-clipped=4.0 2023-06-24 15:06:37,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1770972.0, ans=0.1 2023-06-24 15:06:51,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1771032.0, ans=0.125 2023-06-24 15:07:42,269 INFO [train.py:996] (2/4) Epoch 10, batch 20750, loss[loss=0.2617, simple_loss=0.3478, pruned_loss=0.08782, over 21789.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3046, pruned_loss=0.07857, over 4238300.71 frames. 
], batch size: 351, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:08:11,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1771272.0, ans=0.0 2023-06-24 15:09:11,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-24 15:09:20,730 INFO [train.py:996] (2/4) Epoch 10, batch 20800, loss[loss=0.2179, simple_loss=0.2753, pruned_loss=0.08025, over 21231.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3096, pruned_loss=0.07963, over 4245617.26 frames. ], batch size: 144, lr: 2.91e-03, grad_scale: 32.0 2023-06-24 15:09:36,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1771512.0, ans=0.2 2023-06-24 15:09:42,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1771572.0, ans=0.125 2023-06-24 15:09:47,340 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.552e+02 1.010e+03 1.567e+03 2.337e+03 4.966e+03, threshold=3.135e+03, percent-clipped=39.0 2023-06-24 15:10:02,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.52 vs. limit=15.0 2023-06-24 15:10:05,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.61 vs. limit=15.0 2023-06-24 15:10:52,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1771752.0, ans=0.1 2023-06-24 15:10:57,059 INFO [train.py:996] (2/4) Epoch 10, batch 20850, loss[loss=0.1799, simple_loss=0.2497, pruned_loss=0.05512, over 21642.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3048, pruned_loss=0.07807, over 4250556.00 frames. ], batch size: 230, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:11:13,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1771812.0, ans=0.125 2023-06-24 15:11:28,866 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1771872.0, ans=0.2 2023-06-24 15:12:18,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1772052.0, ans=0.1 2023-06-24 15:12:31,796 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-24 15:12:33,602 INFO [train.py:996] (2/4) Epoch 10, batch 20900, loss[loss=0.1868, simple_loss=0.2646, pruned_loss=0.05448, over 21772.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3033, pruned_loss=0.07696, over 4251539.83 frames. 
], batch size: 112, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:12:58,858 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:12:59,925 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 6.808e+02 1.169e+03 1.577e+03 3.825e+03, threshold=2.338e+03, percent-clipped=3.0 2023-06-24 15:13:57,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1772352.0, ans=0.125 2023-06-24 15:14:09,061 INFO [train.py:996] (2/4) Epoch 10, batch 20950, loss[loss=0.1991, simple_loss=0.2769, pruned_loss=0.06063, over 21884.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2984, pruned_loss=0.07406, over 4248919.45 frames. ], batch size: 316, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:14:13,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=1772412.0, ans=15.0 2023-06-24 15:14:39,074 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.53 vs. limit=15.0 2023-06-24 15:14:54,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1772532.0, ans=0.0 2023-06-24 15:15:14,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1772592.0, ans=0.125 2023-06-24 15:15:15,224 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.75 vs. limit=15.0 2023-06-24 15:15:44,303 INFO [train.py:996] (2/4) Epoch 10, batch 21000, loss[loss=0.2286, simple_loss=0.303, pruned_loss=0.07707, over 21465.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2958, pruned_loss=0.07364, over 4261389.96 frames. ], batch size: 131, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:15:44,303 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 15:16:03,230 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2634, simple_loss=0.3598, pruned_loss=0.08347, over 1796401.00 frames. 2023-06-24 15:16:03,230 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 15:16:07,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1772712.0, ans=0.0 2023-06-24 15:16:24,561 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.402e+02 6.763e+02 8.645e+02 1.170e+03 2.024e+03, threshold=1.729e+03, percent-clipped=0.0 2023-06-24 15:16:34,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1772772.0, ans=0.0 2023-06-24 15:16:46,839 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-24 15:17:02,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-24 15:17:06,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.01 vs. limit=12.0 2023-06-24 15:17:32,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.11 vs. 
limit=15.0 2023-06-24 15:17:33,514 INFO [train.py:996] (2/4) Epoch 10, batch 21050, loss[loss=0.2232, simple_loss=0.2953, pruned_loss=0.07557, over 21786.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2958, pruned_loss=0.07449, over 4264475.90 frames. ], batch size: 371, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:18:24,163 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:19:00,323 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-24 15:19:08,747 INFO [train.py:996] (2/4) Epoch 10, batch 21100, loss[loss=0.2844, simple_loss=0.3135, pruned_loss=0.1276, over 21464.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2931, pruned_loss=0.07489, over 4254767.42 frames. ], batch size: 511, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:19:36,716 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.103e+02 5.905e+02 7.931e+02 1.116e+03 2.788e+03, threshold=1.586e+03, percent-clipped=2.0 2023-06-24 15:19:39,091 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.32 vs. limit=10.0 2023-06-24 15:20:14,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1773492.0, ans=0.1 2023-06-24 15:20:45,196 INFO [train.py:996] (2/4) Epoch 10, batch 21150, loss[loss=0.229, simple_loss=0.2908, pruned_loss=0.08354, over 21681.00 frames. ], tot_loss[loss=0.2198, simple_loss=0.2892, pruned_loss=0.07517, over 4256580.70 frames. ], batch size: 333, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:20:50,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1773612.0, ans=0.125 2023-06-24 15:21:52,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1773792.0, ans=0.1 2023-06-24 15:22:01,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1773852.0, ans=0.0 2023-06-24 15:22:18,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1773852.0, ans=0.2 2023-06-24 15:22:21,486 INFO [train.py:996] (2/4) Epoch 10, batch 21200, loss[loss=0.1895, simple_loss=0.2616, pruned_loss=0.0587, over 21407.00 frames. ], tot_loss[loss=0.2179, simple_loss=0.2862, pruned_loss=0.0748, over 4258616.81 frames. ], batch size: 194, lr: 2.91e-03, grad_scale: 16.0 2023-06-24 15:22:49,819 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.351e+02 6.353e+02 8.503e+02 1.111e+03 2.488e+03, threshold=1.701e+03, percent-clipped=3.0 2023-06-24 15:23:30,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774092.0, ans=0.1 2023-06-24 15:23:47,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774152.0, ans=0.1 2023-06-24 15:23:51,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1774152.0, ans=0.1 2023-06-24 15:23:57,718 INFO [train.py:996] (2/4) Epoch 10, batch 21250, loss[loss=0.2045, simple_loss=0.2727, pruned_loss=0.06814, over 21816.00 frames. 
], tot_loss[loss=0.2164, simple_loss=0.2838, pruned_loss=0.07452, over 4262203.65 frames. ], batch size: 98, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:24:12,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1774272.0, ans=0.0 2023-06-24 15:24:57,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1774392.0, ans=0.125 2023-06-24 15:25:27,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1774452.0, ans=0.125 2023-06-24 15:25:32,738 INFO [train.py:996] (2/4) Epoch 10, batch 21300, loss[loss=0.2023, simple_loss=0.2611, pruned_loss=0.07174, over 16300.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2907, pruned_loss=0.07672, over 4261979.67 frames. ], batch size: 60, lr: 2.91e-03, grad_scale: 8.0 2023-06-24 15:25:48,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1774572.0, ans=0.0 2023-06-24 15:25:59,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1774572.0, ans=0.125 2023-06-24 15:26:02,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 7.124e+02 9.830e+02 1.357e+03 3.184e+03, threshold=1.966e+03, percent-clipped=15.0 2023-06-24 15:26:04,815 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-24 15:26:24,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-24 15:26:34,390 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:26:37,878 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-24 15:26:57,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-24 15:27:10,394 INFO [train.py:996] (2/4) Epoch 10, batch 21350, loss[loss=0.2648, simple_loss=0.3437, pruned_loss=0.093, over 21636.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2955, pruned_loss=0.07776, over 4275164.94 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:27:17,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1774812.0, ans=0.0 2023-06-24 15:27:47,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1774872.0, ans=0.1 2023-06-24 15:28:06,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1774932.0, ans=0.1 2023-06-24 15:28:06,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1774932.0, ans=0.0 2023-06-24 15:28:08,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.41 vs. 
limit=15.0 2023-06-24 15:28:28,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1774992.0, ans=0.0 2023-06-24 15:28:48,139 INFO [train.py:996] (2/4) Epoch 10, batch 21400, loss[loss=0.2849, simple_loss=0.3597, pruned_loss=0.1051, over 21927.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2967, pruned_loss=0.07605, over 4278355.23 frames. ], batch size: 372, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:29:23,241 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.641e+02 5.834e+02 7.962e+02 1.308e+03 2.363e+03, threshold=1.592e+03, percent-clipped=6.0 2023-06-24 15:29:25,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1775172.0, ans=0.2 2023-06-24 15:29:28,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1775232.0, ans=0.05 2023-06-24 15:30:03,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1775292.0, ans=0.125 2023-06-24 15:30:15,157 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-24 15:30:24,975 INFO [train.py:996] (2/4) Epoch 10, batch 21450, loss[loss=0.2528, simple_loss=0.3188, pruned_loss=0.09335, over 21709.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3019, pruned_loss=0.07849, over 4285169.92 frames. ], batch size: 473, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:30:49,052 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.50 vs. limit=6.0 2023-06-24 15:31:12,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1775532.0, ans=0.1 2023-06-24 15:31:29,008 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 15:31:50,181 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-24 15:31:53,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1775652.0, ans=10.0 2023-06-24 15:31:54,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1775652.0, ans=0.1 2023-06-24 15:31:54,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1775652.0, ans=0.2 2023-06-24 15:32:06,973 INFO [train.py:996] (2/4) Epoch 10, batch 21500, loss[loss=0.2331, simple_loss=0.2907, pruned_loss=0.0877, over 21557.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3013, pruned_loss=0.07885, over 4276609.17 frames. 
], batch size: 391, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:32:36,245 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.266e+02 7.584e+02 1.027e+03 1.446e+03 3.225e+03, threshold=2.054e+03, percent-clipped=19.0 2023-06-24 15:32:55,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1775832.0, ans=0.2 2023-06-24 15:33:07,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1775892.0, ans=0.125 2023-06-24 15:33:33,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1775952.0, ans=0.0 2023-06-24 15:33:37,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1776012.0, ans=0.0 2023-06-24 15:33:45,357 INFO [train.py:996] (2/4) Epoch 10, batch 21550, loss[loss=0.202, simple_loss=0.2689, pruned_loss=0.06755, over 21321.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.2946, pruned_loss=0.07651, over 4263224.22 frames. ], batch size: 131, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:33:47,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1776012.0, ans=0.2 2023-06-24 15:33:47,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776012.0, ans=0.1 2023-06-24 15:33:51,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1776012.0, ans=0.125 2023-06-24 15:33:52,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-24 15:33:55,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=22.5 2023-06-24 15:34:01,178 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1776012.0, ans=0.035 2023-06-24 15:34:02,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1776012.0, ans=0.0 2023-06-24 15:34:28,705 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.35 vs. limit=22.5 2023-06-24 15:35:22,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776252.0, ans=0.1 2023-06-24 15:35:25,096 INFO [train.py:996] (2/4) Epoch 10, batch 21600, loss[loss=0.2172, simple_loss=0.3145, pruned_loss=0.05999, over 21619.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2906, pruned_loss=0.07531, over 4269783.72 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:35:26,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1776312.0, ans=0.125 2023-06-24 15:35:41,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1776312.0, ans=0.2 2023-06-24 15:35:54,474 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.08 vs. 
limit=22.5 2023-06-24 15:36:01,299 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 8.052e+02 1.212e+03 1.997e+03 4.912e+03, threshold=2.424e+03, percent-clipped=21.0 2023-06-24 15:36:14,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1776432.0, ans=0.125 2023-06-24 15:36:20,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1776432.0, ans=0.125 2023-06-24 15:36:38,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1776492.0, ans=0.0 2023-06-24 15:36:57,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776552.0, ans=0.1 2023-06-24 15:37:01,695 INFO [train.py:996] (2/4) Epoch 10, batch 21650, loss[loss=0.2025, simple_loss=0.2804, pruned_loss=0.06224, over 21802.00 frames. ], tot_loss[loss=0.2199, simple_loss=0.2931, pruned_loss=0.07339, over 4269804.06 frames. ], batch size: 102, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:37:46,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1776732.0, ans=0.1 2023-06-24 15:38:08,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1776792.0, ans=0.125 2023-06-24 15:38:22,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1776852.0, ans=0.07 2023-06-24 15:38:22,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1776852.0, ans=0.04949747468305833 2023-06-24 15:38:37,594 INFO [train.py:996] (2/4) Epoch 10, batch 21700, loss[loss=0.2063, simple_loss=0.275, pruned_loss=0.06884, over 21864.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2931, pruned_loss=0.07188, over 4270813.55 frames. ], batch size: 107, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:39:11,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.59 vs. limit=10.0 2023-06-24 15:39:12,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.168e+02 6.660e+02 9.555e+02 1.550e+03 3.491e+03, threshold=1.911e+03, percent-clipped=7.0 2023-06-24 15:39:27,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1777032.0, ans=0.0 2023-06-24 15:39:47,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1777092.0, ans=0.0 2023-06-24 15:40:11,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1777212.0, ans=0.0 2023-06-24 15:40:13,238 INFO [train.py:996] (2/4) Epoch 10, batch 21750, loss[loss=0.1884, simple_loss=0.2593, pruned_loss=0.05873, over 21777.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.29, pruned_loss=0.07168, over 4258269.10 frames. 
], batch size: 317, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:40:43,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777272.0, ans=0.1 2023-06-24 15:41:23,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1777392.0, ans=0.1 2023-06-24 15:41:27,571 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=12.0 2023-06-24 15:41:50,252 INFO [train.py:996] (2/4) Epoch 10, batch 21800, loss[loss=0.289, simple_loss=0.349, pruned_loss=0.1145, over 21416.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2891, pruned_loss=0.07319, over 4267361.94 frames. ], batch size: 473, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:41:52,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.whiten.whitening_limit, batch_count=1777512.0, ans=12.0 2023-06-24 15:42:06,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1777512.0, ans=0.125 2023-06-24 15:42:19,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.09 vs. limit=15.0 2023-06-24 15:42:23,844 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.79 vs. limit=15.0 2023-06-24 15:42:25,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.465e+02 6.460e+02 8.682e+02 1.144e+03 2.406e+03, threshold=1.736e+03, percent-clipped=3.0 2023-06-24 15:43:09,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-24 15:43:25,426 INFO [train.py:996] (2/4) Epoch 10, batch 21850, loss[loss=0.215, simple_loss=0.2723, pruned_loss=0.0788, over 21187.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2959, pruned_loss=0.07427, over 4253532.15 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:43:33,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1777812.0, ans=0.1 2023-06-24 15:44:22,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1777932.0, ans=0.125 2023-06-24 15:44:41,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1778052.0, ans=0.125 2023-06-24 15:44:56,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1778052.0, ans=0.1 2023-06-24 15:45:05,274 INFO [train.py:996] (2/4) Epoch 10, batch 21900, loss[loss=0.2034, simple_loss=0.2631, pruned_loss=0.07188, over 21478.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2949, pruned_loss=0.07579, over 4264577.77 frames. 
], batch size: 212, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:45:10,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1778112.0, ans=0.2 2023-06-24 15:45:16,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1778112.0, ans=0.125 2023-06-24 15:45:27,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1778172.0, ans=0.035 2023-06-24 15:45:36,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.519e+02 8.112e+02 1.126e+03 1.862e+03 4.122e+03, threshold=2.252e+03, percent-clipped=27.0 2023-06-24 15:45:36,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1778172.0, ans=0.0 2023-06-24 15:45:36,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1778172.0, ans=0.125 2023-06-24 15:46:05,150 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.95 vs. limit=8.0 2023-06-24 15:46:25,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1778352.0, ans=0.0 2023-06-24 15:46:27,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=15.0 2023-06-24 15:46:42,328 INFO [train.py:996] (2/4) Epoch 10, batch 21950, loss[loss=0.2084, simple_loss=0.2726, pruned_loss=0.0721, over 21814.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2896, pruned_loss=0.07453, over 4266169.26 frames. ], batch size: 118, lr: 2.90e-03, grad_scale: 8.0 2023-06-24 15:48:05,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1778652.0, ans=0.2 2023-06-24 15:48:18,377 INFO [train.py:996] (2/4) Epoch 10, batch 22000, loss[loss=0.171, simple_loss=0.2424, pruned_loss=0.04983, over 21617.00 frames. ], tot_loss[loss=0.2131, simple_loss=0.2836, pruned_loss=0.07127, over 4255875.79 frames. ], batch size: 263, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:48:18,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1778712.0, ans=0.0 2023-06-24 15:48:19,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-24 15:48:54,795 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.014e+02 5.065e+02 6.961e+02 1.085e+03 3.109e+03, threshold=1.392e+03, percent-clipped=2.0 2023-06-24 15:49:09,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1778832.0, ans=0.09899494936611666 2023-06-24 15:50:02,059 INFO [train.py:996] (2/4) Epoch 10, batch 22050, loss[loss=0.1779, simple_loss=0.2524, pruned_loss=0.05173, over 20874.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2893, pruned_loss=0.07242, over 4246898.47 frames. ], batch size: 609, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:50:26,284 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.07 vs. 
limit=15.0 2023-06-24 15:51:38,357 INFO [train.py:996] (2/4) Epoch 10, batch 22100, loss[loss=0.2371, simple_loss=0.3079, pruned_loss=0.0832, over 21963.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2985, pruned_loss=0.0769, over 4253404.74 frames. ], batch size: 316, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:51:43,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1779312.0, ans=0.125 2023-06-24 15:51:49,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1779312.0, ans=0.1 2023-06-24 15:51:49,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1779312.0, ans=0.2 2023-06-24 15:52:08,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1779372.0, ans=0.125 2023-06-24 15:52:09,367 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.723e+02 7.259e+02 1.034e+03 1.568e+03 3.837e+03, threshold=2.069e+03, percent-clipped=34.0 2023-06-24 15:53:16,263 INFO [train.py:996] (2/4) Epoch 10, batch 22150, loss[loss=0.2086, simple_loss=0.273, pruned_loss=0.07208, over 21275.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3025, pruned_loss=0.07847, over 4263055.26 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:53:29,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-24 15:53:50,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=1779732.0, ans=10.0 2023-06-24 15:54:54,165 INFO [train.py:996] (2/4) Epoch 10, batch 22200, loss[loss=0.247, simple_loss=0.3387, pruned_loss=0.07769, over 21727.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3066, pruned_loss=0.08005, over 4272705.83 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:55:24,792 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.644e+02 6.933e+02 1.129e+03 1.517e+03 2.505e+03, threshold=2.259e+03, percent-clipped=10.0 2023-06-24 15:55:25,731 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-24 15:55:26,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1779972.0, ans=0.0 2023-06-24 15:56:03,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-24 15:56:31,899 INFO [train.py:996] (2/4) Epoch 10, batch 22250, loss[loss=0.3044, simple_loss=0.372, pruned_loss=0.1184, over 21552.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.314, pruned_loss=0.08213, over 4269544.48 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:57:31,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1780392.0, ans=0.0 2023-06-24 15:58:06,587 INFO [train.py:996] (2/4) Epoch 10, batch 22300, loss[loss=0.2023, simple_loss=0.2691, pruned_loss=0.06777, over 21846.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3149, pruned_loss=0.08364, over 4274963.39 frames. 
], batch size: 247, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 15:58:37,328 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.918e+02 7.899e+02 1.099e+03 1.483e+03 2.745e+03, threshold=2.199e+03, percent-clipped=4.0 2023-06-24 15:58:47,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1780632.0, ans=0.0 2023-06-24 15:59:03,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1780692.0, ans=0.125 2023-06-24 15:59:18,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1780692.0, ans=0.2 2023-06-24 15:59:34,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1780752.0, ans=0.1 2023-06-24 15:59:43,737 INFO [train.py:996] (2/4) Epoch 10, batch 22350, loss[loss=0.2182, simple_loss=0.2857, pruned_loss=0.07534, over 21570.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3138, pruned_loss=0.08489, over 4285907.54 frames. ], batch size: 548, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:00:52,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1780992.0, ans=0.125 2023-06-24 16:01:25,435 INFO [train.py:996] (2/4) Epoch 10, batch 22400, loss[loss=0.2499, simple_loss=0.3091, pruned_loss=0.09532, over 21500.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3093, pruned_loss=0.08093, over 4289042.65 frames. ], batch size: 441, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:01:32,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1781112.0, ans=0.125 2023-06-24 16:01:51,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.474e+02 7.856e+02 9.897e+02 1.374e+03 2.984e+03, threshold=1.979e+03, percent-clipped=5.0 2023-06-24 16:02:21,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.00 vs. limit=10.0 2023-06-24 16:02:52,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1781352.0, ans=0.125 2023-06-24 16:02:56,371 INFO [train.py:996] (2/4) Epoch 10, batch 22450, loss[loss=0.2209, simple_loss=0.2724, pruned_loss=0.08473, over 21610.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.303, pruned_loss=0.08003, over 4282496.01 frames. ], batch size: 231, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:02:59,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1781412.0, ans=0.125 2023-06-24 16:03:58,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1781592.0, ans=0.125 2023-06-24 16:04:33,429 INFO [train.py:996] (2/4) Epoch 10, batch 22500, loss[loss=0.2183, simple_loss=0.3015, pruned_loss=0.0675, over 21625.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.2981, pruned_loss=0.07888, over 4287704.99 frames. 
], batch size: 263, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:04:33,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1781712.0, ans=0.1 2023-06-24 16:04:42,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1781712.0, ans=0.2 2023-06-24 16:04:43,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1781712.0, ans=0.1 2023-06-24 16:04:52,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-24 16:05:06,695 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.137e+02 1.060e+03 1.856e+03 3.830e+03, threshold=2.121e+03, percent-clipped=17.0 2023-06-24 16:05:07,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1781772.0, ans=0.125 2023-06-24 16:05:41,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1781892.0, ans=0.0 2023-06-24 16:06:08,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1781952.0, ans=0.1 2023-06-24 16:06:10,749 INFO [train.py:996] (2/4) Epoch 10, batch 22550, loss[loss=0.2441, simple_loss=0.3143, pruned_loss=0.08697, over 21931.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3001, pruned_loss=0.07864, over 4284961.83 frames. ], batch size: 316, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:07:05,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1782132.0, ans=0.125 2023-06-24 16:07:49,269 INFO [train.py:996] (2/4) Epoch 10, batch 22600, loss[loss=0.2455, simple_loss=0.3271, pruned_loss=0.08201, over 21748.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3046, pruned_loss=0.07955, over 4285539.42 frames. ], batch size: 351, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:07:55,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1782312.0, ans=0.2 2023-06-24 16:08:16,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1782372.0, ans=0.125 2023-06-24 16:08:27,262 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.566e+02 7.815e+02 1.188e+03 1.926e+03 4.524e+03, threshold=2.375e+03, percent-clipped=20.0 2023-06-24 16:09:18,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1782552.0, ans=0.0 2023-06-24 16:09:25,460 INFO [train.py:996] (2/4) Epoch 10, batch 22650, loss[loss=0.2034, simple_loss=0.2715, pruned_loss=0.06765, over 21625.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3012, pruned_loss=0.07904, over 4270228.25 frames. 
], batch size: 298, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:09:28,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1782612.0, ans=0.0 2023-06-24 16:10:37,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1782792.0, ans=0.125 2023-06-24 16:10:38,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1782792.0, ans=0.1 2023-06-24 16:10:42,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1782792.0, ans=0.125 2023-06-24 16:10:43,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1782792.0, ans=0.5 2023-06-24 16:10:45,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1782852.0, ans=0.125 2023-06-24 16:10:52,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1782852.0, ans=0.125 2023-06-24 16:11:01,271 INFO [train.py:996] (2/4) Epoch 10, batch 22700, loss[loss=0.2123, simple_loss=0.2705, pruned_loss=0.07711, over 21639.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2953, pruned_loss=0.07857, over 4274196.99 frames. ], batch size: 333, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:11:08,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=1782912.0, ans=15.0 2023-06-24 16:11:09,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1782912.0, ans=0.125 2023-06-24 16:11:38,440 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.321e+02 7.699e+02 1.029e+03 1.382e+03 2.516e+03, threshold=2.058e+03, percent-clipped=2.0 2023-06-24 16:11:39,380 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.39 vs. limit=15.0 2023-06-24 16:11:54,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1783032.0, ans=0.125 2023-06-24 16:12:09,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1783092.0, ans=0.125 2023-06-24 16:12:28,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1783152.0, ans=0.95 2023-06-24 16:12:37,734 INFO [train.py:996] (2/4) Epoch 10, batch 22750, loss[loss=0.2531, simple_loss=0.3698, pruned_loss=0.06822, over 19673.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.2991, pruned_loss=0.08075, over 4274252.44 frames. ], batch size: 703, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:12:45,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1783212.0, ans=0.04949747468305833 2023-06-24 16:12:59,230 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.44 vs. 
limit=12.0 2023-06-24 16:14:06,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1783452.0, ans=0.125 2023-06-24 16:14:14,057 INFO [train.py:996] (2/4) Epoch 10, batch 22800, loss[loss=0.2636, simple_loss=0.3271, pruned_loss=0.1, over 21732.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.302, pruned_loss=0.08296, over 4278687.09 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 32.0 2023-06-24 16:14:51,218 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.113e+02 9.771e+02 1.479e+03 3.289e+03, threshold=1.954e+03, percent-clipped=6.0 2023-06-24 16:15:07,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1783632.0, ans=0.035 2023-06-24 16:15:39,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1783752.0, ans=0.125 2023-06-24 16:15:49,973 INFO [train.py:996] (2/4) Epoch 10, batch 22850, loss[loss=0.2314, simple_loss=0.281, pruned_loss=0.09093, over 21257.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.2989, pruned_loss=0.0824, over 4274936.03 frames. ], batch size: 176, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:16:05,758 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1783872.0, ans=0.125 2023-06-24 16:17:18,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1784052.0, ans=0.125 2023-06-24 16:17:27,969 INFO [train.py:996] (2/4) Epoch 10, batch 22900, loss[loss=0.2809, simple_loss=0.3823, pruned_loss=0.08977, over 21652.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3014, pruned_loss=0.08137, over 4274402.37 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:17:30,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1784112.0, ans=0.2 2023-06-24 16:18:12,788 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 7.158e+02 1.057e+03 1.638e+03 3.126e+03, threshold=2.114e+03, percent-clipped=14.0 2023-06-24 16:18:28,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1784232.0, ans=0.125 2023-06-24 16:19:09,137 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.13 vs. limit=10.0 2023-06-24 16:19:16,541 INFO [train.py:996] (2/4) Epoch 10, batch 22950, loss[loss=0.2844, simple_loss=0.4089, pruned_loss=0.07991, over 21591.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3133, pruned_loss=0.07923, over 4276101.24 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:19:37,146 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:20:11,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1784592.0, ans=0.125 2023-06-24 16:20:26,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1784652.0, ans=0.125 2023-06-24 16:20:52,504 INFO [train.py:996] (2/4) Epoch 10, batch 23000, loss[loss=0.2421, simple_loss=0.3155, pruned_loss=0.08435, over 21828.00 frames. 
], tot_loss[loss=0.2347, simple_loss=0.3149, pruned_loss=0.07726, over 4281214.36 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:21:28,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-24 16:21:30,659 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.479e+02 7.106e+02 9.780e+02 1.454e+03 3.933e+03, threshold=1.956e+03, percent-clipped=7.0 2023-06-24 16:21:39,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1784832.0, ans=0.0 2023-06-24 16:22:31,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1784952.0, ans=0.125 2023-06-24 16:22:36,120 INFO [train.py:996] (2/4) Epoch 10, batch 23050, loss[loss=0.2618, simple_loss=0.3353, pruned_loss=0.09414, over 21588.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.316, pruned_loss=0.07971, over 4281707.80 frames. ], batch size: 389, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:22:41,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1785012.0, ans=0.0 2023-06-24 16:22:47,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1785012.0, ans=0.0 2023-06-24 16:23:08,193 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.86 vs. limit=22.5 2023-06-24 16:23:09,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.09 vs. limit=15.0 2023-06-24 16:23:10,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1785132.0, ans=0.125 2023-06-24 16:24:13,714 INFO [train.py:996] (2/4) Epoch 10, batch 23100, loss[loss=0.1979, simple_loss=0.2681, pruned_loss=0.06383, over 21379.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3114, pruned_loss=0.07994, over 4273119.42 frames. ], batch size: 131, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:24:34,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.83 vs. limit=15.0 2023-06-24 16:24:47,869 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.180e+02 6.884e+02 9.412e+02 1.257e+03 2.198e+03, threshold=1.882e+03, percent-clipped=3.0 2023-06-24 16:24:51,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1785432.0, ans=0.0 2023-06-24 16:25:18,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1785492.0, ans=0.1 2023-06-24 16:25:32,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1785552.0, ans=0.0 2023-06-24 16:25:35,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1785552.0, ans=0.0 2023-06-24 16:25:50,175 INFO [train.py:996] (2/4) Epoch 10, batch 23150, loss[loss=0.2374, simple_loss=0.3035, pruned_loss=0.0857, over 21249.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3054, pruned_loss=0.07971, over 4281092.35 frames. 
], batch size: 159, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:25:53,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1785612.0, ans=10.0 2023-06-24 16:26:34,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.87 vs. limit=8.0 2023-06-24 16:26:40,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1785792.0, ans=0.125 2023-06-24 16:26:45,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1785792.0, ans=0.2 2023-06-24 16:26:46,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1785792.0, ans=0.2 2023-06-24 16:26:52,836 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:27:20,627 INFO [train.py:996] (2/4) Epoch 10, batch 23200, loss[loss=0.1973, simple_loss=0.2712, pruned_loss=0.06173, over 21663.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3051, pruned_loss=0.08106, over 4281366.37 frames. ], batch size: 263, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:27:59,995 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.595e+02 6.989e+02 9.310e+02 1.266e+03 2.936e+03, threshold=1.862e+03, percent-clipped=9.0 2023-06-24 16:28:03,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1786032.0, ans=0.125 2023-06-24 16:28:06,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1786032.0, ans=0.2 2023-06-24 16:28:14,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1786092.0, ans=0.1 2023-06-24 16:28:19,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.53 vs. limit=12.0 2023-06-24 16:28:39,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1786152.0, ans=0.125 2023-06-24 16:28:42,568 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:28:42,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1786152.0, ans=0.125 2023-06-24 16:29:01,237 INFO [train.py:996] (2/4) Epoch 10, batch 23250, loss[loss=0.2167, simple_loss=0.2791, pruned_loss=0.07717, over 21668.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3044, pruned_loss=0.08186, over 4288061.28 frames. ], batch size: 263, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:29:22,185 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.52 vs. limit=8.0 2023-06-24 16:29:26,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1786272.0, ans=0.125 2023-06-24 16:30:38,196 INFO [train.py:996] (2/4) Epoch 10, batch 23300, loss[loss=0.2992, simple_loss=0.4048, pruned_loss=0.09686, over 21694.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3134, pruned_loss=0.08408, over 4291014.92 frames. 
], batch size: 441, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:31:14,184 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.511e+02 8.047e+02 1.147e+03 1.597e+03 3.212e+03, threshold=2.293e+03, percent-clipped=17.0 2023-06-24 16:31:36,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1786692.0, ans=0.5 2023-06-24 16:31:38,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=15.0 2023-06-24 16:31:57,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1786692.0, ans=0.0 2023-06-24 16:32:20,651 INFO [train.py:996] (2/4) Epoch 10, batch 23350, loss[loss=0.2075, simple_loss=0.2956, pruned_loss=0.05973, over 21672.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3177, pruned_loss=0.08335, over 4287740.94 frames. ], batch size: 414, lr: 2.90e-03, grad_scale: 16.0 2023-06-24 16:32:40,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.50 vs. limit=10.0 2023-06-24 16:32:51,225 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.65 vs. limit=15.0 2023-06-24 16:33:00,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1786932.0, ans=0.125 2023-06-24 16:33:08,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1786932.0, ans=0.125 2023-06-24 16:33:14,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1786992.0, ans=0.2 2023-06-24 16:33:57,547 INFO [train.py:996] (2/4) Epoch 10, batch 23400, loss[loss=0.2369, simple_loss=0.3089, pruned_loss=0.08245, over 21735.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3106, pruned_loss=0.07952, over 4284527.24 frames. ], batch size: 112, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:34:18,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1787172.0, ans=0.0 2023-06-24 16:34:28,899 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.034e+02 7.282e+02 9.879e+02 1.387e+03 3.167e+03, threshold=1.976e+03, percent-clipped=3.0 2023-06-24 16:35:15,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1787352.0, ans=0.0 2023-06-24 16:35:15,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1787352.0, ans=0.0 2023-06-24 16:35:25,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1787352.0, ans=0.125 2023-06-24 16:35:34,824 INFO [train.py:996] (2/4) Epoch 10, batch 23450, loss[loss=0.2468, simple_loss=0.3141, pruned_loss=0.08973, over 21742.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3103, pruned_loss=0.08142, over 4284832.25 frames. 
], batch size: 351, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:35:50,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1787472.0, ans=0.125 2023-06-24 16:35:53,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1787472.0, ans=0.09899494936611666 2023-06-24 16:36:05,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1787532.0, ans=0.125 2023-06-24 16:36:20,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1787532.0, ans=0.035 2023-06-24 16:36:59,571 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1787652.0, ans=0.0 2023-06-24 16:37:09,826 INFO [train.py:996] (2/4) Epoch 10, batch 23500, loss[loss=0.2089, simple_loss=0.2742, pruned_loss=0.07179, over 21619.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3112, pruned_loss=0.08356, over 4285997.59 frames. ], batch size: 548, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:37:45,943 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 6.777e+02 9.961e+02 1.518e+03 3.385e+03, threshold=1.992e+03, percent-clipped=9.0 2023-06-24 16:38:37,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-24 16:38:42,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-24 16:38:46,190 INFO [train.py:996] (2/4) Epoch 10, batch 23550, loss[loss=0.2092, simple_loss=0.2761, pruned_loss=0.07112, over 21798.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.307, pruned_loss=0.08326, over 4280423.63 frames. ], batch size: 118, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:39:40,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.32 vs. limit=15.0 2023-06-24 16:39:58,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788192.0, ans=0.1 2023-06-24 16:40:01,243 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:40:18,340 INFO [train.py:996] (2/4) Epoch 10, batch 23600, loss[loss=0.3481, simple_loss=0.3906, pruned_loss=0.1528, over 21326.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3069, pruned_loss=0.08287, over 4276366.77 frames. 
], batch size: 508, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:40:20,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1788312.0, ans=0.125 2023-06-24 16:40:29,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1788312.0, ans=0.125 2023-06-24 16:40:34,752 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1788372.0, ans=0.125 2023-06-24 16:41:05,496 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.671e+02 8.088e+02 1.163e+03 1.528e+03 3.406e+03, threshold=2.327e+03, percent-clipped=14.0 2023-06-24 16:41:07,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1788432.0, ans=0.1 2023-06-24 16:41:09,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1788432.0, ans=0.125 2023-06-24 16:41:49,988 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-24 16:41:55,353 INFO [train.py:996] (2/4) Epoch 10, batch 23650, loss[loss=0.2297, simple_loss=0.3113, pruned_loss=0.07406, over 21903.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3073, pruned_loss=0.08105, over 4282830.19 frames. ], batch size: 316, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:42:19,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1788672.0, ans=0.0 2023-06-24 16:43:11,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1788792.0, ans=0.125 2023-06-24 16:43:37,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1788912.0, ans=0.125 2023-06-24 16:43:38,447 INFO [train.py:996] (2/4) Epoch 10, batch 23700, loss[loss=0.2271, simple_loss=0.3132, pruned_loss=0.07044, over 21304.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3106, pruned_loss=0.08103, over 4281346.67 frames. ], batch size: 549, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:44:04,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1788972.0, ans=0.2 2023-06-24 16:44:06,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1788972.0, ans=0.125 2023-06-24 16:44:26,159 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.212e+02 6.427e+02 8.775e+02 1.177e+03 2.225e+03, threshold=1.755e+03, percent-clipped=0.0 2023-06-24 16:44:26,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1789032.0, ans=0.125 2023-06-24 16:45:22,434 INFO [train.py:996] (2/4) Epoch 10, batch 23750, loss[loss=0.2231, simple_loss=0.3259, pruned_loss=0.06019, over 21581.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3118, pruned_loss=0.08102, over 4275996.09 frames. ], batch size: 414, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:46:12,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.49 vs. 
limit=15.0 2023-06-24 16:46:32,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1789392.0, ans=0.05 2023-06-24 16:47:01,107 INFO [train.py:996] (2/4) Epoch 10, batch 23800, loss[loss=0.272, simple_loss=0.3948, pruned_loss=0.07455, over 19927.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3102, pruned_loss=0.07924, over 4277281.92 frames. ], batch size: 702, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:47:33,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1789572.0, ans=0.125 2023-06-24 16:47:34,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1789572.0, ans=0.0 2023-06-24 16:47:39,066 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.271e+02 7.251e+02 1.165e+03 1.659e+03 4.396e+03, threshold=2.330e+03, percent-clipped=22.0 2023-06-24 16:47:46,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1789632.0, ans=0.125 2023-06-24 16:48:03,078 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.44 vs. limit=15.0 2023-06-24 16:48:05,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789692.0, ans=0.1 2023-06-24 16:48:25,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1789752.0, ans=0.2 2023-06-24 16:48:25,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1789752.0, ans=0.1 2023-06-24 16:48:44,370 INFO [train.py:996] (2/4) Epoch 10, batch 23850, loss[loss=0.2423, simple_loss=0.3221, pruned_loss=0.08122, over 21594.00 frames. ], tot_loss[loss=0.24, simple_loss=0.318, pruned_loss=0.08097, over 4274195.96 frames. ], batch size: 389, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:49:41,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1789992.0, ans=0.1 2023-06-24 16:50:08,092 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=15.0 2023-06-24 16:50:20,387 INFO [train.py:996] (2/4) Epoch 10, batch 23900, loss[loss=0.2296, simple_loss=0.2873, pruned_loss=0.08597, over 20264.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3218, pruned_loss=0.08279, over 4275469.43 frames. ], batch size: 703, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:50:22,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1790112.0, ans=0.2 2023-06-24 16:50:59,820 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 1.066e+03 1.472e+03 2.085e+03 4.372e+03, threshold=2.943e+03, percent-clipped=19.0 2023-06-24 16:51:41,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1790352.0, ans=0.125 2023-06-24 16:51:58,851 INFO [train.py:996] (2/4) Epoch 10, batch 23950, loss[loss=0.1938, simple_loss=0.2555, pruned_loss=0.06604, over 21589.00 frames. 
], tot_loss[loss=0.2428, simple_loss=0.3188, pruned_loss=0.08339, over 4263932.62 frames. ], batch size: 231, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 16:52:17,123 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:53:36,850 INFO [train.py:996] (2/4) Epoch 10, batch 24000, loss[loss=0.2665, simple_loss=0.3327, pruned_loss=0.1002, over 21813.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3207, pruned_loss=0.08656, over 4269701.53 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:53:36,851 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 16:53:52,748 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2655, simple_loss=0.3589, pruned_loss=0.08609, over 1796401.00 frames. 2023-06-24 16:53:52,749 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 16:53:54,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1790712.0, ans=0.0 2023-06-24 16:54:13,854 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 16:54:41,003 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.470e+02 7.229e+02 9.460e+02 1.386e+03 2.838e+03, threshold=1.892e+03, percent-clipped=0.0 2023-06-24 16:54:57,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1790832.0, ans=0.025 2023-06-24 16:55:23,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.03 vs. limit=15.0 2023-06-24 16:55:31,743 INFO [train.py:996] (2/4) Epoch 10, batch 24050, loss[loss=0.2342, simple_loss=0.3219, pruned_loss=0.07322, over 21748.00 frames. ], tot_loss[loss=0.2487, simple_loss=0.323, pruned_loss=0.08718, over 4269856.01 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:56:09,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1791072.0, ans=0.2 2023-06-24 16:56:24,852 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.77 vs. limit=22.5 2023-06-24 16:56:33,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1791132.0, ans=0.0 2023-06-24 16:57:03,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1791252.0, ans=0.0 2023-06-24 16:57:13,054 INFO [train.py:996] (2/4) Epoch 10, batch 24100, loss[loss=0.2309, simple_loss=0.3189, pruned_loss=0.07144, over 21693.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3215, pruned_loss=0.08454, over 4269926.12 frames. 
], batch size: 298, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 16:57:14,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1791312.0, ans=0.0 2023-06-24 16:57:49,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1791372.0, ans=0.09899494936611666 2023-06-24 16:58:01,292 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 7.131e+02 9.675e+02 1.402e+03 3.208e+03, threshold=1.935e+03, percent-clipped=13.0 2023-06-24 16:58:03,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1791432.0, ans=0.0 2023-06-24 16:58:07,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1791432.0, ans=0.0 2023-06-24 16:58:08,426 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. limit=15.0 2023-06-24 16:58:21,011 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.16 vs. limit=12.0 2023-06-24 16:58:52,123 INFO [train.py:996] (2/4) Epoch 10, batch 24150, loss[loss=0.2701, simple_loss=0.3283, pruned_loss=0.1059, over 21323.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3221, pruned_loss=0.08632, over 4272421.71 frames. ], batch size: 159, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:00:02,200 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.80 vs. limit=15.0 2023-06-24 17:00:30,316 INFO [train.py:996] (2/4) Epoch 10, batch 24200, loss[loss=0.2641, simple_loss=0.3431, pruned_loss=0.09257, over 21701.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.324, pruned_loss=0.08721, over 4271831.47 frames. ], batch size: 351, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:00:41,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1791912.0, ans=0.0 2023-06-24 17:00:43,763 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=22.5 2023-06-24 17:00:49,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1791972.0, ans=0.0 2023-06-24 17:01:11,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1792032.0, ans=0.5 2023-06-24 17:01:14,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1792032.0, ans=0.0 2023-06-24 17:01:15,899 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.664e+02 7.347e+02 1.079e+03 1.481e+03 2.381e+03, threshold=2.159e+03, percent-clipped=5.0 2023-06-24 17:01:22,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1792032.0, ans=0.04949747468305833 2023-06-24 17:01:23,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=12.0 2023-06-24 17:02:14,120 INFO [train.py:996] (2/4) Epoch 10, batch 24250, loss[loss=0.2505, simple_loss=0.3635, pruned_loss=0.06877, over 19902.00 frames. 
], tot_loss[loss=0.2411, simple_loss=0.3206, pruned_loss=0.08084, over 4276295.11 frames. ], batch size: 702, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:02:30,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1792212.0, ans=0.1 2023-06-24 17:02:36,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1792272.0, ans=0.2 2023-06-24 17:02:41,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1792272.0, ans=0.0 2023-06-24 17:03:02,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1792332.0, ans=0.2 2023-06-24 17:03:10,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1792392.0, ans=0.125 2023-06-24 17:03:56,596 INFO [train.py:996] (2/4) Epoch 10, batch 24300, loss[loss=0.1997, simple_loss=0.2667, pruned_loss=0.06635, over 21706.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3126, pruned_loss=0.07458, over 4278089.93 frames. ], batch size: 112, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:04:07,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1792512.0, ans=0.1 2023-06-24 17:04:13,303 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.74 vs. limit=15.0 2023-06-24 17:04:17,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1792572.0, ans=0.0 2023-06-24 17:04:32,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-24 17:04:32,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.370e+02 5.749e+02 8.681e+02 1.337e+03 2.668e+03, threshold=1.736e+03, percent-clipped=3.0 2023-06-24 17:04:39,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1792632.0, ans=0.2 2023-06-24 17:05:29,312 INFO [train.py:996] (2/4) Epoch 10, batch 24350, loss[loss=0.2189, simple_loss=0.2893, pruned_loss=0.07422, over 21653.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3093, pruned_loss=0.07461, over 4276275.71 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:05:37,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1792812.0, ans=0.1 2023-06-24 17:05:39,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.06 vs. limit=22.5 2023-06-24 17:06:42,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1792992.0, ans=0.0 2023-06-24 17:07:07,861 INFO [train.py:996] (2/4) Epoch 10, batch 24400, loss[loss=0.2581, simple_loss=0.3298, pruned_loss=0.09323, over 21818.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3125, pruned_loss=0.07747, over 4268698.20 frames. 
], batch size: 441, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:07:52,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.844e+02 8.479e+02 1.173e+03 1.615e+03 2.996e+03, threshold=2.346e+03, percent-clipped=19.0 2023-06-24 17:08:11,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1793292.0, ans=0.02 2023-06-24 17:08:24,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1793292.0, ans=0.125 2023-06-24 17:08:30,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1793292.0, ans=0.0 2023-06-24 17:08:33,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1793352.0, ans=0.1 2023-06-24 17:08:48,994 INFO [train.py:996] (2/4) Epoch 10, batch 24450, loss[loss=0.2188, simple_loss=0.288, pruned_loss=0.07479, over 20101.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3153, pruned_loss=0.07961, over 4268449.75 frames. ], batch size: 707, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:08:49,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1793412.0, ans=0.125 2023-06-24 17:09:23,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1793472.0, ans=0.2 2023-06-24 17:09:44,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1793592.0, ans=0.125 2023-06-24 17:10:26,060 INFO [train.py:996] (2/4) Epoch 10, batch 24500, loss[loss=0.2609, simple_loss=0.3338, pruned_loss=0.09402, over 21715.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3162, pruned_loss=0.08013, over 4271898.87 frames. ], batch size: 389, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:10:45,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1793772.0, ans=0.125 2023-06-24 17:10:55,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1793772.0, ans=0.125 2023-06-24 17:11:07,633 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.494e+02 6.645e+02 1.000e+03 1.722e+03 3.391e+03, threshold=2.001e+03, percent-clipped=7.0 2023-06-24 17:11:14,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1793832.0, ans=0.125 2023-06-24 17:11:59,896 INFO [train.py:996] (2/4) Epoch 10, batch 24550, loss[loss=0.2714, simple_loss=0.3416, pruned_loss=0.1006, over 21535.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3195, pruned_loss=0.08241, over 4278138.83 frames. ], batch size: 194, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:12:28,809 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.28 vs. 
limit=15.0 2023-06-24 17:12:36,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1794132.0, ans=0.125 2023-06-24 17:12:56,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1794132.0, ans=0.125 2023-06-24 17:13:37,075 INFO [train.py:996] (2/4) Epoch 10, batch 24600, loss[loss=0.234, simple_loss=0.3667, pruned_loss=0.05066, over 19844.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3166, pruned_loss=0.08314, over 4276242.22 frames. ], batch size: 702, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:14:28,287 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.632e+02 7.490e+02 1.116e+03 1.562e+03 6.451e+03, threshold=2.232e+03, percent-clipped=18.0 2023-06-24 17:15:15,835 INFO [train.py:996] (2/4) Epoch 10, batch 24650, loss[loss=0.187, simple_loss=0.2546, pruned_loss=0.05973, over 21587.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3093, pruned_loss=0.08165, over 4274626.26 frames. ], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:15:41,796 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:15:48,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-24 17:16:11,383 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:16:53,143 INFO [train.py:996] (2/4) Epoch 10, batch 24700, loss[loss=0.1753, simple_loss=0.2362, pruned_loss=0.0572, over 20819.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.306, pruned_loss=0.0802, over 4271329.22 frames. ], batch size: 609, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:17:48,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1795032.0, ans=0.2 2023-06-24 17:17:49,168 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 6.249e+02 8.587e+02 1.281e+03 3.151e+03, threshold=1.717e+03, percent-clipped=6.0 2023-06-24 17:17:56,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1795032.0, ans=0.125 2023-06-24 17:17:59,756 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-24 17:18:00,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1795092.0, ans=0.0 2023-06-24 17:18:18,049 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.70 vs. limit=22.5 2023-06-24 17:18:28,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1795152.0, ans=0.0 2023-06-24 17:18:31,297 INFO [train.py:996] (2/4) Epoch 10, batch 24750, loss[loss=0.2135, simple_loss=0.2808, pruned_loss=0.07308, over 21627.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2992, pruned_loss=0.07761, over 4271142.77 frames. 
], batch size: 298, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:18:46,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=1795272.0, ans=22.5 2023-06-24 17:19:09,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1795272.0, ans=0.125 2023-06-24 17:20:02,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1795452.0, ans=0.125 2023-06-24 17:20:03,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1795452.0, ans=0.0 2023-06-24 17:20:05,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.55 vs. limit=15.0 2023-06-24 17:20:07,376 INFO [train.py:996] (2/4) Epoch 10, batch 24800, loss[loss=0.2381, simple_loss=0.3074, pruned_loss=0.0844, over 21846.00 frames. ], tot_loss[loss=0.2231, simple_loss=0.293, pruned_loss=0.07656, over 4278908.45 frames. ], batch size: 124, lr: 2.89e-03, grad_scale: 32.0 2023-06-24 17:20:26,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1795572.0, ans=0.125 2023-06-24 17:20:54,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1795632.0, ans=0.125 2023-06-24 17:20:59,562 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.326e+02 5.922e+02 8.431e+02 1.285e+03 2.453e+03, threshold=1.686e+03, percent-clipped=12.0 2023-06-24 17:21:19,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=15.0 2023-06-24 17:21:20,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1795692.0, ans=0.125 2023-06-24 17:21:33,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1795752.0, ans=0.1 2023-06-24 17:21:45,136 INFO [train.py:996] (2/4) Epoch 10, batch 24850, loss[loss=0.2652, simple_loss=0.3353, pruned_loss=0.09757, over 21864.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.294, pruned_loss=0.07776, over 4280737.27 frames. ], batch size: 371, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:21:45,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1795812.0, ans=0.125 2023-06-24 17:22:56,413 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.79 vs. limit=15.0 2023-06-24 17:23:03,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1795992.0, ans=0.0 2023-06-24 17:23:22,411 INFO [train.py:996] (2/4) Epoch 10, batch 24900, loss[loss=0.2294, simple_loss=0.2862, pruned_loss=0.08634, over 20182.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2957, pruned_loss=0.078, over 4284812.85 frames. 
], batch size: 703, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:23:35,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1796112.0, ans=0.2 2023-06-24 17:23:39,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1796112.0, ans=0.125 2023-06-24 17:23:48,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_na.min_abs, batch_count=1796172.0, ans=0.02 2023-06-24 17:24:20,584 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.038e+02 8.107e+02 1.257e+03 1.989e+03 3.453e+03, threshold=2.514e+03, percent-clipped=33.0 2023-06-24 17:24:38,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1796292.0, ans=0.125 2023-06-24 17:24:57,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1796352.0, ans=0.125 2023-06-24 17:24:59,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1796412.0, ans=0.1 2023-06-24 17:25:00,298 INFO [train.py:996] (2/4) Epoch 10, batch 24950, loss[loss=0.247, simple_loss=0.3199, pruned_loss=0.08705, over 21390.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3047, pruned_loss=0.08204, over 4282883.44 frames. ], batch size: 549, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:25:19,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-24 17:25:58,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1796532.0, ans=0.1 2023-06-24 17:26:24,847 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1796652.0, ans=0.2 2023-06-24 17:26:43,381 INFO [train.py:996] (2/4) Epoch 10, batch 25000, loss[loss=0.2154, simple_loss=0.2829, pruned_loss=0.0739, over 21649.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3101, pruned_loss=0.08367, over 4276635.76 frames. ], batch size: 332, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:26:43,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1796712.0, ans=0.125 2023-06-24 17:27:18,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1796772.0, ans=0.0 2023-06-24 17:27:18,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1796772.0, ans=0.2 2023-06-24 17:27:37,645 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 7.141e+02 9.275e+02 1.381e+03 2.945e+03, threshold=1.855e+03, percent-clipped=4.0 2023-06-24 17:27:53,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1796892.0, ans=0.2 2023-06-24 17:28:31,901 INFO [train.py:996] (2/4) Epoch 10, batch 25050, loss[loss=0.2467, simple_loss=0.3045, pruned_loss=0.09449, over 21590.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3045, pruned_loss=0.08235, over 4270691.13 frames. 
], batch size: 415, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:28:50,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1797072.0, ans=0.125 2023-06-24 17:29:00,576 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-24 17:29:05,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1797132.0, ans=0.2 2023-06-24 17:29:10,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1797132.0, ans=0.0 2023-06-24 17:29:31,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-24 17:30:03,456 INFO [train.py:996] (2/4) Epoch 10, batch 25100, loss[loss=0.207, simple_loss=0.2701, pruned_loss=0.07192, over 21526.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2985, pruned_loss=0.08049, over 4278462.29 frames. ], batch size: 391, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:30:29,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1797372.0, ans=0.1 2023-06-24 17:30:39,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1797372.0, ans=0.125 2023-06-24 17:30:47,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1797432.0, ans=0.0 2023-06-24 17:30:47,830 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:30:51,806 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.233e+02 6.255e+02 9.404e+02 1.549e+03 2.850e+03, threshold=1.881e+03, percent-clipped=12.0 2023-06-24 17:30:53,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1797432.0, ans=0.0 2023-06-24 17:30:53,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1797432.0, ans=0.1 2023-06-24 17:31:35,822 INFO [train.py:996] (2/4) Epoch 10, batch 25150, loss[loss=0.1986, simple_loss=0.2927, pruned_loss=0.05226, over 21405.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3015, pruned_loss=0.07844, over 4262992.54 frames. ], batch size: 211, lr: 2.89e-03, grad_scale: 8.0 2023-06-24 17:32:22,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1797732.0, ans=0.0 2023-06-24 17:33:12,420 INFO [train.py:996] (2/4) Epoch 10, batch 25200, loss[loss=0.195, simple_loss=0.2573, pruned_loss=0.06638, over 20247.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3023, pruned_loss=0.07706, over 4265859.40 frames. ], batch size: 703, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:34:00,966 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.255e+02 6.948e+02 1.057e+03 1.508e+03 2.758e+03, threshold=2.115e+03, percent-clipped=16.0 2023-06-24 17:34:49,119 INFO [train.py:996] (2/4) Epoch 10, batch 25250, loss[loss=0.2142, simple_loss=0.272, pruned_loss=0.07824, over 21230.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3, pruned_loss=0.07565, over 4267204.08 frames. 
], batch size: 176, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:36:04,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1798452.0, ans=0.2 2023-06-24 17:36:26,953 INFO [train.py:996] (2/4) Epoch 10, batch 25300, loss[loss=0.2232, simple_loss=0.298, pruned_loss=0.07418, over 21782.00 frames. ], tot_loss[loss=0.224, simple_loss=0.298, pruned_loss=0.07501, over 4270111.02 frames. ], batch size: 282, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:36:52,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1798572.0, ans=0.1 2023-06-24 17:36:54,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1798572.0, ans=0.125 2023-06-24 17:37:00,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=7.35 vs. limit=12.0 2023-06-24 17:37:11,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1798632.0, ans=0.1 2023-06-24 17:37:15,746 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.390e+02 6.891e+02 9.341e+02 1.406e+03 3.031e+03, threshold=1.868e+03, percent-clipped=2.0 2023-06-24 17:37:22,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1798692.0, ans=0.0 2023-06-24 17:37:24,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1798692.0, ans=0.125 2023-06-24 17:37:31,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.16 vs. limit=22.5 2023-06-24 17:38:10,231 INFO [train.py:996] (2/4) Epoch 10, batch 25350, loss[loss=0.2049, simple_loss=0.2852, pruned_loss=0.06232, over 21819.00 frames. ], tot_loss[loss=0.225, simple_loss=0.3, pruned_loss=0.07505, over 4265956.73 frames. ], batch size: 107, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:38:17,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1798812.0, ans=0.02 2023-06-24 17:38:20,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1798812.0, ans=0.125 2023-06-24 17:38:48,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1798932.0, ans=0.2 2023-06-24 17:39:07,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1798992.0, ans=0.04949747468305833 2023-06-24 17:39:33,357 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1799052.0, ans=0.1 2023-06-24 17:39:42,334 INFO [train.py:996] (2/4) Epoch 10, batch 25400, loss[loss=0.1924, simple_loss=0.2866, pruned_loss=0.04907, over 21623.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.297, pruned_loss=0.07444, over 4245270.49 frames. 
], batch size: 263, lr: 2.89e-03, grad_scale: 16.0 2023-06-24 17:40:31,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.306e+02 6.316e+02 8.896e+02 1.149e+03 2.761e+03, threshold=1.779e+03, percent-clipped=5.0 2023-06-24 17:40:36,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1799232.0, ans=0.0 2023-06-24 17:41:24,897 INFO [train.py:996] (2/4) Epoch 10, batch 25450, loss[loss=0.1974, simple_loss=0.279, pruned_loss=0.05786, over 21496.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2966, pruned_loss=0.07539, over 4254104.72 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:42:13,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-24 17:42:17,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1799592.0, ans=0.5 2023-06-24 17:42:21,169 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1799592.0, ans=0.0 2023-06-24 17:42:30,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1799592.0, ans=0.2 2023-06-24 17:42:38,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1799592.0, ans=0.1 2023-06-24 17:42:51,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1799652.0, ans=0.125 2023-06-24 17:43:04,173 INFO [train.py:996] (2/4) Epoch 10, batch 25500, loss[loss=0.1982, simple_loss=0.2837, pruned_loss=0.0563, over 21293.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2976, pruned_loss=0.07315, over 4254990.20 frames. ], batch size: 176, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:43:12,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1799712.0, ans=0.125 2023-06-24 17:43:25,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1799772.0, ans=0.04949747468305833 2023-06-24 17:43:32,792 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:43:43,841 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.327e+02 6.563e+02 1.058e+03 1.442e+03 3.790e+03, threshold=2.117e+03, percent-clipped=15.0 2023-06-24 17:44:13,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1799892.0, ans=0.125 2023-06-24 17:44:39,388 INFO [train.py:996] (2/4) Epoch 10, batch 25550, loss[loss=0.2478, simple_loss=0.3486, pruned_loss=0.07356, over 21708.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3039, pruned_loss=0.07295, over 4243612.75 frames. 
], batch size: 441, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:44:46,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1800012.0, ans=0.1 2023-06-24 17:45:02,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1800072.0, ans=0.2 2023-06-24 17:46:06,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1800252.0, ans=0.125 2023-06-24 17:46:14,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1800252.0, ans=10.0 2023-06-24 17:46:17,025 INFO [train.py:996] (2/4) Epoch 10, batch 25600, loss[loss=0.2454, simple_loss=0.3233, pruned_loss=0.08375, over 21604.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3072, pruned_loss=0.07327, over 4232444.12 frames. ], batch size: 414, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 17:46:24,429 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.25 vs. limit=22.5 2023-06-24 17:46:58,009 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.521e+02 8.019e+02 1.314e+03 1.703e+03 3.186e+03, threshold=2.628e+03, percent-clipped=13.0 2023-06-24 17:47:54,765 INFO [train.py:996] (2/4) Epoch 10, batch 25650, loss[loss=0.2042, simple_loss=0.2703, pruned_loss=0.06904, over 21335.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3076, pruned_loss=0.07577, over 4238729.02 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:47:55,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1800612.0, ans=0.0 2023-06-24 17:48:06,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1800612.0, ans=0.125 2023-06-24 17:48:09,764 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.84 vs. limit=10.0 2023-06-24 17:48:24,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=1800732.0, ans=0.2 2023-06-24 17:49:28,874 INFO [train.py:996] (2/4) Epoch 10, batch 25700, loss[loss=0.1999, simple_loss=0.2854, pruned_loss=0.05721, over 21784.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3051, pruned_loss=0.07657, over 4239881.55 frames. 
], batch size: 282, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:49:49,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1800972.0, ans=0.125 2023-06-24 17:50:16,725 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 8.817e+02 1.358e+03 2.096e+03 4.463e+03, threshold=2.717e+03, percent-clipped=14.0 2023-06-24 17:50:18,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1801032.0, ans=0.125 2023-06-24 17:50:37,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1801092.0, ans=0.125 2023-06-24 17:50:45,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1801152.0, ans=0.2 2023-06-24 17:50:55,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1801152.0, ans=0.0 2023-06-24 17:51:03,503 INFO [train.py:996] (2/4) Epoch 10, batch 25750, loss[loss=0.294, simple_loss=0.3861, pruned_loss=0.1009, over 21909.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.31, pruned_loss=0.07936, over 4243457.86 frames. ], batch size: 316, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:52:13,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1801392.0, ans=0.0 2023-06-24 17:52:37,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1801512.0, ans=0.0 2023-06-24 17:52:38,269 INFO [train.py:996] (2/4) Epoch 10, batch 25800, loss[loss=0.2817, simple_loss=0.3504, pruned_loss=0.1065, over 21432.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3207, pruned_loss=0.08376, over 4251408.68 frames. ], batch size: 211, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:53:01,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1801512.0, ans=0.0 2023-06-24 17:53:35,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1801632.0, ans=0.125 2023-06-24 17:53:40,820 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.838e+02 7.699e+02 1.038e+03 1.788e+03 4.629e+03, threshold=2.076e+03, percent-clipped=8.0 2023-06-24 17:53:50,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1801692.0, ans=0.2 2023-06-24 17:54:15,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1801812.0, ans=0.125 2023-06-24 17:54:17,136 INFO [train.py:996] (2/4) Epoch 10, batch 25850, loss[loss=0.2558, simple_loss=0.3159, pruned_loss=0.09784, over 21780.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3223, pruned_loss=0.08308, over 4260867.90 frames. 
], batch size: 441, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:54:36,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1801812.0, ans=0.125 2023-06-24 17:55:06,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1801932.0, ans=0.1 2023-06-24 17:55:18,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1801992.0, ans=0.2 2023-06-24 17:55:21,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1801992.0, ans=0.125 2023-06-24 17:56:01,198 INFO [train.py:996] (2/4) Epoch 10, batch 25900, loss[loss=0.2671, simple_loss=0.3257, pruned_loss=0.1043, over 21637.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3271, pruned_loss=0.0857, over 4273172.23 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:56:30,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1802172.0, ans=0.0 2023-06-24 17:56:34,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-24 17:56:46,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1802232.0, ans=0.125 2023-06-24 17:56:53,707 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.611e+02 7.173e+02 1.132e+03 1.463e+03 2.574e+03, threshold=2.264e+03, percent-clipped=5.0 2023-06-24 17:57:03,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.40 vs. limit=15.0 2023-06-24 17:57:45,023 INFO [train.py:996] (2/4) Epoch 10, batch 25950, loss[loss=0.2548, simple_loss=0.3277, pruned_loss=0.09094, over 21582.00 frames. ], tot_loss[loss=0.2551, simple_loss=0.333, pruned_loss=0.08857, over 4274465.87 frames. ], batch size: 389, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 17:57:45,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.44 vs. limit=15.0 2023-06-24 17:58:02,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1802472.0, ans=0.125 2023-06-24 17:58:07,038 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 17:58:22,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1802532.0, ans=0.1 2023-06-24 17:58:26,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1802532.0, ans=0.125 2023-06-24 17:59:00,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1802652.0, ans=0.0 2023-06-24 17:59:17,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1802652.0, ans=0.0 2023-06-24 17:59:23,435 INFO [train.py:996] (2/4) Epoch 10, batch 26000, loss[loss=0.287, simple_loss=0.3624, pruned_loss=0.1058, over 21709.00 frames. ], tot_loss[loss=0.254, simple_loss=0.3335, pruned_loss=0.0872, over 4269033.75 frames. 
], batch size: 351, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:00:02,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1802832.0, ans=0.125 2023-06-24 18:00:06,703 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.493e+02 7.586e+02 1.074e+03 1.522e+03 3.008e+03, threshold=2.148e+03, percent-clipped=6.0 2023-06-24 18:00:32,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-24 18:00:40,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1802952.0, ans=0.0 2023-06-24 18:00:56,213 INFO [train.py:996] (2/4) Epoch 10, batch 26050, loss[loss=0.2481, simple_loss=0.3136, pruned_loss=0.09129, over 21853.00 frames. ], tot_loss[loss=0.2538, simple_loss=0.3329, pruned_loss=0.08728, over 4268106.81 frames. ], batch size: 298, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:00:59,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1803012.0, ans=0.95 2023-06-24 18:00:59,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1803012.0, ans=0.0 2023-06-24 18:02:32,709 INFO [train.py:996] (2/4) Epoch 10, batch 26100, loss[loss=0.2655, simple_loss=0.3382, pruned_loss=0.09642, over 21773.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3261, pruned_loss=0.08631, over 4277174.46 frames. ], batch size: 112, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:03:15,403 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.148e+02 7.025e+02 9.795e+02 1.501e+03 3.322e+03, threshold=1.959e+03, percent-clipped=9.0 2023-06-24 18:03:24,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.28 vs. limit=22.5 2023-06-24 18:04:05,710 INFO [train.py:996] (2/4) Epoch 10, batch 26150, loss[loss=0.2378, simple_loss=0.306, pruned_loss=0.08486, over 21801.00 frames. ], tot_loss[loss=0.2476, simple_loss=0.3223, pruned_loss=0.08644, over 4282000.46 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:04:07,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1803612.0, ans=0.125 2023-06-24 18:04:43,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.65 vs. limit=6.0 2023-06-24 18:05:20,110 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-24 18:05:43,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1803912.0, ans=0.04949747468305833 2023-06-24 18:05:44,637 INFO [train.py:996] (2/4) Epoch 10, batch 26200, loss[loss=0.2639, simple_loss=0.3634, pruned_loss=0.08218, over 21632.00 frames. ], tot_loss[loss=0.246, simple_loss=0.3228, pruned_loss=0.08457, over 4284329.86 frames. 
], batch size: 389, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:06:01,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1803972.0, ans=0.0 2023-06-24 18:06:36,095 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.529e+02 6.225e+02 8.349e+02 1.175e+03 2.397e+03, threshold=1.670e+03, percent-clipped=3.0 2023-06-24 18:07:01,620 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.05 vs. limit=15.0 2023-06-24 18:07:02,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1804152.0, ans=0.125 2023-06-24 18:07:08,990 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=22.5 2023-06-24 18:07:17,717 INFO [train.py:996] (2/4) Epoch 10, batch 26250, loss[loss=0.2688, simple_loss=0.3392, pruned_loss=0.09918, over 21916.00 frames. ], tot_loss[loss=0.2471, simple_loss=0.3267, pruned_loss=0.08377, over 4278226.03 frames. ], batch size: 124, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:07:25,896 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1804212.0, ans=0.0 2023-06-24 18:07:27,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1804212.0, ans=0.1 2023-06-24 18:07:54,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.27 vs. limit=15.0 2023-06-24 18:08:25,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1804392.0, ans=0.0 2023-06-24 18:08:53,829 INFO [train.py:996] (2/4) Epoch 10, batch 26300, loss[loss=0.2461, simple_loss=0.3139, pruned_loss=0.08916, over 21404.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3245, pruned_loss=0.08466, over 4282837.65 frames. ], batch size: 144, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:09:10,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1804572.0, ans=0.125 2023-06-24 18:09:51,136 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.353e+02 7.232e+02 9.217e+02 1.300e+03 2.808e+03, threshold=1.843e+03, percent-clipped=15.0 2023-06-24 18:09:53,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1804632.0, ans=0.2 2023-06-24 18:09:55,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.04 vs. limit=22.5 2023-06-24 18:10:04,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1804692.0, ans=0.05 2023-06-24 18:10:14,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.82 vs. 
limit=22.5 2023-06-24 18:10:26,798 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1804812.0, ans=0.125 2023-06-24 18:10:27,969 INFO [train.py:996] (2/4) Epoch 10, batch 26350, loss[loss=0.2922, simple_loss=0.3562, pruned_loss=0.1141, over 21785.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3222, pruned_loss=0.08505, over 4285168.71 frames. ], batch size: 441, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:10:30,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1804812.0, ans=0.125 2023-06-24 18:10:47,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1804872.0, ans=0.1 2023-06-24 18:11:26,924 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.19 vs. limit=15.0 2023-06-24 18:11:37,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1804992.0, ans=15.0 2023-06-24 18:11:49,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1805052.0, ans=0.125 2023-06-24 18:11:52,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1805052.0, ans=0.125 2023-06-24 18:12:00,214 INFO [train.py:996] (2/4) Epoch 10, batch 26400, loss[loss=0.1947, simple_loss=0.2623, pruned_loss=0.06355, over 21762.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3182, pruned_loss=0.08602, over 4270557.50 frames. ], batch size: 351, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:12:17,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1805112.0, ans=0.1 2023-06-24 18:12:43,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1805172.0, ans=0.125 2023-06-24 18:12:45,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1805172.0, ans=0.2 2023-06-24 18:12:54,024 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-24 18:13:04,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.504e+02 7.130e+02 9.166e+02 1.346e+03 2.893e+03, threshold=1.833e+03, percent-clipped=10.0 2023-06-24 18:13:44,721 INFO [train.py:996] (2/4) Epoch 10, batch 26450, loss[loss=0.2721, simple_loss=0.3995, pruned_loss=0.0724, over 20794.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3182, pruned_loss=0.08543, over 4269663.81 frames. ], batch size: 607, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:14:12,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1805412.0, ans=0.125 2023-06-24 18:14:16,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.49 vs. limit=15.0 2023-06-24 18:14:17,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. 
limit=6.0 2023-06-24 18:14:17,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.54 vs. limit=22.5 2023-06-24 18:15:33,617 INFO [train.py:996] (2/4) Epoch 10, batch 26500, loss[loss=0.2098, simple_loss=0.2868, pruned_loss=0.06642, over 21764.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3182, pruned_loss=0.0827, over 4264994.74 frames. ], batch size: 247, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:15:45,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1805712.0, ans=0.125 2023-06-24 18:16:19,140 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.291e+02 8.492e+02 1.713e+03 2.381e+03 4.815e+03, threshold=3.427e+03, percent-clipped=46.0 2023-06-24 18:16:57,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1805952.0, ans=0.0 2023-06-24 18:17:13,482 INFO [train.py:996] (2/4) Epoch 10, batch 26550, loss[loss=0.2016, simple_loss=0.3065, pruned_loss=0.04834, over 21640.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3156, pruned_loss=0.08032, over 4262704.94 frames. ], batch size: 414, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:17:30,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.91 vs. limit=15.0 2023-06-24 18:17:33,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1806072.0, ans=0.0 2023-06-24 18:17:49,027 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1806132.0, ans=0.09899494936611666 2023-06-24 18:18:29,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-24 18:18:51,159 INFO [train.py:996] (2/4) Epoch 10, batch 26600, loss[loss=0.202, simple_loss=0.2794, pruned_loss=0.0623, over 21605.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3137, pruned_loss=0.07739, over 4256017.93 frames. ], batch size: 298, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:18:56,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=15.0 2023-06-24 18:19:07,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1806372.0, ans=0.125 2023-06-24 18:19:50,174 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.252e+02 6.735e+02 9.367e+02 1.418e+03 2.947e+03, threshold=1.873e+03, percent-clipped=0.0 2023-06-24 18:20:00,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1806492.0, ans=0.0 2023-06-24 18:20:27,314 INFO [train.py:996] (2/4) Epoch 10, batch 26650, loss[loss=0.2154, simple_loss=0.293, pruned_loss=0.06888, over 21640.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3072, pruned_loss=0.07643, over 4263110.25 frames. 
], batch size: 415, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:20:46,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1806672.0, ans=0.125 2023-06-24 18:21:01,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1806732.0, ans=0.125 2023-06-24 18:21:01,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1806732.0, ans=0.125 2023-06-24 18:21:19,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1806732.0, ans=0.0 2023-06-24 18:21:41,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1806792.0, ans=0.125 2023-06-24 18:22:04,425 INFO [train.py:996] (2/4) Epoch 10, batch 26700, loss[loss=0.2126, simple_loss=0.2813, pruned_loss=0.07195, over 21538.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.3006, pruned_loss=0.07346, over 4267775.72 frames. ], batch size: 212, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:22:06,545 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:22:39,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=15.0 2023-06-24 18:23:03,956 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.609e+02 5.904e+02 8.895e+02 1.266e+03 2.611e+03, threshold=1.779e+03, percent-clipped=6.0 2023-06-24 18:23:37,431 INFO [train.py:996] (2/4) Epoch 10, batch 26750, loss[loss=0.2597, simple_loss=0.3322, pruned_loss=0.09363, over 21313.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3004, pruned_loss=0.07261, over 4276933.41 frames. ], batch size: 159, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:23:47,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1807212.0, ans=0.125 2023-06-24 18:24:59,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1807452.0, ans=0.1 2023-06-24 18:25:17,206 INFO [train.py:996] (2/4) Epoch 10, batch 26800, loss[loss=0.204, simple_loss=0.2759, pruned_loss=0.06607, over 21923.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3079, pruned_loss=0.07634, over 4270823.84 frames. 
], batch size: 98, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:25:55,803 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:25:58,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1807572.0, ans=0.0 2023-06-24 18:26:08,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1807632.0, ans=0.1 2023-06-24 18:26:16,848 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.147e+02 7.518e+02 9.832e+02 1.417e+03 2.844e+03, threshold=1.966e+03, percent-clipped=8.0 2023-06-24 18:26:29,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1807692.0, ans=0.125 2023-06-24 18:26:59,134 INFO [train.py:996] (2/4) Epoch 10, batch 26850, loss[loss=0.2367, simple_loss=0.2969, pruned_loss=0.08826, over 21389.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3082, pruned_loss=0.0788, over 4270363.22 frames. ], batch size: 131, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:27:04,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1807812.0, ans=0.0 2023-06-24 18:27:28,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=12.0 2023-06-24 18:27:48,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-24 18:27:53,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1807932.0, ans=0.5 2023-06-24 18:28:02,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1807992.0, ans=0.05 2023-06-24 18:28:30,937 INFO [train.py:996] (2/4) Epoch 10, batch 26900, loss[loss=0.1919, simple_loss=0.2547, pruned_loss=0.06455, over 21661.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3, pruned_loss=0.07821, over 4269377.66 frames. ], batch size: 282, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:28:51,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1808112.0, ans=0.025 2023-06-24 18:29:03,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1808172.0, ans=0.0 2023-06-24 18:29:21,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1808232.0, ans=0.0 2023-06-24 18:29:30,463 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.980e+02 7.494e+02 9.472e+02 1.508e+03 3.136e+03, threshold=1.894e+03, percent-clipped=8.0 2023-06-24 18:30:07,713 INFO [train.py:996] (2/4) Epoch 10, batch 26950, loss[loss=0.2533, simple_loss=0.3367, pruned_loss=0.08495, over 21428.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2992, pruned_loss=0.07777, over 4272762.28 frames. ], batch size: 194, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:30:45,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.91 vs. 
limit=22.5 2023-06-24 18:31:19,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1808592.0, ans=0.2 2023-06-24 18:31:54,840 INFO [train.py:996] (2/4) Epoch 10, batch 27000, loss[loss=0.1797, simple_loss=0.267, pruned_loss=0.04624, over 21353.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3002, pruned_loss=0.0759, over 4269343.54 frames. ], batch size: 194, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:31:54,840 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 18:32:16,364 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2412, simple_loss=0.3374, pruned_loss=0.07247, over 1796401.00 frames. 2023-06-24 18:32:16,365 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 18:32:26,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1808712.0, ans=0.1 2023-06-24 18:32:45,542 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:33:08,818 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.928e+02 6.448e+02 9.390e+02 1.351e+03 2.937e+03, threshold=1.878e+03, percent-clipped=11.0 2023-06-24 18:33:55,939 INFO [train.py:996] (2/4) Epoch 10, batch 27050, loss[loss=0.2124, simple_loss=0.3023, pruned_loss=0.0612, over 21095.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3033, pruned_loss=0.07298, over 4273402.77 frames. ], batch size: 607, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:34:13,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1809072.0, ans=0.0 2023-06-24 18:34:19,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1809072.0, ans=0.125 2023-06-24 18:34:43,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1809132.0, ans=0.0 2023-06-24 18:35:32,515 INFO [train.py:996] (2/4) Epoch 10, batch 27100, loss[loss=0.1902, simple_loss=0.2646, pruned_loss=0.0579, over 21796.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3047, pruned_loss=0.07374, over 4279583.82 frames. ], batch size: 102, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:35:46,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.94 vs. limit=22.5 2023-06-24 18:36:20,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1809432.0, ans=0.125 2023-06-24 18:36:20,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1809432.0, ans=0.2 2023-06-24 18:36:24,773 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.409e+02 6.272e+02 8.340e+02 1.180e+03 2.454e+03, threshold=1.668e+03, percent-clipped=3.0 2023-06-24 18:36:49,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1809552.0, ans=0.125 2023-06-24 18:37:10,881 INFO [train.py:996] (2/4) Epoch 10, batch 27150, loss[loss=0.258, simple_loss=0.3543, pruned_loss=0.08088, over 21757.00 frames. ], tot_loss[loss=0.237, simple_loss=0.318, pruned_loss=0.07796, over 4275065.69 frames. 
], batch size: 332, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:37:23,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1809612.0, ans=0.0 2023-06-24 18:37:30,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1809672.0, ans=0.125 2023-06-24 18:37:47,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1809732.0, ans=0.0 2023-06-24 18:38:41,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1809852.0, ans=0.1 2023-06-24 18:38:49,496 INFO [train.py:996] (2/4) Epoch 10, batch 27200, loss[loss=0.2991, simple_loss=0.3712, pruned_loss=0.1135, over 21444.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3266, pruned_loss=0.08101, over 4273166.33 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:39:01,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1809912.0, ans=0.125 2023-06-24 18:39:02,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1809912.0, ans=0.025 2023-06-24 18:39:56,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.592e+02 7.616e+02 1.186e+03 1.800e+03 4.357e+03, threshold=2.372e+03, percent-clipped=30.0 2023-06-24 18:40:01,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=7.83 vs. limit=15.0 2023-06-24 18:40:27,551 INFO [train.py:996] (2/4) Epoch 10, batch 27250, loss[loss=0.2681, simple_loss=0.342, pruned_loss=0.09711, over 21627.00 frames. ], tot_loss[loss=0.2488, simple_loss=0.3287, pruned_loss=0.08447, over 4277855.35 frames. ], batch size: 263, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:40:50,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1810272.0, ans=0.0 2023-06-24 18:40:59,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1810272.0, ans=0.125 2023-06-24 18:41:11,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1810272.0, ans=0.1 2023-06-24 18:41:13,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1810332.0, ans=0.0 2023-06-24 18:42:12,490 INFO [train.py:996] (2/4) Epoch 10, batch 27300, loss[loss=0.2081, simple_loss=0.2729, pruned_loss=0.07159, over 16629.00 frames. ], tot_loss[loss=0.2507, simple_loss=0.3302, pruned_loss=0.08559, over 4275447.07 frames. 
], batch size: 60, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:42:58,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1810572.0, ans=0.0 2023-06-24 18:43:18,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.724e+02 6.828e+02 8.478e+02 1.184e+03 2.294e+03, threshold=1.696e+03, percent-clipped=0.0 2023-06-24 18:43:28,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1810692.0, ans=0.2 2023-06-24 18:43:41,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1810752.0, ans=0.0 2023-06-24 18:43:55,760 INFO [train.py:996] (2/4) Epoch 10, batch 27350, loss[loss=0.2864, simple_loss=0.3671, pruned_loss=0.1028, over 21524.00 frames. ], tot_loss[loss=0.2524, simple_loss=0.3333, pruned_loss=0.08576, over 4275035.23 frames. ], batch size: 471, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:44:44,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1810872.0, ans=0.0 2023-06-24 18:45:22,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1811052.0, ans=0.0 2023-06-24 18:45:36,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1811052.0, ans=0.035 2023-06-24 18:45:48,112 INFO [train.py:996] (2/4) Epoch 10, batch 27400, loss[loss=0.2242, simple_loss=0.2918, pruned_loss=0.07833, over 21812.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3281, pruned_loss=0.08526, over 4281240.81 frames. ], batch size: 118, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:46:31,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1811232.0, ans=0.0 2023-06-24 18:46:49,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.586e+02 6.744e+02 9.251e+02 1.244e+03 3.904e+03, threshold=1.850e+03, percent-clipped=13.0 2023-06-24 18:47:11,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1811352.0, ans=0.2 2023-06-24 18:47:13,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811352.0, ans=0.1 2023-06-24 18:47:39,087 INFO [train.py:996] (2/4) Epoch 10, batch 27450, loss[loss=0.22, simple_loss=0.3052, pruned_loss=0.06741, over 21410.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3223, pruned_loss=0.08421, over 4280243.20 frames. 
], batch size: 194, lr: 2.88e-03, grad_scale: 16.0 2023-06-24 18:48:09,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1811472.0, ans=0.1 2023-06-24 18:48:32,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1811532.0, ans=0.2 2023-06-24 18:48:49,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1811592.0, ans=0.125 2023-06-24 18:48:52,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1811592.0, ans=0.125 2023-06-24 18:48:54,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1811592.0, ans=0.07 2023-06-24 18:49:00,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1811652.0, ans=0.125 2023-06-24 18:49:19,532 INFO [train.py:996] (2/4) Epoch 10, batch 27500, loss[loss=0.2263, simple_loss=0.3037, pruned_loss=0.07441, over 21662.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3194, pruned_loss=0.08417, over 4284327.66 frames. ], batch size: 230, lr: 2.88e-03, grad_scale: 8.0 2023-06-24 18:49:54,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1811772.0, ans=0.09899494936611666 2023-06-24 18:50:12,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1811832.0, ans=0.1 2023-06-24 18:50:17,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1811832.0, ans=0.0 2023-06-24 18:50:26,977 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.182e+02 6.516e+02 9.018e+02 1.642e+03 4.000e+03, threshold=1.804e+03, percent-clipped=22.0 2023-06-24 18:50:32,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1811892.0, ans=0.125 2023-06-24 18:50:37,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1811892.0, ans=0.025 2023-06-24 18:51:02,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1812012.0, ans=0.125 2023-06-24 18:51:04,406 INFO [train.py:996] (2/4) Epoch 10, batch 27550, loss[loss=0.2643, simple_loss=0.3167, pruned_loss=0.106, over 21285.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3149, pruned_loss=0.08154, over 4281070.58 frames. ], batch size: 471, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 18:52:00,718 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=1.046e-02 2023-06-24 18:52:07,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.75 vs. limit=15.0 2023-06-24 18:52:52,100 INFO [train.py:996] (2/4) Epoch 10, batch 27600, loss[loss=0.1922, simple_loss=0.2633, pruned_loss=0.06049, over 21616.00 frames. ], tot_loss[loss=0.234, simple_loss=0.308, pruned_loss=0.08001, over 4257835.87 frames. 
], batch size: 298, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:53:51,031 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.487e+02 7.022e+02 8.920e+02 1.196e+03 2.930e+03, threshold=1.784e+03, percent-clipped=9.0 2023-06-24 18:54:01,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1812492.0, ans=0.125 2023-06-24 18:54:33,762 INFO [train.py:996] (2/4) Epoch 10, batch 27650, loss[loss=0.2229, simple_loss=0.293, pruned_loss=0.07639, over 21848.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3024, pruned_loss=0.07908, over 4260980.89 frames. ], batch size: 98, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:55:06,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1812672.0, ans=0.0 2023-06-24 18:55:17,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1812732.0, ans=0.0 2023-06-24 18:55:44,566 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 18:56:07,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1812852.0, ans=0.04949747468305833 2023-06-24 18:56:23,898 INFO [train.py:996] (2/4) Epoch 10, batch 27700, loss[loss=0.227, simple_loss=0.3059, pruned_loss=0.07401, over 21630.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3027, pruned_loss=0.07759, over 4257095.73 frames. ], batch size: 230, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:56:24,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1812912.0, ans=0.125 2023-06-24 18:57:20,918 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.319e+02 7.274e+02 1.072e+03 1.414e+03 3.667e+03, threshold=2.145e+03, percent-clipped=20.0 2023-06-24 18:58:08,669 INFO [train.py:996] (2/4) Epoch 10, batch 27750, loss[loss=0.171, simple_loss=0.2404, pruned_loss=0.05079, over 16558.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3054, pruned_loss=0.07715, over 4262460.00 frames. ], batch size: 60, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 18:58:22,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1813212.0, ans=0.04949747468305833 2023-06-24 18:58:46,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1813332.0, ans=0.0 2023-06-24 18:59:27,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1813452.0, ans=0.035 2023-06-24 18:59:29,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1813452.0, ans=0.125 2023-06-24 18:59:33,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1813452.0, ans=0.1 2023-06-24 18:59:34,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1813452.0, ans=0.125 2023-06-24 18:59:45,887 INFO [train.py:996] (2/4) Epoch 10, batch 27800, loss[loss=0.1913, simple_loss=0.2615, pruned_loss=0.06057, over 21185.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3054, pruned_loss=0.07761, over 4270613.82 frames. 
], batch size: 608, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:00:37,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1813632.0, ans=0.125 2023-06-24 19:00:49,655 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.608e+02 7.266e+02 8.786e+02 1.229e+03 3.001e+03, threshold=1.757e+03, percent-clipped=8.0 2023-06-24 19:01:40,618 INFO [train.py:996] (2/4) Epoch 10, batch 27850, loss[loss=0.2576, simple_loss=0.3266, pruned_loss=0.09433, over 21896.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3052, pruned_loss=0.07942, over 4274519.34 frames. ], batch size: 107, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:01:55,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.26 vs. limit=15.0 2023-06-24 19:03:12,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1814052.0, ans=0.125 2023-06-24 19:03:17,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1814052.0, ans=0.0 2023-06-24 19:03:33,929 INFO [train.py:996] (2/4) Epoch 10, batch 27900, loss[loss=0.2001, simple_loss=0.2685, pruned_loss=0.06588, over 16850.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3117, pruned_loss=0.07966, over 4271409.39 frames. ], batch size: 61, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:04:37,096 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.993e+02 8.012e+02 1.137e+03 1.760e+03 3.581e+03, threshold=2.273e+03, percent-clipped=25.0 2023-06-24 19:05:11,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1814352.0, ans=0.125 2023-06-24 19:05:21,032 INFO [train.py:996] (2/4) Epoch 10, batch 27950, loss[loss=0.2292, simple_loss=0.3169, pruned_loss=0.0707, over 21748.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3115, pruned_loss=0.07673, over 4266972.31 frames. ], batch size: 298, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:05:34,856 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:05:47,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1814472.0, ans=0.1 2023-06-24 19:06:06,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1814532.0, ans=0.125 2023-06-24 19:06:10,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1814532.0, ans=0.2 2023-06-24 19:06:34,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1814592.0, ans=0.025 2023-06-24 19:06:55,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1814652.0, ans=0.125 2023-06-24 19:07:07,508 INFO [train.py:996] (2/4) Epoch 10, batch 28000, loss[loss=0.2118, simple_loss=0.3036, pruned_loss=0.06003, over 21038.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3109, pruned_loss=0.07506, over 4272182.06 frames. 
], batch size: 607, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:07:18,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1814712.0, ans=0.125 2023-06-24 19:07:48,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1814772.0, ans=0.0 2023-06-24 19:08:13,041 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.030e+02 6.703e+02 9.141e+02 1.311e+03 2.491e+03, threshold=1.828e+03, percent-clipped=2.0 2023-06-24 19:08:33,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1814892.0, ans=0.0 2023-06-24 19:08:56,446 INFO [train.py:996] (2/4) Epoch 10, batch 28050, loss[loss=0.2174, simple_loss=0.2835, pruned_loss=0.07562, over 21760.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3078, pruned_loss=0.07603, over 4276789.92 frames. ], batch size: 247, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:09:07,391 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.29 vs. limit=15.0 2023-06-24 19:09:08,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1815012.0, ans=0.125 2023-06-24 19:09:33,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1815072.0, ans=0.125 2023-06-24 19:10:04,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1815192.0, ans=0.0 2023-06-24 19:10:37,042 INFO [train.py:996] (2/4) Epoch 10, batch 28100, loss[loss=0.2037, simple_loss=0.2683, pruned_loss=0.06957, over 21634.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3066, pruned_loss=0.07653, over 4276077.96 frames. ], batch size: 282, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:11:53,207 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.563e+02 7.316e+02 9.478e+02 1.492e+03 2.732e+03, threshold=1.896e+03, percent-clipped=11.0 2023-06-24 19:12:28,316 INFO [train.py:996] (2/4) Epoch 10, batch 28150, loss[loss=0.1714, simple_loss=0.2283, pruned_loss=0.05722, over 20725.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.2991, pruned_loss=0.07653, over 4276417.12 frames. ], batch size: 608, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:13:39,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1815792.0, ans=0.125 2023-06-24 19:13:50,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=15.0 2023-06-24 19:14:00,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1815852.0, ans=0.1 2023-06-24 19:14:01,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1815852.0, ans=0.1 2023-06-24 19:14:22,721 INFO [train.py:996] (2/4) Epoch 10, batch 28200, loss[loss=0.2992, simple_loss=0.3543, pruned_loss=0.1221, over 21665.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2981, pruned_loss=0.07821, over 4269778.20 frames. 
], batch size: 441, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:15:09,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1816032.0, ans=0.2 2023-06-24 19:15:24,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1816092.0, ans=0.0 2023-06-24 19:15:29,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.315e+02 7.583e+02 1.086e+03 1.711e+03 4.107e+03, threshold=2.171e+03, percent-clipped=18.0 2023-06-24 19:15:38,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1816092.0, ans=0.125 2023-06-24 19:16:09,733 INFO [train.py:996] (2/4) Epoch 10, batch 28250, loss[loss=0.2143, simple_loss=0.2767, pruned_loss=0.07597, over 21572.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.301, pruned_loss=0.08062, over 4272434.19 frames. ], batch size: 263, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:16:13,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1816212.0, ans=0.0 2023-06-24 19:16:23,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1816212.0, ans=0.125 2023-06-24 19:17:11,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1816392.0, ans=0.125 2023-06-24 19:17:18,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1816392.0, ans=0.05 2023-06-24 19:17:19,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1816392.0, ans=0.2 2023-06-24 19:17:47,388 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:17:57,259 INFO [train.py:996] (2/4) Epoch 10, batch 28300, loss[loss=0.2126, simple_loss=0.2785, pruned_loss=0.07339, over 21861.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2989, pruned_loss=0.0783, over 4276567.80 frames. ], batch size: 107, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:18:31,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1816572.0, ans=0.2 2023-06-24 19:18:32,010 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.56 vs. limit=15.0 2023-06-24 19:18:41,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1816632.0, ans=0.0 2023-06-24 19:19:02,674 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.412e+02 7.987e+02 1.217e+03 2.083e+03 3.970e+03, threshold=2.435e+03, percent-clipped=20.0 2023-06-24 19:19:43,768 INFO [train.py:996] (2/4) Epoch 10, batch 28350, loss[loss=0.1801, simple_loss=0.2601, pruned_loss=0.05008, over 21240.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2959, pruned_loss=0.07266, over 4280963.34 frames. 
], batch size: 549, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:20:10,342 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:21:07,020 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:21:27,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817052.0, ans=0.1 2023-06-24 19:21:30,125 INFO [train.py:996] (2/4) Epoch 10, batch 28400, loss[loss=0.2275, simple_loss=0.2921, pruned_loss=0.08147, over 21503.00 frames. ], tot_loss[loss=0.2195, simple_loss=0.2936, pruned_loss=0.07273, over 4271379.88 frames. ], batch size: 389, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:21:45,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.22 vs. limit=15.0 2023-06-24 19:22:16,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.99 vs. limit=22.5 2023-06-24 19:22:48,471 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.250e+02 7.623e+02 1.000e+03 1.515e+03 3.438e+03, threshold=2.000e+03, percent-clipped=5.0 2023-06-24 19:22:56,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1817292.0, ans=0.0 2023-06-24 19:23:00,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1817352.0, ans=0.1 2023-06-24 19:23:22,172 INFO [train.py:996] (2/4) Epoch 10, batch 28450, loss[loss=0.2508, simple_loss=0.3251, pruned_loss=0.08828, over 21753.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3014, pruned_loss=0.07711, over 4270554.55 frames. ], batch size: 298, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:24:35,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1817592.0, ans=0.125 2023-06-24 19:24:42,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1817592.0, ans=0.0 2023-06-24 19:24:49,944 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. limit=6.0 2023-06-24 19:25:01,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1817652.0, ans=0.125 2023-06-24 19:25:19,415 INFO [train.py:996] (2/4) Epoch 10, batch 28500, loss[loss=0.2578, simple_loss=0.3226, pruned_loss=0.09647, over 20761.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3046, pruned_loss=0.08038, over 4275774.38 frames. ], batch size: 607, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:25:39,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1817772.0, ans=0.1 2023-06-24 19:25:53,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1817772.0, ans=0.2 2023-06-24 19:26:10,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.18 vs. 
limit=15.0 2023-06-24 19:26:27,792 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.659e+02 7.035e+02 9.248e+02 1.262e+03 2.146e+03, threshold=1.850e+03, percent-clipped=2.0 2023-06-24 19:27:07,854 INFO [train.py:996] (2/4) Epoch 10, batch 28550, loss[loss=0.2682, simple_loss=0.3629, pruned_loss=0.08669, over 21903.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.312, pruned_loss=0.0824, over 4277683.35 frames. ], batch size: 317, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:27:27,344 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.86 vs. limit=10.0 2023-06-24 19:27:36,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1818072.0, ans=0.0 2023-06-24 19:27:39,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1818072.0, ans=0.125 2023-06-24 19:28:09,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1818192.0, ans=0.125 2023-06-24 19:28:55,059 INFO [train.py:996] (2/4) Epoch 10, batch 28600, loss[loss=0.2299, simple_loss=0.3099, pruned_loss=0.07499, over 21431.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3182, pruned_loss=0.08445, over 4280895.54 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:28:56,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-24 19:29:18,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1818372.0, ans=0.1 2023-06-24 19:29:21,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.74 vs. limit=10.0 2023-06-24 19:30:04,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1818492.0, ans=0.1 2023-06-24 19:30:08,435 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.757e+02 7.121e+02 1.044e+03 1.505e+03 3.342e+03, threshold=2.089e+03, percent-clipped=18.0 2023-06-24 19:30:42,094 INFO [train.py:996] (2/4) Epoch 10, batch 28650, loss[loss=0.2109, simple_loss=0.2709, pruned_loss=0.07541, over 21328.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3108, pruned_loss=0.08266, over 4284991.66 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:30:42,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1818612.0, ans=0.125 2023-06-24 19:31:21,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1818672.0, ans=0.0 2023-06-24 19:32:34,012 INFO [train.py:996] (2/4) Epoch 10, batch 28700, loss[loss=0.2949, simple_loss=0.3533, pruned_loss=0.1182, over 21448.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3109, pruned_loss=0.08422, over 4285325.91 frames. 
], batch size: 507, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:32:41,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1818912.0, ans=0.0 2023-06-24 19:33:00,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1818972.0, ans=0.0 2023-06-24 19:33:52,201 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.396e+02 6.260e+02 7.915e+02 1.085e+03 2.283e+03, threshold=1.583e+03, percent-clipped=3.0 2023-06-24 19:34:20,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1819152.0, ans=0.125 2023-06-24 19:34:23,849 INFO [train.py:996] (2/4) Epoch 10, batch 28750, loss[loss=0.2332, simple_loss=0.3052, pruned_loss=0.08059, over 21909.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3105, pruned_loss=0.08438, over 4288206.33 frames. ], batch size: 414, lr: 2.87e-03, grad_scale: 8.0 2023-06-24 19:34:50,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1819272.0, ans=0.125 2023-06-24 19:35:07,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1819272.0, ans=0.125 2023-06-24 19:35:37,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1819392.0, ans=0.1 2023-06-24 19:36:17,103 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.62 vs. limit=12.0 2023-06-24 19:36:21,342 INFO [train.py:996] (2/4) Epoch 10, batch 28800, loss[loss=0.2048, simple_loss=0.3294, pruned_loss=0.0401, over 19869.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3147, pruned_loss=0.08468, over 4281684.36 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:36:57,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1819572.0, ans=0.2 2023-06-24 19:37:29,863 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.249e+02 6.492e+02 9.041e+02 1.378e+03 2.887e+03, threshold=1.808e+03, percent-clipped=17.0 2023-06-24 19:37:56,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1819752.0, ans=0.125 2023-06-24 19:38:07,908 INFO [train.py:996] (2/4) Epoch 10, batch 28850, loss[loss=0.2351, simple_loss=0.2949, pruned_loss=0.08768, over 21665.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3156, pruned_loss=0.08593, over 4284270.78 frames. ], batch size: 230, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:38:52,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1819872.0, ans=0.125 2023-06-24 19:39:04,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1819932.0, ans=0.0 2023-06-24 19:39:04,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1819932.0, ans=0.125 2023-06-24 19:39:58,243 INFO [train.py:996] (2/4) Epoch 10, batch 28900, loss[loss=0.2315, simple_loss=0.3, pruned_loss=0.08152, over 21290.00 frames. 
], tot_loss[loss=0.2474, simple_loss=0.319, pruned_loss=0.08791, over 4284121.62 frames. ], batch size: 143, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:40:11,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1820112.0, ans=0.0 2023-06-24 19:40:41,806 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.59 vs. limit=15.0 2023-06-24 19:40:42,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1820172.0, ans=0.125 2023-06-24 19:40:43,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1820172.0, ans=0.0 2023-06-24 19:40:49,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1820232.0, ans=0.0 2023-06-24 19:41:22,755 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.789e+02 7.717e+02 1.074e+03 1.480e+03 3.570e+03, threshold=2.148e+03, percent-clipped=12.0 2023-06-24 19:41:35,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1820352.0, ans=0.125 2023-06-24 19:41:46,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1820352.0, ans=0.07 2023-06-24 19:41:56,135 INFO [train.py:996] (2/4) Epoch 10, batch 28950, loss[loss=0.2345, simple_loss=0.353, pruned_loss=0.05798, over 21210.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3222, pruned_loss=0.08751, over 4276197.28 frames. ], batch size: 548, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:41:59,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1820412.0, ans=0.125 2023-06-24 19:42:10,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=1820412.0, ans=22.5 2023-06-24 19:42:15,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1820412.0, ans=0.125 2023-06-24 19:42:34,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1820472.0, ans=0.125 2023-06-24 19:43:32,865 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 19:43:53,413 INFO [train.py:996] (2/4) Epoch 10, batch 29000, loss[loss=0.2862, simple_loss=0.3568, pruned_loss=0.1078, over 21769.00 frames. ], tot_loss[loss=0.2492, simple_loss=0.3246, pruned_loss=0.08687, over 4269384.95 frames. 
], batch size: 441, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:43:53,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1820712.0, ans=0.1 2023-06-24 19:44:53,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1820832.0, ans=0.1 2023-06-24 19:45:02,232 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.875e+02 6.963e+02 8.880e+02 1.390e+03 4.828e+03, threshold=1.776e+03, percent-clipped=11.0 2023-06-24 19:45:11,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1820892.0, ans=0.0 2023-06-24 19:45:17,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1820952.0, ans=0.125 2023-06-24 19:45:20,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1820952.0, ans=0.125 2023-06-24 19:45:41,324 INFO [train.py:996] (2/4) Epoch 10, batch 29050, loss[loss=0.2528, simple_loss=0.3159, pruned_loss=0.09488, over 21808.00 frames. ], tot_loss[loss=0.2496, simple_loss=0.3238, pruned_loss=0.08768, over 4276531.91 frames. ], batch size: 112, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:46:05,897 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=15.0 2023-06-24 19:46:08,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1821072.0, ans=0.0 2023-06-24 19:47:09,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.30 vs. limit=22.5 2023-06-24 19:47:17,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1821252.0, ans=0.125 2023-06-24 19:47:18,672 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=22.5 2023-06-24 19:47:27,548 INFO [train.py:996] (2/4) Epoch 10, batch 29100, loss[loss=0.2201, simple_loss=0.2762, pruned_loss=0.08205, over 21569.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3155, pruned_loss=0.08545, over 4280666.12 frames. ], batch size: 231, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:47:31,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1821312.0, ans=0.125 2023-06-24 19:47:42,120 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.05 vs. limit=15.0 2023-06-24 19:48:11,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.03 vs. 
limit=15.0 2023-06-24 19:48:40,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.869e+02 7.571e+02 1.020e+03 1.542e+03 3.510e+03, threshold=2.040e+03, percent-clipped=14.0 2023-06-24 19:48:51,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1821552.0, ans=0.0 2023-06-24 19:49:07,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1821552.0, ans=0.125 2023-06-24 19:49:15,509 INFO [train.py:996] (2/4) Epoch 10, batch 29150, loss[loss=0.2316, simple_loss=0.3203, pruned_loss=0.0714, over 21378.00 frames. ], tot_loss[loss=0.2398, simple_loss=0.3127, pruned_loss=0.08347, over 4268140.63 frames. ], batch size: 176, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 19:49:34,719 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1821672.0, ans=0.125 2023-06-24 19:49:34,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1821672.0, ans=0.04949747468305833 2023-06-24 19:49:59,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.82 vs. limit=15.0 2023-06-24 19:50:00,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1821732.0, ans=0.125 2023-06-24 19:50:07,510 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.52 vs. limit=15.0 2023-06-24 19:50:43,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1821792.0, ans=0.125 2023-06-24 19:51:04,117 INFO [train.py:996] (2/4) Epoch 10, batch 29200, loss[loss=0.2325, simple_loss=0.3146, pruned_loss=0.07522, over 21731.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3074, pruned_loss=0.0817, over 4263967.73 frames. ], batch size: 351, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:51:43,844 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.40 vs. limit=22.5 2023-06-24 19:51:52,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1822032.0, ans=0.125 2023-06-24 19:52:17,861 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.365e+02 7.371e+02 1.140e+03 1.562e+03 2.800e+03, threshold=2.281e+03, percent-clipped=10.0 2023-06-24 19:52:53,160 INFO [train.py:996] (2/4) Epoch 10, batch 29250, loss[loss=0.189, simple_loss=0.2654, pruned_loss=0.05629, over 21778.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3066, pruned_loss=0.07964, over 4268177.72 frames. ], batch size: 118, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:52:54,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.58 vs. 
limit=15.0 2023-06-24 19:53:21,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1822272.0, ans=0.04949747468305833 2023-06-24 19:53:48,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1822332.0, ans=0.0 2023-06-24 19:54:12,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1822392.0, ans=0.1 2023-06-24 19:54:20,795 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.08 vs. limit=6.0 2023-06-24 19:54:36,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1822452.0, ans=0.125 2023-06-24 19:54:40,203 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.47 vs. limit=15.0 2023-06-24 19:54:41,046 INFO [train.py:996] (2/4) Epoch 10, batch 29300, loss[loss=0.215, simple_loss=0.2799, pruned_loss=0.07509, over 21212.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3073, pruned_loss=0.07896, over 4266317.41 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:54:42,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1822512.0, ans=0.0 2023-06-24 19:54:43,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1822512.0, ans=0.125 2023-06-24 19:55:22,843 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-24 19:56:01,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.11 vs. limit=15.0 2023-06-24 19:56:02,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.304e+02 6.696e+02 9.053e+02 1.473e+03 3.162e+03, threshold=1.811e+03, percent-clipped=5.0 2023-06-24 19:56:30,735 INFO [train.py:996] (2/4) Epoch 10, batch 29350, loss[loss=0.2245, simple_loss=0.2923, pruned_loss=0.07834, over 15276.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3044, pruned_loss=0.07874, over 4263854.15 frames. ], batch size: 61, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:56:31,700 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.12 vs. limit=15.0 2023-06-24 19:56:43,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1822812.0, ans=0.125 2023-06-24 19:57:38,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1822932.0, ans=0.125 2023-06-24 19:57:48,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1822992.0, ans=0.07 2023-06-24 19:58:07,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1823052.0, ans=0.1 2023-06-24 19:58:31,661 INFO [train.py:996] (2/4) Epoch 10, batch 29400, loss[loss=0.2093, simple_loss=0.316, pruned_loss=0.05127, over 20796.00 frames. 
], tot_loss[loss=0.2282, simple_loss=0.304, pruned_loss=0.07616, over 4265033.43 frames. ], batch size: 609, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 19:58:52,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1823172.0, ans=0.2 2023-06-24 19:58:59,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-24 19:59:39,415 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.735e+02 7.217e+02 1.208e+03 1.784e+03 4.520e+03, threshold=2.416e+03, percent-clipped=24.0 2023-06-24 19:59:46,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1823292.0, ans=0.0 2023-06-24 19:59:51,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1823352.0, ans=0.1 2023-06-24 20:00:18,183 INFO [train.py:996] (2/4) Epoch 10, batch 29450, loss[loss=0.2145, simple_loss=0.2887, pruned_loss=0.07013, over 21493.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.3028, pruned_loss=0.07545, over 4257950.28 frames. ], batch size: 194, lr: 2.87e-03, grad_scale: 32.0 2023-06-24 20:00:20,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1823412.0, ans=0.1 2023-06-24 20:00:27,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-24 20:00:37,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1823412.0, ans=0.125 2023-06-24 20:00:59,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1823472.0, ans=0.125 2023-06-24 20:01:00,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1823532.0, ans=0.1 2023-06-24 20:01:09,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=1823532.0, ans=0.025 2023-06-24 20:01:58,241 INFO [train.py:996] (2/4) Epoch 10, batch 29500, loss[loss=0.2221, simple_loss=0.2853, pruned_loss=0.07941, over 21345.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3066, pruned_loss=0.07878, over 4269148.67 frames. ], batch size: 159, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 20:02:22,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1823772.0, ans=10.0 2023-06-24 20:03:06,149 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.744e+02 7.201e+02 9.951e+02 1.281e+03 3.079e+03, threshold=1.990e+03, percent-clipped=2.0 2023-06-24 20:03:45,381 INFO [train.py:996] (2/4) Epoch 10, batch 29550, loss[loss=0.2444, simple_loss=0.3023, pruned_loss=0.09324, over 21429.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3061, pruned_loss=0.08102, over 4281427.16 frames. ], batch size: 211, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 20:04:47,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.24 vs. 
limit=15.0 2023-06-24 20:04:52,131 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1824192.0, ans=0.0 2023-06-24 20:05:38,089 INFO [train.py:996] (2/4) Epoch 10, batch 29600, loss[loss=0.3113, simple_loss=0.4373, pruned_loss=0.09264, over 19863.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3141, pruned_loss=0.08348, over 4286787.38 frames. ], batch size: 702, lr: 2.87e-03, grad_scale: 16.0 2023-06-24 20:06:10,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=22.5 2023-06-24 20:06:16,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1824432.0, ans=0.2 2023-06-24 20:06:23,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1824432.0, ans=0.125 2023-06-24 20:06:44,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1824492.0, ans=0.125 2023-06-24 20:06:54,891 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.014e+02 8.080e+02 1.124e+03 1.418e+03 4.047e+03, threshold=2.247e+03, percent-clipped=9.0 2023-06-24 20:07:22,970 INFO [train.py:996] (2/4) Epoch 10, batch 29650, loss[loss=0.1954, simple_loss=0.2683, pruned_loss=0.06129, over 21363.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3122, pruned_loss=0.07995, over 4263561.38 frames. ], batch size: 194, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:07:51,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1824672.0, ans=0.0 2023-06-24 20:08:06,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1824732.0, ans=0.0 2023-06-24 20:08:09,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1824732.0, ans=0.0 2023-06-24 20:08:22,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1824792.0, ans=0.125 2023-06-24 20:09:10,401 INFO [train.py:996] (2/4) Epoch 10, batch 29700, loss[loss=0.2058, simple_loss=0.2819, pruned_loss=0.0648, over 16106.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3116, pruned_loss=0.07998, over 4269056.92 frames. ], batch size: 60, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:09:14,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1824912.0, ans=0.2 2023-06-24 20:09:33,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.27 vs. limit=15.0 2023-06-24 20:09:44,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1824972.0, ans=0.025 2023-06-24 20:10:00,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. 
limit=22.5 2023-06-24 20:10:01,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1825032.0, ans=0.5 2023-06-24 20:10:06,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1825092.0, ans=0.125 2023-06-24 20:10:29,380 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.654e+02 7.202e+02 1.096e+03 1.682e+03 4.155e+03, threshold=2.193e+03, percent-clipped=13.0 2023-06-24 20:10:58,055 INFO [train.py:996] (2/4) Epoch 10, batch 29750, loss[loss=0.2794, simple_loss=0.3455, pruned_loss=0.1066, over 21988.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3168, pruned_loss=0.07993, over 4276642.49 frames. ], batch size: 113, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:11:28,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1825272.0, ans=0.125 2023-06-24 20:11:30,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1825272.0, ans=0.1 2023-06-24 20:11:35,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1825332.0, ans=0.125 2023-06-24 20:11:35,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1825332.0, ans=10.0 2023-06-24 20:12:34,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1825452.0, ans=0.1 2023-06-24 20:12:44,341 INFO [train.py:996] (2/4) Epoch 10, batch 29800, loss[loss=0.198, simple_loss=0.2808, pruned_loss=0.05758, over 21485.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3183, pruned_loss=0.08105, over 4276999.82 frames. ], batch size: 194, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:13:40,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1825692.0, ans=0.0 2023-06-24 20:13:40,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1825692.0, ans=0.125 2023-06-24 20:14:03,885 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.147e+02 7.156e+02 9.428e+02 1.298e+03 2.212e+03, threshold=1.886e+03, percent-clipped=2.0 2023-06-24 20:14:26,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1825752.0, ans=0.125 2023-06-24 20:14:29,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1825752.0, ans=0.5 2023-06-24 20:14:32,498 INFO [train.py:996] (2/4) Epoch 10, batch 29850, loss[loss=0.2094, simple_loss=0.2901, pruned_loss=0.06433, over 21839.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.314, pruned_loss=0.07839, over 4280011.27 frames. 
], batch size: 391, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:14:54,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1825872.0, ans=0.125 2023-06-24 20:15:16,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1825932.0, ans=0.125 2023-06-24 20:15:18,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1825932.0, ans=0.125 2023-06-24 20:15:45,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1825992.0, ans=0.05 2023-06-24 20:16:13,614 INFO [train.py:996] (2/4) Epoch 10, batch 29900, loss[loss=0.2513, simple_loss=0.3165, pruned_loss=0.09306, over 21760.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3119, pruned_loss=0.07936, over 4289071.81 frames. ], batch size: 414, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:16:19,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1826112.0, ans=0.125 2023-06-24 20:16:32,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.50 vs. limit=22.5 2023-06-24 20:16:49,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1826172.0, ans=0.125 2023-06-24 20:17:08,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-24 20:17:20,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1826292.0, ans=0.2 2023-06-24 20:17:37,191 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.094e+02 6.347e+02 7.852e+02 1.102e+03 2.302e+03, threshold=1.570e+03, percent-clipped=5.0 2023-06-24 20:18:07,360 INFO [train.py:996] (2/4) Epoch 10, batch 29950, loss[loss=0.2994, simple_loss=0.3691, pruned_loss=0.1149, over 21784.00 frames. ], tot_loss[loss=0.2405, simple_loss=0.3151, pruned_loss=0.08293, over 4290118.43 frames. ], batch size: 124, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:18:47,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.68 vs. limit=15.0 2023-06-24 20:18:48,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=1826532.0, ans=0.1 2023-06-24 20:19:55,931 INFO [train.py:996] (2/4) Epoch 10, batch 30000, loss[loss=0.272, simple_loss=0.3659, pruned_loss=0.08901, over 21598.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3181, pruned_loss=0.08299, over 4285318.88 frames. ], batch size: 441, lr: 2.86e-03, grad_scale: 32.0 2023-06-24 20:19:55,932 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 20:20:14,325 INFO [train.py:1028] (2/4) Epoch 10, validation: loss=0.2483, simple_loss=0.3443, pruned_loss=0.07614, over 1796401.00 frames. 
2023-06-24 20:20:14,325 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 20:20:31,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1826712.0, ans=0.125 2023-06-24 20:20:41,675 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.32 vs. limit=22.5 2023-06-24 20:21:40,534 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.194e+02 6.464e+02 9.828e+02 1.489e+03 3.469e+03, threshold=1.966e+03, percent-clipped=22.0 2023-06-24 20:22:00,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1826952.0, ans=0.2 2023-06-24 20:22:16,882 INFO [train.py:996] (2/4) Epoch 10, batch 30050, loss[loss=0.1966, simple_loss=0.2727, pruned_loss=0.06026, over 19948.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3204, pruned_loss=0.08018, over 4274528.52 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 32.0 2023-06-24 20:22:54,894 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.89 vs. limit=12.0 2023-06-24 20:23:18,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1827192.0, ans=0.125 2023-06-24 20:23:23,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1827192.0, ans=0.0 2023-06-24 20:24:02,293 INFO [train.py:996] (2/4) Epoch 10, batch 30100, loss[loss=0.2341, simple_loss=0.2942, pruned_loss=0.087, over 21785.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3196, pruned_loss=0.07977, over 4266008.36 frames. ], batch size: 102, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:24:19,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1827312.0, ans=0.04949747468305833 2023-06-24 20:24:33,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1827372.0, ans=0.1 2023-06-24 20:24:40,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1827432.0, ans=0.125 2023-06-24 20:25:20,859 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.518e+02 7.739e+02 1.138e+03 1.850e+03 3.841e+03, threshold=2.275e+03, percent-clipped=20.0 2023-06-24 20:25:53,266 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1827612.0, ans=0.125 2023-06-24 20:25:53,985 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.80 vs. limit=15.0 2023-06-24 20:25:54,240 INFO [train.py:996] (2/4) Epoch 10, batch 30150, loss[loss=0.2351, simple_loss=0.3063, pruned_loss=0.08197, over 21425.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3157, pruned_loss=0.08134, over 4259240.22 frames. 
], batch size: 159, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:26:01,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1827612.0, ans=0.125 2023-06-24 20:26:12,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.56 vs. limit=15.0 2023-06-24 20:26:15,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1827672.0, ans=0.1 2023-06-24 20:26:19,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1827672.0, ans=0.125 2023-06-24 20:26:36,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1827732.0, ans=0.0 2023-06-24 20:26:56,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1827792.0, ans=0.1 2023-06-24 20:27:38,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.64 vs. limit=10.0 2023-06-24 20:27:43,913 INFO [train.py:996] (2/4) Epoch 10, batch 30200, loss[loss=0.2139, simple_loss=0.3126, pruned_loss=0.05758, over 21814.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3178, pruned_loss=0.07984, over 4265260.54 frames. ], batch size: 282, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:27:48,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1827912.0, ans=0.125 2023-06-24 20:28:02,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1827912.0, ans=0.125 2023-06-24 20:28:15,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1827972.0, ans=0.0 2023-06-24 20:28:23,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1827972.0, ans=0.0 2023-06-24 20:29:04,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1828092.0, ans=0.125 2023-06-24 20:29:09,508 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.062e+02 7.107e+02 1.003e+03 1.591e+03 3.654e+03, threshold=2.006e+03, percent-clipped=8.0 2023-06-24 20:29:38,177 INFO [train.py:996] (2/4) Epoch 10, batch 30250, loss[loss=0.3078, simple_loss=0.4086, pruned_loss=0.1035, over 21790.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3254, pruned_loss=0.08188, over 4265655.21 frames. ], batch size: 332, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:30:06,431 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.57 vs. limit=6.0 2023-06-24 20:30:31,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.73 vs. 
limit=15.0 2023-06-24 20:30:34,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1828332.0, ans=0.2 2023-06-24 20:31:07,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1828452.0, ans=0.125 2023-06-24 20:31:07,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1828452.0, ans=0.125 2023-06-24 20:31:24,950 INFO [train.py:996] (2/4) Epoch 10, batch 30300, loss[loss=0.1882, simple_loss=0.2572, pruned_loss=0.05961, over 21283.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3219, pruned_loss=0.08154, over 4266836.60 frames. ], batch size: 176, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:32:12,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1828632.0, ans=0.1 2023-06-24 20:32:28,959 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.62 vs. limit=15.0 2023-06-24 20:32:33,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1828692.0, ans=0.125 2023-06-24 20:32:47,061 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.072e+02 7.345e+02 9.964e+02 1.414e+03 3.448e+03, threshold=1.993e+03, percent-clipped=9.0 2023-06-24 20:33:08,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1828752.0, ans=0.125 2023-06-24 20:33:21,517 INFO [train.py:996] (2/4) Epoch 10, batch 30350, loss[loss=0.1792, simple_loss=0.2389, pruned_loss=0.05975, over 21181.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3237, pruned_loss=0.08316, over 4269246.64 frames. ], batch size: 159, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:33:24,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1828812.0, ans=0.125 2023-06-24 20:34:07,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1828932.0, ans=0.1 2023-06-24 20:34:08,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1828932.0, ans=0.2 2023-06-24 20:34:14,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1828992.0, ans=0.125 2023-06-24 20:34:33,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.32 vs. limit=15.0 2023-06-24 20:34:50,284 INFO [train.py:996] (2/4) Epoch 10, batch 30400, loss[loss=0.2095, simple_loss=0.2677, pruned_loss=0.07569, over 20217.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3186, pruned_loss=0.08144, over 4260103.17 frames. 
], batch size: 703, lr: 2.86e-03, grad_scale: 32.0 2023-06-24 20:35:01,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1829112.0, ans=0.125 2023-06-24 20:35:23,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1829232.0, ans=0.0 2023-06-24 20:35:57,898 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.475e+02 9.226e+02 1.502e+03 2.385e+03 9.827e+03, threshold=3.004e+03, percent-clipped=34.0 2023-06-24 20:36:18,152 INFO [train.py:996] (2/4) Epoch 10, batch 30450, loss[loss=0.2833, simple_loss=0.4084, pruned_loss=0.07912, over 19852.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3191, pruned_loss=0.08033, over 4201070.70 frames. ], batch size: 702, lr: 2.86e-03, grad_scale: 16.0 2023-06-24 20:36:45,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1829472.0, ans=0.0 2023-06-24 20:39:22,098 INFO [train.py:996] (2/4) Epoch 11, batch 0, loss[loss=0.2372, simple_loss=0.3048, pruned_loss=0.08481, over 21680.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3048, pruned_loss=0.08481, over 21680.00 frames. ], batch size: 282, lr: 2.72e-03, grad_scale: 32.0 2023-06-24 20:39:22,099 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 20:39:38,861 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2455, simple_loss=0.3504, pruned_loss=0.07029, over 1796401.00 frames. 2023-06-24 20:39:38,861 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 20:39:53,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1829676.0, ans=0.125 2023-06-24 20:39:57,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1829736.0, ans=0.0 2023-06-24 20:40:09,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.49 vs. limit=12.0 2023-06-24 20:40:23,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1829796.0, ans=0.1 2023-06-24 20:40:43,497 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.85 vs. limit=15.0 2023-06-24 20:41:04,932 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.679e+02 1.420e+03 2.190e+03 4.363e+03 1.061e+04, threshold=4.380e+03, percent-clipped=34.0 2023-06-24 20:41:05,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.08 vs. limit=12.0 2023-06-24 20:41:17,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1829916.0, ans=0.1 2023-06-24 20:41:20,421 INFO [train.py:996] (2/4) Epoch 11, batch 50, loss[loss=0.219, simple_loss=0.2835, pruned_loss=0.07728, over 21834.00 frames. ], tot_loss[loss=0.2513, simple_loss=0.3317, pruned_loss=0.08548, over 971898.56 frames. 
], batch size: 98, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:41:48,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1830036.0, ans=0.0 2023-06-24 20:41:54,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1830036.0, ans=0.125 2023-06-24 20:42:01,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1830096.0, ans=0.125 2023-06-24 20:42:04,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1830096.0, ans=0.125 2023-06-24 20:42:32,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1830156.0, ans=0.1 2023-06-24 20:42:56,720 INFO [train.py:996] (2/4) Epoch 11, batch 100, loss[loss=0.2003, simple_loss=0.2653, pruned_loss=0.0676, over 21902.00 frames. ], tot_loss[loss=0.2543, simple_loss=0.3394, pruned_loss=0.08458, over 1702062.79 frames. ], batch size: 98, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:43:21,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1830336.0, ans=0.2 2023-06-24 20:44:30,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.668e+02 7.799e+02 1.011e+03 1.345e+03 2.704e+03, threshold=2.023e+03, percent-clipped=0.0 2023-06-24 20:44:32,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1830516.0, ans=0.0 2023-06-24 20:44:50,610 INFO [train.py:996] (2/4) Epoch 11, batch 150, loss[loss=0.2402, simple_loss=0.336, pruned_loss=0.07215, over 21768.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3375, pruned_loss=0.08401, over 2267065.27 frames. ], batch size: 298, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:45:00,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1830576.0, ans=0.125 2023-06-24 20:45:54,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1830756.0, ans=0.125 2023-06-24 20:45:56,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1830756.0, ans=0.0 2023-06-24 20:45:56,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-24 20:45:57,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1830756.0, ans=0.2 2023-06-24 20:46:09,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1830816.0, ans=0.1 2023-06-24 20:46:31,227 INFO [train.py:996] (2/4) Epoch 11, batch 200, loss[loss=0.2508, simple_loss=0.3159, pruned_loss=0.09278, over 21852.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.3358, pruned_loss=0.08254, over 2704257.81 frames. 
], batch size: 441, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:46:33,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1830876.0, ans=0.0 2023-06-24 20:47:55,400 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.519e+02 7.230e+02 1.005e+03 1.517e+03 6.245e+03, threshold=2.009e+03, percent-clipped=15.0 2023-06-24 20:48:14,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1831176.0, ans=0.05 2023-06-24 20:48:15,603 INFO [train.py:996] (2/4) Epoch 11, batch 250, loss[loss=0.2394, simple_loss=0.3399, pruned_loss=0.06949, over 21839.00 frames. ], tot_loss[loss=0.2466, simple_loss=0.3302, pruned_loss=0.08153, over 3059008.50 frames. ], batch size: 371, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:48:24,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1831176.0, ans=0.2 2023-06-24 20:48:51,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1831236.0, ans=0.125 2023-06-24 20:48:58,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1831296.0, ans=0.0 2023-06-24 20:49:20,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.12 vs. limit=15.0 2023-06-24 20:50:01,285 INFO [train.py:996] (2/4) Epoch 11, batch 300, loss[loss=0.2012, simple_loss=0.2675, pruned_loss=0.06747, over 21586.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3217, pruned_loss=0.0802, over 3326969.65 frames. ], batch size: 263, lr: 2.72e-03, grad_scale: 8.0 2023-06-24 20:50:10,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1831476.0, ans=0.1 2023-06-24 20:50:11,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1831476.0, ans=0.125 2023-06-24 20:50:13,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1831476.0, ans=0.2 2023-06-24 20:50:59,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1831596.0, ans=0.125 2023-06-24 20:51:16,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1831656.0, ans=0.0 2023-06-24 20:51:24,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1831716.0, ans=0.0 2023-06-24 20:51:30,477 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.045e+02 7.499e+02 1.164e+03 1.692e+03 3.059e+03, threshold=2.329e+03, percent-clipped=16.0 2023-06-24 20:51:50,064 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.07 vs. limit=15.0 2023-06-24 20:51:50,454 INFO [train.py:996] (2/4) Epoch 11, batch 350, loss[loss=0.2067, simple_loss=0.2962, pruned_loss=0.05859, over 21297.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3132, pruned_loss=0.07946, over 3538979.16 frames. 
], batch size: 176, lr: 2.72e-03, grad_scale: 8.0 2023-06-24 20:52:06,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.01 vs. limit=22.5 2023-06-24 20:52:15,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1831836.0, ans=0.125 2023-06-24 20:52:20,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831836.0, ans=0.1 2023-06-24 20:52:33,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1831896.0, ans=0.1 2023-06-24 20:52:52,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-06-24 20:53:31,107 INFO [train.py:996] (2/4) Epoch 11, batch 400, loss[loss=0.1932, simple_loss=0.2529, pruned_loss=0.06672, over 21282.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3104, pruned_loss=0.07829, over 3694261.55 frames. ], batch size: 144, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:53:32,202 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.86 vs. limit=6.0 2023-06-24 20:53:46,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.59 vs. limit=6.0 2023-06-24 20:54:40,139 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.93 vs. limit=22.5 2023-06-24 20:55:11,319 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.125e+02 8.397e+02 1.523e+03 1.983e+03 4.862e+03, threshold=3.046e+03, percent-clipped=16.0 2023-06-24 20:55:13,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1832316.0, ans=0.125 2023-06-24 20:55:18,139 INFO [train.py:996] (2/4) Epoch 11, batch 450, loss[loss=0.255, simple_loss=0.3062, pruned_loss=0.1019, over 21308.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.306, pruned_loss=0.07753, over 3821213.00 frames. ], batch size: 473, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:55:21,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1832376.0, ans=0.95 2023-06-24 20:55:44,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1832436.0, ans=0.0 2023-06-24 20:56:29,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1832556.0, ans=0.125 2023-06-24 20:56:53,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1832616.0, ans=0.1 2023-06-24 20:57:08,312 INFO [train.py:996] (2/4) Epoch 11, batch 500, loss[loss=0.2634, simple_loss=0.3551, pruned_loss=0.08587, over 21788.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3065, pruned_loss=0.07645, over 3929482.25 frames. 
], batch size: 282, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 20:57:08,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1832676.0, ans=0.125 2023-06-24 20:57:12,459 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.96 vs. limit=12.0 2023-06-24 20:58:00,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1832796.0, ans=0.125 2023-06-24 20:58:03,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1832796.0, ans=0.2 2023-06-24 20:58:04,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.33 vs. limit=15.0 2023-06-24 20:58:07,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1832856.0, ans=0.2 2023-06-24 20:58:11,237 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-06-24 20:58:12,842 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-24 20:58:40,204 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.448e+02 9.876e+02 1.724e+03 2.578e+03 4.436e+03, threshold=3.448e+03, percent-clipped=13.0 2023-06-24 20:58:53,038 INFO [train.py:996] (2/4) Epoch 11, batch 550, loss[loss=0.3051, simple_loss=0.4257, pruned_loss=0.09224, over 20712.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.31, pruned_loss=0.07474, over 4000359.45 frames. ], batch size: 607, lr: 2.72e-03, grad_scale: 16.0 2023-06-24 21:00:13,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1833156.0, ans=0.1 2023-06-24 21:00:22,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1833216.0, ans=0.125 2023-06-24 21:00:38,827 INFO [train.py:996] (2/4) Epoch 11, batch 600, loss[loss=0.2177, simple_loss=0.2807, pruned_loss=0.07736, over 21849.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3164, pruned_loss=0.0759, over 4065831.07 frames. 
], batch size: 98, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:00:53,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1833336.0, ans=0.04949747468305833 2023-06-24 21:01:09,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1833336.0, ans=0.0 2023-06-24 21:01:21,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1833336.0, ans=15.0 2023-06-24 21:01:48,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1833456.0, ans=0.125 2023-06-24 21:02:13,831 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.680e+02 7.057e+02 1.048e+03 1.641e+03 3.624e+03, threshold=2.096e+03, percent-clipped=2.0 2023-06-24 21:02:26,650 INFO [train.py:996] (2/4) Epoch 11, batch 650, loss[loss=0.2166, simple_loss=0.3508, pruned_loss=0.04118, over 20735.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3171, pruned_loss=0.07545, over 4111855.66 frames. ], batch size: 608, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:03:01,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1833636.0, ans=0.07 2023-06-24 21:03:01,838 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.85 vs. limit=6.0 2023-06-24 21:03:15,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1833696.0, ans=0.125 2023-06-24 21:03:22,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.14 vs. limit=10.0 2023-06-24 21:03:36,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1833756.0, ans=0.1 2023-06-24 21:04:04,921 INFO [train.py:996] (2/4) Epoch 11, batch 700, loss[loss=0.2299, simple_loss=0.3009, pruned_loss=0.07943, over 21786.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.316, pruned_loss=0.07683, over 4157794.58 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:05:06,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1833996.0, ans=0.125 2023-06-24 21:05:18,260 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1834056.0, ans=10.0 2023-06-24 21:05:44,874 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.745e+02 9.389e+02 1.418e+03 2.158e+03 4.228e+03, threshold=2.836e+03, percent-clipped=28.0 2023-06-24 21:05:51,316 INFO [train.py:996] (2/4) Epoch 11, batch 750, loss[loss=0.2146, simple_loss=0.2831, pruned_loss=0.07311, over 21750.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3127, pruned_loss=0.07773, over 4192799.58 frames. 
], batch size: 316, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:05:51,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1834176.0, ans=0.125 2023-06-24 21:06:12,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1834236.0, ans=0.125 2023-06-24 21:06:50,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1834296.0, ans=0.125 2023-06-24 21:07:08,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834356.0, ans=0.1 2023-06-24 21:07:39,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1834476.0, ans=0.125 2023-06-24 21:07:40,971 INFO [train.py:996] (2/4) Epoch 11, batch 800, loss[loss=0.1879, simple_loss=0.2676, pruned_loss=0.05406, over 21726.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3109, pruned_loss=0.07839, over 4196718.57 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:07:59,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.96 vs. limit=22.5 2023-06-24 21:08:38,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1834596.0, ans=0.125 2023-06-24 21:08:51,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1834656.0, ans=0.0 2023-06-24 21:09:21,224 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.816e+02 7.606e+02 1.363e+03 2.031e+03 4.976e+03, threshold=2.727e+03, percent-clipped=7.0 2023-06-24 21:09:32,172 INFO [train.py:996] (2/4) Epoch 11, batch 850, loss[loss=0.2446, simple_loss=0.3104, pruned_loss=0.08944, over 21466.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3086, pruned_loss=0.0786, over 4224091.20 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:10:03,754 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1834836.0, ans=0.125 2023-06-24 21:10:16,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1834896.0, ans=0.1 2023-06-24 21:10:19,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-24 21:10:56,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1835016.0, ans=10.0 2023-06-24 21:11:09,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1835016.0, ans=0.5 2023-06-24 21:11:20,168 INFO [train.py:996] (2/4) Epoch 11, batch 900, loss[loss=0.2206, simple_loss=0.315, pruned_loss=0.06312, over 21839.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3046, pruned_loss=0.07781, over 4244867.01 frames. 
], batch size: 371, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:11:44,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1835136.0, ans=0.125 2023-06-24 21:12:06,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1835196.0, ans=0.125 2023-06-24 21:12:55,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1835316.0, ans=0.0 2023-06-24 21:13:04,988 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 7.581e+02 9.805e+02 1.489e+03 3.191e+03, threshold=1.961e+03, percent-clipped=4.0 2023-06-24 21:13:08,469 INFO [train.py:996] (2/4) Epoch 11, batch 950, loss[loss=0.2607, simple_loss=0.3943, pruned_loss=0.06351, over 19734.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3026, pruned_loss=0.07715, over 4250466.28 frames. ], batch size: 702, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:13:47,186 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:14:00,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1835496.0, ans=0.125 2023-06-24 21:14:57,541 INFO [train.py:996] (2/4) Epoch 11, batch 1000, loss[loss=0.2605, simple_loss=0.3311, pruned_loss=0.09491, over 21818.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3032, pruned_loss=0.07751, over 4262035.53 frames. ], batch size: 118, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:16:07,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1835856.0, ans=0.125 2023-06-24 21:16:32,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1835916.0, ans=0.125 2023-06-24 21:16:49,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 6.726e+02 9.371e+02 1.402e+03 3.411e+03, threshold=1.874e+03, percent-clipped=8.0 2023-06-24 21:16:52,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1835976.0, ans=0.125 2023-06-24 21:16:53,242 INFO [train.py:996] (2/4) Epoch 11, batch 1050, loss[loss=0.2496, simple_loss=0.3213, pruned_loss=0.08892, over 21321.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3034, pruned_loss=0.07743, over 4267831.10 frames. 
], batch size: 143, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:16:58,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1835976.0, ans=0.0 2023-06-24 21:17:41,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1836096.0, ans=0.0 2023-06-24 21:17:55,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1836156.0, ans=0.125 2023-06-24 21:18:17,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1836156.0, ans=0.04949747468305833 2023-06-24 21:18:22,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1836216.0, ans=0.125 2023-06-24 21:18:43,428 INFO [train.py:996] (2/4) Epoch 11, batch 1100, loss[loss=0.2426, simple_loss=0.3096, pruned_loss=0.08777, over 21890.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3045, pruned_loss=0.07707, over 4271694.40 frames. ], batch size: 351, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:18:45,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1836276.0, ans=0.0 2023-06-24 21:20:25,776 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1836516.0, ans=0.0 2023-06-24 21:20:26,807 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.856e+02 8.203e+02 1.251e+03 2.125e+03 4.416e+03, threshold=2.502e+03, percent-clipped=31.0 2023-06-24 21:20:36,472 INFO [train.py:996] (2/4) Epoch 11, batch 1150, loss[loss=0.2356, simple_loss=0.3127, pruned_loss=0.07929, over 21315.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3069, pruned_loss=0.07705, over 4285963.22 frames. ], batch size: 194, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:21:17,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1836696.0, ans=0.0 2023-06-24 21:21:57,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1836756.0, ans=0.125 2023-06-24 21:22:25,707 INFO [train.py:996] (2/4) Epoch 11, batch 1200, loss[loss=0.2154, simple_loss=0.2976, pruned_loss=0.06657, over 21829.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3083, pruned_loss=0.07724, over 4286030.96 frames. 
], batch size: 282, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:22:34,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1836876.0, ans=0.125 2023-06-24 21:23:08,106 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:23:11,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1836996.0, ans=0.05 2023-06-24 21:23:50,254 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 21:24:02,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1837116.0, ans=0.2 2023-06-24 21:24:05,093 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.099e+02 7.723e+02 1.059e+03 1.468e+03 2.676e+03, threshold=2.118e+03, percent-clipped=4.0 2023-06-24 21:24:11,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1837116.0, ans=0.125 2023-06-24 21:24:14,372 INFO [train.py:996] (2/4) Epoch 11, batch 1250, loss[loss=0.2331, simple_loss=0.3097, pruned_loss=0.07819, over 21199.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3114, pruned_loss=0.07868, over 4291372.35 frames. ], batch size: 143, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:24:25,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1837176.0, ans=0.04949747468305833 2023-06-24 21:24:48,359 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.14 vs. limit=22.5 2023-06-24 21:24:48,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.28 vs. limit=10.0 2023-06-24 21:24:51,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1837236.0, ans=0.125 2023-06-24 21:24:56,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1837296.0, ans=0.0 2023-06-24 21:25:59,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1837416.0, ans=0.0 2023-06-24 21:26:04,413 INFO [train.py:996] (2/4) Epoch 11, batch 1300, loss[loss=0.2035, simple_loss=0.2951, pruned_loss=0.05599, over 21421.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3137, pruned_loss=0.08013, over 4296310.60 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:26:51,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1837596.0, ans=0.125 2023-06-24 21:27:05,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.42 vs. limit=10.0 2023-06-24 21:27:52,120 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.439e+02 7.719e+02 9.846e+02 1.503e+03 2.792e+03, threshold=1.969e+03, percent-clipped=4.0 2023-06-24 21:27:53,902 INFO [train.py:996] (2/4) Epoch 11, batch 1350, loss[loss=0.1813, simple_loss=0.2321, pruned_loss=0.06519, over 19951.00 frames. 
], tot_loss[loss=0.2356, simple_loss=0.3122, pruned_loss=0.07947, over 4293290.15 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:28:36,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1837896.0, ans=0.0 2023-06-24 21:29:15,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-06-24 21:29:40,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1838016.0, ans=0.2 2023-06-24 21:29:43,454 INFO [train.py:996] (2/4) Epoch 11, batch 1400, loss[loss=0.2861, simple_loss=0.3705, pruned_loss=0.1008, over 21477.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3114, pruned_loss=0.08011, over 4286360.09 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:30:34,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1838196.0, ans=10.0 2023-06-24 21:30:37,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1838196.0, ans=0.035 2023-06-24 21:31:02,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1838256.0, ans=0.0 2023-06-24 21:31:23,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1838316.0, ans=0.125 2023-06-24 21:31:31,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.111e+02 8.921e+02 1.291e+03 1.879e+03 3.355e+03, threshold=2.582e+03, percent-clipped=19.0 2023-06-24 21:31:33,598 INFO [train.py:996] (2/4) Epoch 11, batch 1450, loss[loss=0.2488, simple_loss=0.3232, pruned_loss=0.08726, over 21678.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3097, pruned_loss=0.08011, over 4288979.67 frames. ], batch size: 332, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:32:33,539 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-06-24 21:33:01,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1838556.0, ans=0.1 2023-06-24 21:33:02,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1838616.0, ans=0.125 2023-06-24 21:33:21,316 INFO [train.py:996] (2/4) Epoch 11, batch 1500, loss[loss=0.239, simple_loss=0.306, pruned_loss=0.08601, over 21926.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3112, pruned_loss=0.08094, over 4280744.99 frames. 
], batch size: 316, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:33:35,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1838676.0, ans=0.04949747468305833 2023-06-24 21:34:32,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1838856.0, ans=0.1 2023-06-24 21:34:32,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1838856.0, ans=0.2 2023-06-24 21:34:35,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1838856.0, ans=0.125 2023-06-24 21:34:51,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1838916.0, ans=0.125 2023-06-24 21:35:08,571 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.229e+02 8.037e+02 1.041e+03 1.486e+03 3.371e+03, threshold=2.081e+03, percent-clipped=9.0 2023-06-24 21:35:10,366 INFO [train.py:996] (2/4) Epoch 11, batch 1550, loss[loss=0.2491, simple_loss=0.3009, pruned_loss=0.09863, over 21550.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3088, pruned_loss=0.07994, over 4273516.75 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:36:02,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.70 vs. limit=22.5 2023-06-24 21:36:44,001 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=12.0 2023-06-24 21:36:46,910 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.76 vs. limit=15.0 2023-06-24 21:37:01,931 INFO [train.py:996] (2/4) Epoch 11, batch 1600, loss[loss=0.2036, simple_loss=0.264, pruned_loss=0.07163, over 21198.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3076, pruned_loss=0.07858, over 4280277.20 frames. ], batch size: 159, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:37:21,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1839276.0, ans=0.0 2023-06-24 21:38:59,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.877e+02 8.023e+02 1.190e+03 1.791e+03 3.601e+03, threshold=2.379e+03, percent-clipped=18.0 2023-06-24 21:39:01,023 INFO [train.py:996] (2/4) Epoch 11, batch 1650, loss[loss=0.2354, simple_loss=0.3257, pruned_loss=0.07256, over 20934.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.308, pruned_loss=0.07829, over 4274859.71 frames. ], batch size: 607, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:39:01,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1839576.0, ans=0.1 2023-06-24 21:39:59,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.23 vs. 
limit=15.0 2023-06-24 21:40:13,644 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1839756.0, ans=0.125 2023-06-24 21:40:22,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1839756.0, ans=0.1 2023-06-24 21:40:41,119 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.05 vs. limit=22.5 2023-06-24 21:40:50,789 INFO [train.py:996] (2/4) Epoch 11, batch 1700, loss[loss=0.2559, simple_loss=0.3457, pruned_loss=0.08299, over 21610.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3108, pruned_loss=0.07843, over 4276743.53 frames. ], batch size: 441, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:41:11,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1839876.0, ans=0.2 2023-06-24 21:42:02,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.54 vs. limit=12.0 2023-06-24 21:42:24,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1840116.0, ans=0.125 2023-06-24 21:42:47,740 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.610e+02 1.287e+03 1.983e+03 3.488e+03, threshold=2.574e+03, percent-clipped=18.0 2023-06-24 21:42:49,497 INFO [train.py:996] (2/4) Epoch 11, batch 1750, loss[loss=0.1611, simple_loss=0.2291, pruned_loss=0.04655, over 21780.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3106, pruned_loss=0.07782, over 4270506.37 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:42:50,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.24 vs. limit=12.0 2023-06-24 21:43:44,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-24 21:44:17,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1840416.0, ans=0.125 2023-06-24 21:44:19,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1840416.0, ans=0.0 2023-06-24 21:44:42,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1840416.0, ans=0.2 2023-06-24 21:44:50,785 INFO [train.py:996] (2/4) Epoch 11, batch 1800, loss[loss=0.2739, simple_loss=0.3515, pruned_loss=0.09819, over 21344.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3079, pruned_loss=0.07513, over 4261857.72 frames. ], batch size: 549, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:44:51,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1840476.0, ans=0.0 2023-06-24 21:44:52,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.75 vs. 
limit=15.0 2023-06-24 21:45:30,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1840596.0, ans=0.0 2023-06-24 21:45:34,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.30 vs. limit=6.0 2023-06-24 21:45:59,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1840656.0, ans=15.0 2023-06-24 21:46:40,512 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.892e+02 1.019e+03 1.831e+03 4.064e+03, threshold=2.037e+03, percent-clipped=9.0 2023-06-24 21:46:48,436 INFO [train.py:996] (2/4) Epoch 11, batch 1850, loss[loss=0.2376, simple_loss=0.3115, pruned_loss=0.08183, over 21411.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3062, pruned_loss=0.07273, over 4262089.81 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 21:46:55,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1840776.0, ans=0.07 2023-06-24 21:46:59,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.99 vs. limit=22.5 2023-06-24 21:47:44,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.60 vs. limit=15.0 2023-06-24 21:48:01,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1840956.0, ans=0.125 2023-06-24 21:48:32,721 INFO [train.py:996] (2/4) Epoch 11, batch 1900, loss[loss=0.1972, simple_loss=0.2808, pruned_loss=0.05684, over 21814.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.3082, pruned_loss=0.07333, over 4264555.32 frames. ], batch size: 282, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:48:33,670 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.10 vs. limit=10.0 2023-06-24 21:48:38,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1841076.0, ans=0.125 2023-06-24 21:48:44,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1841076.0, ans=0.1 2023-06-24 21:49:13,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1841196.0, ans=0.2 2023-06-24 21:49:18,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1841196.0, ans=0.0 2023-06-24 21:50:20,910 INFO [train.py:996] (2/4) Epoch 11, batch 1950, loss[loss=0.2063, simple_loss=0.263, pruned_loss=0.07486, over 21907.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3056, pruned_loss=0.07275, over 4261964.55 frames. ], batch size: 125, lr: 2.71e-03, grad_scale: 4.0 2023-06-24 21:50:22,716 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.695e+02 9.605e+02 1.769e+03 2.616e+03 5.034e+03, threshold=3.539e+03, percent-clipped=42.0 2023-06-24 21:50:27,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.90 vs. 
limit=15.0 2023-06-24 21:51:07,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.55 vs. limit=22.5 2023-06-24 21:52:05,552 INFO [train.py:996] (2/4) Epoch 11, batch 2000, loss[loss=0.2684, simple_loss=0.3488, pruned_loss=0.09398, over 19923.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3026, pruned_loss=0.07271, over 4267422.56 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:53:03,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1841856.0, ans=0.0 2023-06-24 21:53:35,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.70 vs. limit=15.0 2023-06-24 21:53:55,941 INFO [train.py:996] (2/4) Epoch 11, batch 2050, loss[loss=0.2214, simple_loss=0.3016, pruned_loss=0.07055, over 21462.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3035, pruned_loss=0.07408, over 4275944.93 frames. ], batch size: 131, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:53:57,626 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.961e+02 9.295e+02 1.430e+03 2.343e+03 5.111e+03, threshold=2.860e+03, percent-clipped=7.0 2023-06-24 21:54:12,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1842036.0, ans=0.125 2023-06-24 21:54:18,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1842036.0, ans=0.0 2023-06-24 21:55:19,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1842156.0, ans=0.0 2023-06-24 21:55:47,679 INFO [train.py:996] (2/4) Epoch 11, batch 2100, loss[loss=0.3053, simple_loss=0.3776, pruned_loss=0.1165, over 21777.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3066, pruned_loss=0.07631, over 4282515.69 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:56:15,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1842336.0, ans=0.05 2023-06-24 21:57:38,442 INFO [train.py:996] (2/4) Epoch 11, batch 2150, loss[loss=0.2055, simple_loss=0.2673, pruned_loss=0.07187, over 21743.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3108, pruned_loss=0.07945, over 4279433.81 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:57:39,924 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.865e+02 8.663e+02 1.127e+03 1.659e+03 3.855e+03, threshold=2.253e+03, percent-clipped=2.0 2023-06-24 21:57:42,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1842576.0, ans=0.125 2023-06-24 21:57:49,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1842576.0, ans=0.015 2023-06-24 21:58:16,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.97 vs. limit=15.0 2023-06-24 21:58:21,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.73 vs. 
limit=10.0 2023-06-24 21:58:30,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1842696.0, ans=0.125 2023-06-24 21:59:26,324 INFO [train.py:996] (2/4) Epoch 11, batch 2200, loss[loss=0.2116, simple_loss=0.2913, pruned_loss=0.06593, over 21817.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3138, pruned_loss=0.08039, over 4280719.60 frames. ], batch size: 124, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 21:59:36,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1842876.0, ans=0.125 2023-06-24 21:59:36,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1842876.0, ans=0.2 2023-06-24 21:59:39,005 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-24 21:59:52,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1842936.0, ans=0.0 2023-06-24 22:00:06,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1842996.0, ans=0.125 2023-06-24 22:00:23,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1842996.0, ans=0.125 2023-06-24 22:01:16,577 INFO [train.py:996] (2/4) Epoch 11, batch 2250, loss[loss=0.2298, simple_loss=0.3577, pruned_loss=0.051, over 20825.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3146, pruned_loss=0.07913, over 4277760.47 frames. ], batch size: 608, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:01:18,197 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 9.027e+02 1.396e+03 1.956e+03 3.592e+03, threshold=2.793e+03, percent-clipped=17.0 2023-06-24 22:01:24,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.14 vs. limit=15.0 2023-06-24 22:02:02,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-24 22:02:11,826 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=22.5 2023-06-24 22:02:43,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1843356.0, ans=0.0 2023-06-24 22:02:45,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.95 vs. limit=6.0 2023-06-24 22:03:05,565 INFO [train.py:996] (2/4) Epoch 11, batch 2300, loss[loss=0.2618, simple_loss=0.3484, pruned_loss=0.0876, over 21570.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3113, pruned_loss=0.07827, over 4281438.80 frames. ], batch size: 471, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:04:57,278 INFO [train.py:996] (2/4) Epoch 11, batch 2350, loss[loss=0.2235, simple_loss=0.2962, pruned_loss=0.0754, over 21647.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3039, pruned_loss=0.07689, over 4272088.99 frames. 
], batch size: 332, lr: 2.71e-03, grad_scale: 8.0 2023-06-24 22:04:58,969 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.967e+02 8.353e+02 1.301e+03 1.765e+03 5.491e+03, threshold=2.603e+03, percent-clipped=6.0 2023-06-24 22:05:02,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1843776.0, ans=0.125 2023-06-24 22:05:03,006 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=12.91 vs. limit=15.0 2023-06-24 22:05:20,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1843836.0, ans=0.0 2023-06-24 22:06:08,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1843896.0, ans=0.0 2023-06-24 22:06:21,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.84 vs. limit=22.5 2023-06-24 22:06:28,031 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.41 vs. limit=10.0 2023-06-24 22:06:47,692 INFO [train.py:996] (2/4) Epoch 11, batch 2400, loss[loss=0.2059, simple_loss=0.2864, pruned_loss=0.06271, over 21083.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3053, pruned_loss=0.07802, over 4263787.66 frames. ], batch size: 607, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:06:48,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.07 vs. limit=10.0 2023-06-24 22:07:09,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1844136.0, ans=0.125 2023-06-24 22:08:41,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1844316.0, ans=0.0 2023-06-24 22:08:44,042 INFO [train.py:996] (2/4) Epoch 11, batch 2450, loss[loss=0.2272, simple_loss=0.2963, pruned_loss=0.07904, over 21822.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3095, pruned_loss=0.08111, over 4271509.42 frames. 
], batch size: 317, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:08:45,753 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.999e+02 9.036e+02 1.390e+03 1.907e+03 3.347e+03, threshold=2.779e+03, percent-clipped=7.0 2023-06-24 22:09:25,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1844436.0, ans=0.0 2023-06-24 22:09:34,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1844496.0, ans=0.125 2023-06-24 22:09:41,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1844496.0, ans=0.125 2023-06-24 22:09:47,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1844496.0, ans=0.125 2023-06-24 22:09:51,321 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:10:21,093 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-24 22:10:24,537 INFO [train.py:996] (2/4) Epoch 11, batch 2500, loss[loss=0.2267, simple_loss=0.3011, pruned_loss=0.07615, over 21448.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3054, pruned_loss=0.08037, over 4266956.96 frames. ], batch size: 211, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:10:46,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1844736.0, ans=0.1 2023-06-24 22:11:05,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1844736.0, ans=0.125 2023-06-24 22:11:31,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1844796.0, ans=0.0 2023-06-24 22:11:54,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1844856.0, ans=0.125 2023-06-24 22:12:05,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1844916.0, ans=0.02 2023-06-24 22:12:21,377 INFO [train.py:996] (2/4) Epoch 11, batch 2550, loss[loss=0.2646, simple_loss=0.3329, pruned_loss=0.0982, over 21642.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3055, pruned_loss=0.08042, over 4273385.65 frames. 
], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:12:22,881 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.374e+02 8.761e+02 1.237e+03 1.691e+03 3.223e+03, threshold=2.475e+03, percent-clipped=6.0 2023-06-24 22:12:23,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1844976.0, ans=0.0 2023-06-24 22:12:37,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1845036.0, ans=0.125 2023-06-24 22:12:46,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1845036.0, ans=0.1 2023-06-24 22:12:51,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1845036.0, ans=0.0 2023-06-24 22:13:47,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1845216.0, ans=0.0 2023-06-24 22:14:11,175 INFO [train.py:996] (2/4) Epoch 11, batch 2600, loss[loss=0.2283, simple_loss=0.2961, pruned_loss=0.0803, over 21754.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3063, pruned_loss=0.08026, over 4266929.34 frames. ], batch size: 247, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:14:47,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1845336.0, ans=0.125 2023-06-24 22:15:29,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-24 22:15:35,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1845516.0, ans=0.125 2023-06-24 22:16:00,038 INFO [train.py:996] (2/4) Epoch 11, batch 2650, loss[loss=0.1915, simple_loss=0.2655, pruned_loss=0.05871, over 21619.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3085, pruned_loss=0.08163, over 4271883.47 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:16:01,623 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 1.066e+03 1.667e+03 2.223e+03 5.089e+03, threshold=3.334e+03, percent-clipped=18.0 2023-06-24 22:16:22,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1845636.0, ans=0.125 2023-06-24 22:16:39,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1845636.0, ans=0.07 2023-06-24 22:16:43,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1845636.0, ans=0.1 2023-06-24 22:17:46,142 INFO [train.py:996] (2/4) Epoch 11, batch 2700, loss[loss=0.2091, simple_loss=0.278, pruned_loss=0.07014, over 21659.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3068, pruned_loss=0.08052, over 4266720.76 frames. ], batch size: 247, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:18:09,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1845936.0, ans=0.125 2023-06-24 22:18:13,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=10.93 vs. 
limit=22.5 2023-06-24 22:18:38,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1845996.0, ans=0.0 2023-06-24 22:19:28,025 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.39 vs. limit=5.0 2023-06-24 22:19:36,530 INFO [train.py:996] (2/4) Epoch 11, batch 2750, loss[loss=0.2093, simple_loss=0.3004, pruned_loss=0.05906, over 21624.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3068, pruned_loss=0.08078, over 4266153.54 frames. ], batch size: 263, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:19:38,337 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.314e+02 7.487e+02 1.146e+03 1.660e+03 3.901e+03, threshold=2.292e+03, percent-clipped=2.0 2023-06-24 22:20:09,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1846236.0, ans=0.0 2023-06-24 22:20:10,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1846236.0, ans=0.0 2023-06-24 22:21:19,844 INFO [train.py:996] (2/4) Epoch 11, batch 2800, loss[loss=0.2869, simple_loss=0.3775, pruned_loss=0.09819, over 21641.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3096, pruned_loss=0.08181, over 4272676.15 frames. ], batch size: 414, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:21:48,626 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.70 vs. limit=15.0 2023-06-24 22:22:18,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1846596.0, ans=0.125 2023-06-24 22:23:10,942 INFO [train.py:996] (2/4) Epoch 11, batch 2850, loss[loss=0.2245, simple_loss=0.3596, pruned_loss=0.04472, over 19718.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3107, pruned_loss=0.08217, over 4264560.43 frames. ], batch size: 703, lr: 2.71e-03, grad_scale: 16.0 2023-06-24 22:23:19,718 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.356e+02 9.385e+02 1.588e+03 2.448e+03 5.122e+03, threshold=3.175e+03, percent-clipped=28.0 2023-06-24 22:24:06,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1846896.0, ans=0.125 2023-06-24 22:24:12,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.32 vs. limit=10.0 2023-06-24 22:24:13,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1846956.0, ans=0.0 2023-06-24 22:24:59,825 INFO [train.py:996] (2/4) Epoch 11, batch 2900, loss[loss=0.2388, simple_loss=0.3089, pruned_loss=0.08437, over 21931.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.31, pruned_loss=0.08214, over 4275851.74 frames. 
], batch size: 107, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:25:03,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1847076.0, ans=0.2 2023-06-24 22:25:04,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1847076.0, ans=0.125 2023-06-24 22:25:08,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1847076.0, ans=0.1 2023-06-24 22:25:56,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1847196.0, ans=0.125 2023-06-24 22:26:12,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1847256.0, ans=0.125 2023-06-24 22:26:48,153 INFO [train.py:996] (2/4) Epoch 11, batch 2950, loss[loss=0.2419, simple_loss=0.3355, pruned_loss=0.07419, over 21816.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3111, pruned_loss=0.081, over 4277306.42 frames. ], batch size: 351, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:26:51,471 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.380e+02 7.849e+02 1.003e+03 1.596e+03 3.041e+03, threshold=2.006e+03, percent-clipped=1.0 2023-06-24 22:27:47,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1847496.0, ans=0.1 2023-06-24 22:28:05,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1847556.0, ans=0.125 2023-06-24 22:28:08,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1847556.0, ans=0.1 2023-06-24 22:28:39,687 INFO [train.py:996] (2/4) Epoch 11, batch 3000, loss[loss=0.2509, simple_loss=0.3294, pruned_loss=0.08617, over 21762.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3156, pruned_loss=0.08143, over 4282399.23 frames. ], batch size: 332, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:28:39,688 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-24 22:29:02,933 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2533, simple_loss=0.3467, pruned_loss=0.07995, over 1796401.00 frames. 2023-06-24 22:29:02,934 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-24 22:30:50,696 INFO [train.py:996] (2/4) Epoch 11, batch 3050, loss[loss=0.2066, simple_loss=0.2813, pruned_loss=0.06592, over 21824.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3162, pruned_loss=0.08009, over 4272507.49 frames. ], batch size: 282, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:30:56,007 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 9.249e+02 1.451e+03 2.091e+03 4.098e+03, threshold=2.902e+03, percent-clipped=32.0 2023-06-24 22:31:06,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1847976.0, ans=0.1 2023-06-24 22:31:40,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1848096.0, ans=0.0 2023-06-24 22:32:39,980 INFO [train.py:996] (2/4) Epoch 11, batch 3100, loss[loss=0.2262, simple_loss=0.3015, pruned_loss=0.07545, over 20843.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3153, pruned_loss=0.07929, over 4264748.22 frames. 
], batch size: 607, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:33:48,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1848456.0, ans=0.125 2023-06-24 22:33:51,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1848456.0, ans=0.0 2023-06-24 22:34:30,823 INFO [train.py:996] (2/4) Epoch 11, batch 3150, loss[loss=0.2525, simple_loss=0.3248, pruned_loss=0.09006, over 21266.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3146, pruned_loss=0.0785, over 4265393.11 frames. ], batch size: 143, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:34:41,494 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.651e+02 8.004e+02 1.417e+03 1.894e+03 2.816e+03, threshold=2.834e+03, percent-clipped=0.0 2023-06-24 22:34:49,181 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=15.0 2023-06-24 22:35:04,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1848636.0, ans=0.125 2023-06-24 22:35:32,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1848696.0, ans=0.0 2023-06-24 22:35:56,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1848756.0, ans=0.1 2023-06-24 22:36:20,718 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 22:36:27,211 INFO [train.py:996] (2/4) Epoch 11, batch 3200, loss[loss=0.2788, simple_loss=0.3471, pruned_loss=0.1052, over 21354.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3172, pruned_loss=0.07931, over 4267440.15 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:36:37,119 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.23 vs. limit=6.0 2023-06-24 22:36:38,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1848876.0, ans=0.1 2023-06-24 22:36:53,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1848936.0, ans=0.2 2023-06-24 22:37:13,905 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-24 22:38:12,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1849116.0, ans=0.0 2023-06-24 22:38:15,091 INFO [train.py:996] (2/4) Epoch 11, batch 3250, loss[loss=0.2013, simple_loss=0.2597, pruned_loss=0.07149, over 20065.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3171, pruned_loss=0.07987, over 4276331.38 frames. 
], batch size: 702, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:38:20,055 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.217e+02 9.282e+02 1.304e+03 1.953e+03 5.530e+03, threshold=2.608e+03, percent-clipped=11.0 2023-06-24 22:38:58,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1849236.0, ans=0.0 2023-06-24 22:39:10,863 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=12.0 2023-06-24 22:39:46,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1849416.0, ans=0.2 2023-06-24 22:40:04,418 INFO [train.py:996] (2/4) Epoch 11, batch 3300, loss[loss=0.2785, simple_loss=0.3493, pruned_loss=0.1038, over 21339.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3144, pruned_loss=0.08079, over 4274703.63 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:40:06,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1849476.0, ans=0.125 2023-06-24 22:41:43,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=15.0 2023-06-24 22:41:54,791 INFO [train.py:996] (2/4) Epoch 11, batch 3350, loss[loss=0.2343, simple_loss=0.3044, pruned_loss=0.08214, over 21374.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3158, pruned_loss=0.08072, over 4273505.59 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:42:01,405 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.009e+02 8.020e+02 1.184e+03 1.979e+03 5.260e+03, threshold=2.368e+03, percent-clipped=15.0 2023-06-24 22:42:16,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1849776.0, ans=0.125 2023-06-24 22:42:25,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1849836.0, ans=0.0 2023-06-24 22:42:48,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1849896.0, ans=0.0 2023-06-24 22:43:39,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1850016.0, ans=0.2 2023-06-24 22:43:50,821 INFO [train.py:996] (2/4) Epoch 11, batch 3400, loss[loss=0.2225, simple_loss=0.2943, pruned_loss=0.07539, over 21836.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3159, pruned_loss=0.08102, over 4279679.32 frames. ], batch size: 107, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:43:51,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1850076.0, ans=0.025 2023-06-24 22:44:05,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1850076.0, ans=0.0 2023-06-24 22:45:09,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1850256.0, ans=0.0 2023-06-24 22:45:40,228 INFO [train.py:996] (2/4) Epoch 11, batch 3450, loss[loss=0.2308, simple_loss=0.2899, pruned_loss=0.08584, over 21808.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3111, pruned_loss=0.07997, over 4279488.14 frames. 
], batch size: 352, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:45:52,976 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.862e+02 7.621e+02 1.155e+03 1.643e+03 3.444e+03, threshold=2.310e+03, percent-clipped=7.0 2023-06-24 22:46:30,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1850496.0, ans=0.1 2023-06-24 22:47:05,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1850616.0, ans=0.125 2023-06-24 22:47:11,234 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.98 vs. limit=22.5 2023-06-24 22:47:35,890 INFO [train.py:996] (2/4) Epoch 11, batch 3500, loss[loss=0.3166, simple_loss=0.3788, pruned_loss=0.1272, over 21476.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3187, pruned_loss=0.08354, over 4283278.62 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:48:17,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1850736.0, ans=0.125 2023-06-24 22:48:26,499 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1850796.0, ans=0.125 2023-06-24 22:48:48,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1850856.0, ans=0.0 2023-06-24 22:49:10,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.18 vs. limit=15.0 2023-06-24 22:49:32,532 INFO [train.py:996] (2/4) Epoch 11, batch 3550, loss[loss=0.2327, simple_loss=0.2916, pruned_loss=0.0869, over 21829.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3215, pruned_loss=0.08482, over 4272701.86 frames. ], batch size: 98, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 22:49:39,361 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.955e+02 9.872e+02 1.548e+03 2.414e+03 6.693e+03, threshold=3.097e+03, percent-clipped=26.0 2023-06-24 22:49:46,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1850976.0, ans=0.125 2023-06-24 22:50:05,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1851036.0, ans=0.0 2023-06-24 22:50:05,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1851036.0, ans=0.04949747468305833 2023-06-24 22:50:49,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1851156.0, ans=0.2 2023-06-24 22:51:19,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1851216.0, ans=0.1 2023-06-24 22:51:22,526 INFO [train.py:996] (2/4) Epoch 11, batch 3600, loss[loss=0.2391, simple_loss=0.2966, pruned_loss=0.09081, over 21719.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3155, pruned_loss=0.08401, over 4274340.11 frames. 
], batch size: 112, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:51:28,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1851276.0, ans=0.125 2023-06-24 22:52:01,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1851396.0, ans=0.1 2023-06-24 22:52:05,543 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.81 vs. limit=15.0 2023-06-24 22:52:06,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1851396.0, ans=0.125 2023-06-24 22:52:32,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1851456.0, ans=0.0 2023-06-24 22:52:58,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1851516.0, ans=0.125 2023-06-24 22:53:00,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1851516.0, ans=0.125 2023-06-24 22:53:10,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1851516.0, ans=0.2 2023-06-24 22:53:11,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1851516.0, ans=0.1 2023-06-24 22:53:14,648 INFO [train.py:996] (2/4) Epoch 11, batch 3650, loss[loss=0.188, simple_loss=0.2849, pruned_loss=0.04551, over 19979.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3158, pruned_loss=0.08415, over 4272434.54 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:53:21,481 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.826e+02 7.978e+02 1.076e+03 1.568e+03 3.181e+03, threshold=2.152e+03, percent-clipped=1.0 2023-06-24 22:53:28,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1851576.0, ans=0.0 2023-06-24 22:53:55,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.min_positive, batch_count=1851696.0, ans=0.025 2023-06-24 22:54:05,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1851696.0, ans=0.125 2023-06-24 22:54:11,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.23 vs. limit=22.5 2023-06-24 22:54:20,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.54 vs. limit=10.0 2023-06-24 22:54:38,379 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2023-06-24 22:54:53,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1851816.0, ans=0.125 2023-06-24 22:55:01,469 INFO [train.py:996] (2/4) Epoch 11, batch 3700, loss[loss=0.243, simple_loss=0.3236, pruned_loss=0.08118, over 21763.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3138, pruned_loss=0.08311, over 4274397.24 frames. 
], batch size: 441, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:55:24,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1851936.0, ans=0.125 2023-06-24 22:55:39,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1851996.0, ans=0.2 2023-06-24 22:55:58,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1852056.0, ans=0.0 2023-06-24 22:56:10,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1852056.0, ans=0.125 2023-06-24 22:56:44,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1852116.0, ans=0.2 2023-06-24 22:56:55,503 INFO [train.py:996] (2/4) Epoch 11, batch 3750, loss[loss=0.3151, simple_loss=0.3893, pruned_loss=0.1204, over 21426.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.315, pruned_loss=0.08407, over 4280901.62 frames. ], batch size: 549, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:57:01,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1852176.0, ans=0.125 2023-06-24 22:57:02,983 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.268e+02 7.373e+02 1.096e+03 1.771e+03 3.259e+03, threshold=2.192e+03, percent-clipped=16.0 2023-06-24 22:57:03,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1852176.0, ans=0.125 2023-06-24 22:57:17,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1852236.0, ans=0.125 2023-06-24 22:57:24,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1852236.0, ans=0.125 2023-06-24 22:58:44,365 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.81 vs. limit=15.0 2023-06-24 22:58:44,867 INFO [train.py:996] (2/4) Epoch 11, batch 3800, loss[loss=0.2106, simple_loss=0.3128, pruned_loss=0.05421, over 19865.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3113, pruned_loss=0.08165, over 4279013.36 frames. ], batch size: 702, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 22:59:53,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.72 vs. limit=15.0 2023-06-24 23:00:22,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1852716.0, ans=0.04949747468305833 2023-06-24 23:00:22,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1852716.0, ans=0.2 2023-06-24 23:00:32,133 INFO [train.py:996] (2/4) Epoch 11, batch 3850, loss[loss=0.2097, simple_loss=0.2869, pruned_loss=0.06621, over 20175.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3085, pruned_loss=0.0818, over 4277372.34 frames. 
], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:00:39,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.734e+02 8.310e+02 1.331e+03 1.906e+03 3.711e+03, threshold=2.662e+03, percent-clipped=19.0 2023-06-24 23:00:43,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1852776.0, ans=0.125 2023-06-24 23:00:44,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1852776.0, ans=0.0 2023-06-24 23:00:49,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1852836.0, ans=0.1 2023-06-24 23:02:03,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.93 vs. limit=22.5 2023-06-24 23:02:10,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.48 vs. limit=15.0 2023-06-24 23:02:19,601 INFO [train.py:996] (2/4) Epoch 11, batch 3900, loss[loss=0.2218, simple_loss=0.2881, pruned_loss=0.07776, over 21699.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3041, pruned_loss=0.08172, over 4288172.87 frames. ], batch size: 391, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:04:02,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1853316.0, ans=0.125 2023-06-24 23:04:11,741 INFO [train.py:996] (2/4) Epoch 11, batch 3950, loss[loss=0.1883, simple_loss=0.2612, pruned_loss=0.05771, over 21840.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3065, pruned_loss=0.08098, over 4286341.97 frames. ], batch size: 124, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:04:18,252 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 6.470e+02 9.111e+02 1.353e+03 4.725e+03, threshold=1.822e+03, percent-clipped=4.0 2023-06-24 23:04:54,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1853496.0, ans=0.1 2023-06-24 23:05:15,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.43 vs. limit=22.5 2023-06-24 23:06:01,583 INFO [train.py:996] (2/4) Epoch 11, batch 4000, loss[loss=0.2237, simple_loss=0.2839, pruned_loss=0.08179, over 22009.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3015, pruned_loss=0.07766, over 4282266.86 frames. ], batch size: 103, lr: 2.70e-03, grad_scale: 32.0 2023-06-24 23:06:42,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1853796.0, ans=0.0 2023-06-24 23:07:41,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1853916.0, ans=0.125 2023-06-24 23:07:48,545 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=13.67 vs. limit=15.0 2023-06-24 23:07:49,286 INFO [train.py:996] (2/4) Epoch 11, batch 4050, loss[loss=0.2459, simple_loss=0.3463, pruned_loss=0.07274, over 19765.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3003, pruned_loss=0.07608, over 4271575.49 frames. 
], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:07:57,274 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.864e+02 8.238e+02 1.474e+03 2.566e+03 6.233e+03, threshold=2.948e+03, percent-clipped=38.0 2023-06-24 23:08:22,252 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.53 vs. limit=22.5 2023-06-24 23:09:27,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=1854216.0, ans=0.025 2023-06-24 23:09:34,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1854216.0, ans=0.5 2023-06-24 23:09:37,367 INFO [train.py:996] (2/4) Epoch 11, batch 4100, loss[loss=0.2003, simple_loss=0.2868, pruned_loss=0.05686, over 21517.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3022, pruned_loss=0.07708, over 4272041.49 frames. ], batch size: 195, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:09:59,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1854336.0, ans=0.125 2023-06-24 23:10:06,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1854336.0, ans=0.125 2023-06-24 23:10:24,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1854396.0, ans=0.0 2023-06-24 23:11:27,834 INFO [train.py:996] (2/4) Epoch 11, batch 4150, loss[loss=0.1907, simple_loss=0.2724, pruned_loss=0.05449, over 21452.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3029, pruned_loss=0.0748, over 4277058.50 frames. ], batch size: 195, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:11:29,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1854576.0, ans=0.125 2023-06-24 23:11:44,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.496e+02 6.391e+02 9.658e+02 1.367e+03 3.515e+03, threshold=1.932e+03, percent-clipped=2.0 2023-06-24 23:11:53,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1854636.0, ans=0.0 2023-06-24 23:12:11,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1854636.0, ans=0.0 2023-06-24 23:12:12,098 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=22.5 2023-06-24 23:12:42,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1854696.0, ans=0.125 2023-06-24 23:12:44,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1854756.0, ans=0.1 2023-06-24 23:13:04,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1854816.0, ans=0.2 2023-06-24 23:13:27,339 INFO [train.py:996] (2/4) Epoch 11, batch 4200, loss[loss=0.2395, simple_loss=0.3083, pruned_loss=0.0853, over 21696.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3037, pruned_loss=0.07418, over 4275200.96 frames. 
], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:14:27,524 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.88 vs. limit=15.0 2023-06-24 23:14:28,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1854996.0, ans=0.125 2023-06-24 23:14:33,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.11 vs. limit=6.0 2023-06-24 23:15:04,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1855116.0, ans=0.07 2023-06-24 23:15:24,074 INFO [train.py:996] (2/4) Epoch 11, batch 4250, loss[loss=0.2594, simple_loss=0.3351, pruned_loss=0.09185, over 21774.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3092, pruned_loss=0.07523, over 4272631.47 frames. ], batch size: 332, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:15:32,160 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.310e+02 8.743e+02 1.334e+03 2.102e+03 4.812e+03, threshold=2.669e+03, percent-clipped=26.0 2023-06-24 23:16:03,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1855236.0, ans=0.1 2023-06-24 23:16:20,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1855296.0, ans=0.0 2023-06-24 23:16:22,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1855296.0, ans=0.125 2023-06-24 23:17:10,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1855416.0, ans=0.1 2023-06-24 23:17:15,269 INFO [train.py:996] (2/4) Epoch 11, batch 4300, loss[loss=0.2153, simple_loss=0.3011, pruned_loss=0.06478, over 21407.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3154, pruned_loss=0.07733, over 4272447.78 frames. ], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:18:41,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1855656.0, ans=0.0 2023-06-24 23:18:55,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1855716.0, ans=0.0 2023-06-24 23:18:57,846 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.77 vs. limit=12.0 2023-06-24 23:19:09,585 INFO [train.py:996] (2/4) Epoch 11, batch 4350, loss[loss=0.2244, simple_loss=0.2785, pruned_loss=0.08516, over 21445.00 frames. ], tot_loss[loss=0.2345, simple_loss=0.3147, pruned_loss=0.07715, over 4272865.64 frames. 
], batch size: 211, lr: 2.70e-03, grad_scale: 8.0 2023-06-24 23:19:25,420 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.637e+02 8.181e+02 1.022e+03 1.627e+03 5.028e+03, threshold=2.045e+03, percent-clipped=6.0 2023-06-24 23:19:49,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1855896.0, ans=0.035 2023-06-24 23:20:23,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1855956.0, ans=0.125 2023-06-24 23:20:36,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1856016.0, ans=0.09899494936611666 2023-06-24 23:20:59,170 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.36 vs. limit=15.0 2023-06-24 23:21:06,786 INFO [train.py:996] (2/4) Epoch 11, batch 4400, loss[loss=0.2343, simple_loss=0.3226, pruned_loss=0.07305, over 21904.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3112, pruned_loss=0.07716, over 4261607.79 frames. ], batch size: 373, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:21:29,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1856136.0, ans=0.0 2023-06-24 23:21:33,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1856136.0, ans=0.125 2023-06-24 23:21:49,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1856196.0, ans=0.125 2023-06-24 23:21:51,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1856196.0, ans=0.0 2023-06-24 23:22:07,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1856196.0, ans=0.125 2023-06-24 23:22:57,254 INFO [train.py:996] (2/4) Epoch 11, batch 4450, loss[loss=0.2534, simple_loss=0.3487, pruned_loss=0.07905, over 21275.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3192, pruned_loss=0.07901, over 4267036.10 frames. ], batch size: 548, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:23:07,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.276e+02 9.671e+02 1.476e+03 2.549e+03 6.148e+03, threshold=2.952e+03, percent-clipped=35.0 2023-06-24 23:23:41,088 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.90 vs. limit=15.0 2023-06-24 23:23:42,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1856496.0, ans=0.125 2023-06-24 23:24:39,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1856616.0, ans=0.125 2023-06-24 23:24:47,347 INFO [train.py:996] (2/4) Epoch 11, batch 4500, loss[loss=0.268, simple_loss=0.3633, pruned_loss=0.08637, over 20744.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3208, pruned_loss=0.08023, over 4265439.28 frames. 
], batch size: 608, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:26:00,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1856856.0, ans=0.1 2023-06-24 23:26:28,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.81 vs. limit=15.0 2023-06-24 23:26:34,491 INFO [train.py:996] (2/4) Epoch 11, batch 4550, loss[loss=0.2665, simple_loss=0.3444, pruned_loss=0.09429, over 21587.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3228, pruned_loss=0.0807, over 4270641.81 frames. ], batch size: 414, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:26:44,478 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.220e+02 1.037e+03 1.526e+03 2.248e+03 5.276e+03, threshold=3.053e+03, percent-clipped=11.0 2023-06-24 23:27:02,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=15.0 2023-06-24 23:27:04,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1857036.0, ans=0.1 2023-06-24 23:28:23,406 INFO [train.py:996] (2/4) Epoch 11, batch 4600, loss[loss=0.1947, simple_loss=0.2773, pruned_loss=0.05602, over 21470.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3248, pruned_loss=0.08228, over 4277194.03 frames. ], batch size: 211, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:28:54,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1857336.0, ans=0.0 2023-06-24 23:29:11,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1857396.0, ans=0.0 2023-06-24 23:29:23,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1857396.0, ans=0.0 2023-06-24 23:30:01,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1857516.0, ans=0.0 2023-06-24 23:30:01,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1857516.0, ans=0.2 2023-06-24 23:30:02,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1857516.0, ans=0.2 2023-06-24 23:30:12,144 INFO [train.py:996] (2/4) Epoch 11, batch 4650, loss[loss=0.1687, simple_loss=0.2476, pruned_loss=0.04492, over 21538.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3197, pruned_loss=0.08145, over 4284224.20 frames. ], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:30:16,300 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.01 vs. limit=6.0 2023-06-24 23:30:29,044 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.698e+02 1.029e+03 1.673e+03 3.855e+03, threshold=2.058e+03, percent-clipped=3.0 2023-06-24 23:30:53,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1857636.0, ans=0.2 2023-06-24 23:30:57,862 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.91 vs. 
limit=22.5 2023-06-24 23:31:16,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1857696.0, ans=0.0 2023-06-24 23:31:25,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-24 23:31:55,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1857816.0, ans=0.125 2023-06-24 23:32:07,103 INFO [train.py:996] (2/4) Epoch 11, batch 4700, loss[loss=0.2273, simple_loss=0.2898, pruned_loss=0.0824, over 21656.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3115, pruned_loss=0.07943, over 4287726.19 frames. ], batch size: 393, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:32:34,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1857936.0, ans=0.0 2023-06-24 23:33:27,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.62 vs. limit=15.0 2023-06-24 23:33:34,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.21 vs. limit=15.0 2023-06-24 23:33:48,519 INFO [train.py:996] (2/4) Epoch 11, batch 4750, loss[loss=0.2214, simple_loss=0.2904, pruned_loss=0.07618, over 21420.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3073, pruned_loss=0.07912, over 4281084.60 frames. ], batch size: 131, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:33:53,204 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.35 vs. limit=10.0 2023-06-24 23:34:05,958 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.601e+02 8.320e+02 1.239e+03 2.079e+03 4.364e+03, threshold=2.479e+03, percent-clipped=25.0 2023-06-24 23:34:06,921 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.02 vs. limit=22.5 2023-06-24 23:34:47,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1858296.0, ans=0.125 2023-06-24 23:34:59,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1858296.0, ans=0.0 2023-06-24 23:35:01,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.58 vs. limit=15.0 2023-06-24 23:35:09,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1858356.0, ans=10.0 2023-06-24 23:35:15,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1858356.0, ans=0.0 2023-06-24 23:35:25,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1858416.0, ans=0.125 2023-06-24 23:35:42,499 INFO [train.py:996] (2/4) Epoch 11, batch 4800, loss[loss=0.2139, simple_loss=0.3185, pruned_loss=0.0546, over 21807.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3086, pruned_loss=0.07972, over 4281989.05 frames. 
], batch size: 351, lr: 2.70e-03, grad_scale: 32.0 2023-06-24 23:37:06,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1858716.0, ans=0.0 2023-06-24 23:37:12,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1858716.0, ans=0.0 2023-06-24 23:37:23,166 INFO [train.py:996] (2/4) Epoch 11, batch 4850, loss[loss=0.2653, simple_loss=0.335, pruned_loss=0.09782, over 21524.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3067, pruned_loss=0.07916, over 4283568.50 frames. ], batch size: 471, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:37:41,872 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.322e+02 1.130e+03 1.666e+03 2.337e+03 4.462e+03, threshold=3.333e+03, percent-clipped=23.0 2023-06-24 23:38:35,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1858956.0, ans=0.0 2023-06-24 23:38:40,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.84 vs. limit=15.0 2023-06-24 23:39:06,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1859016.0, ans=0.125 2023-06-24 23:39:15,288 INFO [train.py:996] (2/4) Epoch 11, batch 4900, loss[loss=0.2428, simple_loss=0.3117, pruned_loss=0.08696, over 21779.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3095, pruned_loss=0.08008, over 4287669.19 frames. ], batch size: 441, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:39:22,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1859076.0, ans=0.0 2023-06-24 23:39:34,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1859076.0, ans=0.04949747468305833 2023-06-24 23:39:42,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1859136.0, ans=0.0 2023-06-24 23:40:02,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1859196.0, ans=0.04949747468305833 2023-06-24 23:40:31,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.03 vs. limit=22.5 2023-06-24 23:41:05,224 INFO [train.py:996] (2/4) Epoch 11, batch 4950, loss[loss=0.1925, simple_loss=0.2884, pruned_loss=0.04826, over 21735.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3125, pruned_loss=0.07845, over 4287873.43 frames. 
], batch size: 332, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:41:23,339 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.980e+02 1.108e+03 1.676e+03 3.345e+03, threshold=2.216e+03, percent-clipped=1.0 2023-06-24 23:41:29,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1859436.0, ans=0.125 2023-06-24 23:41:41,434 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:41:55,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=1859496.0, ans=0.02 2023-06-24 23:42:14,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-24 23:42:15,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1859556.0, ans=0.125 2023-06-24 23:42:53,678 INFO [train.py:996] (2/4) Epoch 11, batch 5000, loss[loss=0.2005, simple_loss=0.2654, pruned_loss=0.06775, over 20180.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3108, pruned_loss=0.07601, over 4285278.55 frames. ], batch size: 703, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:43:05,262 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1859676.0, ans=0.125 2023-06-24 23:43:10,194 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859676.0, ans=0.1 2023-06-24 23:43:19,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.12 vs. limit=10.0 2023-06-24 23:43:59,239 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.04 vs. limit=15.0 2023-06-24 23:44:25,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1859916.0, ans=0.0 2023-06-24 23:44:37,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1859916.0, ans=0.125 2023-06-24 23:44:38,689 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-24 23:44:39,752 INFO [train.py:996] (2/4) Epoch 11, batch 5050, loss[loss=0.256, simple_loss=0.322, pruned_loss=0.09496, over 21874.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3107, pruned_loss=0.0774, over 4281888.37 frames. 
], batch size: 391, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:44:43,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1859976.0, ans=0.0 2023-06-24 23:44:49,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1859976.0, ans=0.1 2023-06-24 23:44:57,980 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 7.469e+02 1.066e+03 1.616e+03 3.471e+03, threshold=2.133e+03, percent-clipped=8.0 2023-06-24 23:45:25,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1860096.0, ans=0.0 2023-06-24 23:45:25,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1860096.0, ans=0.2 2023-06-24 23:45:28,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1860096.0, ans=0.0 2023-06-24 23:45:46,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.18 vs. limit=6.0 2023-06-24 23:46:26,228 INFO [train.py:996] (2/4) Epoch 11, batch 5100, loss[loss=0.1958, simple_loss=0.2801, pruned_loss=0.05577, over 21856.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3093, pruned_loss=0.07754, over 4278121.12 frames. ], batch size: 371, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:46:51,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1860336.0, ans=0.1 2023-06-24 23:47:35,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.44 vs. limit=15.0 2023-06-24 23:48:11,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-24 23:48:21,808 INFO [train.py:996] (2/4) Epoch 11, batch 5150, loss[loss=0.2174, simple_loss=0.2907, pruned_loss=0.07201, over 21889.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3073, pruned_loss=0.07855, over 4284332.90 frames. ], batch size: 316, lr: 2.70e-03, grad_scale: 16.0 2023-06-24 23:48:34,355 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.000e+02 7.764e+02 1.031e+03 1.609e+03 3.475e+03, threshold=2.061e+03, percent-clipped=12.0 2023-06-24 23:48:41,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1860636.0, ans=0.2 2023-06-24 23:49:20,204 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1860696.0, ans=0.1 2023-06-24 23:49:32,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1860756.0, ans=0.2 2023-06-24 23:50:04,578 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.45 vs. limit=22.5 2023-06-24 23:50:11,763 INFO [train.py:996] (2/4) Epoch 11, batch 5200, loss[loss=0.2832, simple_loss=0.3713, pruned_loss=0.09752, over 21746.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3089, pruned_loss=0.07871, over 4286448.99 frames. 
], batch size: 332, lr: 2.69e-03, grad_scale: 32.0 2023-06-24 23:50:36,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=1860936.0, ans=10.0 2023-06-24 23:51:58,023 INFO [train.py:996] (2/4) Epoch 11, batch 5250, loss[loss=0.2395, simple_loss=0.326, pruned_loss=0.07651, over 21588.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3135, pruned_loss=0.07766, over 4283562.22 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:52:18,260 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 9.518e+02 1.553e+03 2.129e+03 4.596e+03, threshold=3.106e+03, percent-clipped=26.0 2023-06-24 23:53:25,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1861416.0, ans=0.125 2023-06-24 23:53:29,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1861416.0, ans=0.0 2023-06-24 23:53:38,134 INFO [train.py:996] (2/4) Epoch 11, batch 5300, loss[loss=0.2522, simple_loss=0.3215, pruned_loss=0.09145, over 21961.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3115, pruned_loss=0.07807, over 4289247.50 frames. ], batch size: 113, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:54:08,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1861536.0, ans=0.1 2023-06-24 23:54:17,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1861536.0, ans=0.125 2023-06-24 23:54:21,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1861596.0, ans=0.125 2023-06-24 23:54:50,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1861656.0, ans=0.125 2023-06-24 23:55:12,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=15.0 2023-06-24 23:55:17,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1861716.0, ans=0.5 2023-06-24 23:55:22,269 INFO [train.py:996] (2/4) Epoch 11, batch 5350, loss[loss=0.2895, simple_loss=0.3422, pruned_loss=0.1184, over 21736.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3106, pruned_loss=0.07969, over 4287859.24 frames. 
], batch size: 473, lr: 2.69e-03, grad_scale: 16.0 2023-06-24 23:55:27,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1861776.0, ans=0.125 2023-06-24 23:55:35,080 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.646e+02 7.558e+02 1.125e+03 1.569e+03 2.899e+03, threshold=2.250e+03, percent-clipped=0.0 2023-06-24 23:55:55,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1861836.0, ans=0.125 2023-06-24 23:56:12,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1861896.0, ans=0.125 2023-06-24 23:56:19,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1861896.0, ans=0.1 2023-06-24 23:56:51,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1862016.0, ans=0.125 2023-06-24 23:57:01,694 INFO [train.py:996] (2/4) Epoch 11, batch 5400, loss[loss=0.212, simple_loss=0.3177, pruned_loss=0.05315, over 20827.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3092, pruned_loss=0.08042, over 4288901.04 frames. ], batch size: 607, lr: 2.69e-03, grad_scale: 8.0 2023-06-24 23:58:48,740 INFO [train.py:996] (2/4) Epoch 11, batch 5450, loss[loss=0.2601, simple_loss=0.3708, pruned_loss=0.07471, over 21636.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3116, pruned_loss=0.07875, over 4286231.46 frames. ], batch size: 389, lr: 2.69e-03, grad_scale: 8.0 2023-06-24 23:59:09,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.83 vs. limit=15.0 2023-06-24 23:59:10,601 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.863e+02 8.635e+02 1.460e+03 2.379e+03 5.903e+03, threshold=2.920e+03, percent-clipped=27.0 2023-06-24 23:59:14,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1862436.0, ans=0.1 2023-06-24 23:59:19,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1862436.0, ans=0.0 2023-06-24 23:59:47,568 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1862496.0, ans=0.125 2023-06-25 00:00:45,475 INFO [train.py:996] (2/4) Epoch 11, batch 5500, loss[loss=0.2289, simple_loss=0.3272, pruned_loss=0.06537, over 21757.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3175, pruned_loss=0.07691, over 4278652.82 frames. ], batch size: 332, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:00:57,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1862676.0, ans=0.025 2023-06-25 00:01:12,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1862736.0, ans=0.125 2023-06-25 00:02:33,296 INFO [train.py:996] (2/4) Epoch 11, batch 5550, loss[loss=0.1785, simple_loss=0.2541, pruned_loss=0.05149, over 21197.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3163, pruned_loss=0.07405, over 4270621.83 frames. 
], batch size: 159, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:02:48,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.684e+02 8.321e+02 1.311e+03 1.956e+03 3.720e+03, threshold=2.623e+03, percent-clipped=7.0 2023-06-25 00:02:58,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1863036.0, ans=0.2 2023-06-25 00:03:17,898 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.22 vs. limit=10.0 2023-06-25 00:03:29,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1863096.0, ans=0.125 2023-06-25 00:04:18,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.52 vs. limit=22.5 2023-06-25 00:04:21,199 INFO [train.py:996] (2/4) Epoch 11, batch 5600, loss[loss=0.236, simple_loss=0.3212, pruned_loss=0.07541, over 21404.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3136, pruned_loss=0.07017, over 4266342.35 frames. ], batch size: 194, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:04:32,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1863276.0, ans=0.2 2023-06-25 00:05:09,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1863396.0, ans=0.125 2023-06-25 00:05:29,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1863396.0, ans=0.125 2023-06-25 00:06:01,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1863516.0, ans=0.125 2023-06-25 00:06:06,170 INFO [train.py:996] (2/4) Epoch 11, batch 5650, loss[loss=0.2525, simple_loss=0.3163, pruned_loss=0.09433, over 21878.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3165, pruned_loss=0.07238, over 4269749.45 frames. ], batch size: 124, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:06:18,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1863576.0, ans=0.2 2023-06-25 00:06:32,022 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.004e+02 8.541e+02 1.292e+03 2.009e+03 3.827e+03, threshold=2.583e+03, percent-clipped=13.0 2023-06-25 00:06:52,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1863696.0, ans=0.1 2023-06-25 00:07:04,783 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.23 vs. limit=15.0 2023-06-25 00:07:28,502 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:07:37,804 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.52 vs. limit=12.0 2023-06-25 00:07:42,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1863816.0, ans=0.125 2023-06-25 00:07:57,641 INFO [train.py:996] (2/4) Epoch 11, batch 5700, loss[loss=0.2294, simple_loss=0.3063, pruned_loss=0.07618, over 21123.00 frames. 
], tot_loss[loss=0.232, simple_loss=0.315, pruned_loss=0.07449, over 4275826.83 frames. ], batch size: 608, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:08:12,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1863876.0, ans=0.125 2023-06-25 00:08:39,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1863936.0, ans=0.1 2023-06-25 00:08:56,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1863996.0, ans=0.05 2023-06-25 00:09:47,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1864116.0, ans=0.2 2023-06-25 00:09:53,484 INFO [train.py:996] (2/4) Epoch 11, batch 5750, loss[loss=0.2705, simple_loss=0.3509, pruned_loss=0.09502, over 21504.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3119, pruned_loss=0.07247, over 4270014.57 frames. ], batch size: 508, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:10:08,445 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.527e+02 8.365e+02 1.283e+03 1.865e+03 4.523e+03, threshold=2.566e+03, percent-clipped=10.0 2023-06-25 00:11:14,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1864356.0, ans=0.125 2023-06-25 00:11:31,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1864416.0, ans=0.09899494936611666 2023-06-25 00:11:39,481 INFO [train.py:996] (2/4) Epoch 11, batch 5800, loss[loss=0.2452, simple_loss=0.3464, pruned_loss=0.07196, over 21533.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3134, pruned_loss=0.07115, over 4266913.21 frames. ], batch size: 471, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:12:54,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1864656.0, ans=0.125 2023-06-25 00:13:05,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1864656.0, ans=0.0 2023-06-25 00:13:24,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1864716.0, ans=0.2 2023-06-25 00:13:27,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.71 vs. limit=10.0 2023-06-25 00:13:32,882 INFO [train.py:996] (2/4) Epoch 11, batch 5850, loss[loss=0.1893, simple_loss=0.3041, pruned_loss=0.03724, over 21635.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.3094, pruned_loss=0.06707, over 4263476.23 frames. 
], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:13:33,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1864776.0, ans=0.1 2023-06-25 00:13:52,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1864776.0, ans=0.125 2023-06-25 00:13:53,412 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 3.480e+02 6.927e+02 1.116e+03 1.995e+03 4.965e+03, threshold=2.231e+03, percent-clipped=19.0 2023-06-25 00:14:07,440 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.90 vs. limit=22.5 2023-06-25 00:14:54,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1865016.0, ans=0.1 2023-06-25 00:15:17,048 INFO [train.py:996] (2/4) Epoch 11, batch 5900, loss[loss=0.2105, simple_loss=0.2849, pruned_loss=0.06801, over 21427.00 frames. ], tot_loss[loss=0.2117, simple_loss=0.3002, pruned_loss=0.06161, over 4264020.91 frames. ], batch size: 131, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:16:14,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1865196.0, ans=0.2 2023-06-25 00:16:44,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.21 vs. limit=22.5 2023-06-25 00:17:06,601 INFO [train.py:996] (2/4) Epoch 11, batch 5950, loss[loss=0.247, simple_loss=0.3041, pruned_loss=0.09495, over 21686.00 frames. ], tot_loss[loss=0.214, simple_loss=0.2986, pruned_loss=0.0647, over 4271075.79 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:17:10,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1865376.0, ans=0.125 2023-06-25 00:17:11,268 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=12.39 vs. limit=15.0 2023-06-25 00:17:21,624 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.131e+02 6.600e+02 8.461e+02 1.275e+03 2.602e+03, threshold=1.692e+03, percent-clipped=3.0 2023-06-25 00:18:39,954 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-25 00:18:51,556 INFO [train.py:996] (2/4) Epoch 11, batch 6000, loss[loss=0.211, simple_loss=0.2697, pruned_loss=0.07608, over 21251.00 frames. ], tot_loss[loss=0.2148, simple_loss=0.2941, pruned_loss=0.06777, over 4272773.58 frames. ], batch size: 159, lr: 2.69e-03, grad_scale: 32.0 2023-06-25 00:18:51,557 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 00:19:08,583 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2642, simple_loss=0.3568, pruned_loss=0.08578, over 1796401.00 frames. 
2023-06-25 00:19:08,584 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 00:19:18,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1865676.0, ans=0.125 2023-06-25 00:20:01,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1865796.0, ans=0.09899494936611666 2023-06-25 00:20:53,357 INFO [train.py:996] (2/4) Epoch 11, batch 6050, loss[loss=0.2181, simple_loss=0.2867, pruned_loss=0.07477, over 15999.00 frames. ], tot_loss[loss=0.2128, simple_loss=0.2885, pruned_loss=0.06855, over 4273841.71 frames. ], batch size: 60, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:21:16,012 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.73 vs. limit=22.5 2023-06-25 00:21:18,396 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.947e+02 8.062e+02 1.043e+03 1.359e+03 2.248e+03, threshold=2.086e+03, percent-clipped=5.0 2023-06-25 00:21:32,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1866036.0, ans=0.0 2023-06-25 00:21:52,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1866096.0, ans=0.125 2023-06-25 00:22:39,202 INFO [train.py:996] (2/4) Epoch 11, batch 6100, loss[loss=0.227, simple_loss=0.2998, pruned_loss=0.07707, over 21534.00 frames. ], tot_loss[loss=0.2123, simple_loss=0.2891, pruned_loss=0.06774, over 4275216.85 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:23:01,953 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:24:27,320 INFO [train.py:996] (2/4) Epoch 11, batch 6150, loss[loss=0.224, simple_loss=0.2938, pruned_loss=0.07709, over 21906.00 frames. ], tot_loss[loss=0.2165, simple_loss=0.2927, pruned_loss=0.07014, over 4283309.45 frames. ], batch size: 98, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:24:43,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1866576.0, ans=0.125 2023-06-25 00:24:58,586 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.664e+02 7.679e+02 1.290e+03 1.928e+03 3.741e+03, threshold=2.581e+03, percent-clipped=18.0 2023-06-25 00:25:12,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1866696.0, ans=0.125 2023-06-25 00:25:51,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1866756.0, ans=0.125 2023-06-25 00:26:19,944 INFO [train.py:996] (2/4) Epoch 11, batch 6200, loss[loss=0.2861, simple_loss=0.3842, pruned_loss=0.09397, over 21769.00 frames. ], tot_loss[loss=0.2182, simple_loss=0.2954, pruned_loss=0.07056, over 4277232.70 frames. ], batch size: 391, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:26:20,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1866876.0, ans=0.125 2023-06-25 00:26:21,036 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.43 vs. 
limit=15.0 2023-06-25 00:26:25,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1866876.0, ans=0.0 2023-06-25 00:26:28,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1866876.0, ans=0.125 2023-06-25 00:26:51,093 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:27:04,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1866996.0, ans=0.0 2023-06-25 00:28:03,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1867116.0, ans=0.2 2023-06-25 00:28:06,346 INFO [train.py:996] (2/4) Epoch 11, batch 6250, loss[loss=0.2121, simple_loss=0.3005, pruned_loss=0.06181, over 21277.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2996, pruned_loss=0.06922, over 4269112.09 frames. ], batch size: 159, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:28:31,519 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.849e+02 8.847e+02 1.490e+03 2.226e+03 5.467e+03, threshold=2.981e+03, percent-clipped=18.0 2023-06-25 00:29:15,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten.whitening_limit, batch_count=1867356.0, ans=15.0 2023-06-25 00:29:29,311 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-25 00:29:52,748 INFO [train.py:996] (2/4) Epoch 11, batch 6300, loss[loss=0.2366, simple_loss=0.3221, pruned_loss=0.07557, over 21873.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3043, pruned_loss=0.0695, over 4275386.95 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:31:06,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1867656.0, ans=0.125 2023-06-25 00:31:10,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1867656.0, ans=0.125 2023-06-25 00:31:43,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1867776.0, ans=0.0 2023-06-25 00:31:44,179 INFO [train.py:996] (2/4) Epoch 11, batch 6350, loss[loss=0.2224, simple_loss=0.3002, pruned_loss=0.07237, over 21434.00 frames. ], tot_loss[loss=0.227, simple_loss=0.307, pruned_loss=0.07352, over 4282806.73 frames. ], batch size: 211, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:31:52,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.37 vs. 
limit=12.0 2023-06-25 00:32:03,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1867776.0, ans=0.1 2023-06-25 00:32:08,039 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.095e+02 6.705e+02 8.360e+02 1.250e+03 2.332e+03, threshold=1.672e+03, percent-clipped=0.0 2023-06-25 00:32:37,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1867896.0, ans=0.0 2023-06-25 00:32:41,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1867896.0, ans=0.1 2023-06-25 00:32:47,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1867956.0, ans=0.2 2023-06-25 00:32:49,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1867956.0, ans=0.0 2023-06-25 00:33:26,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1868016.0, ans=0.2 2023-06-25 00:33:37,582 INFO [train.py:996] (2/4) Epoch 11, batch 6400, loss[loss=0.2267, simple_loss=0.3013, pruned_loss=0.07604, over 21627.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3133, pruned_loss=0.07799, over 4283593.11 frames. ], batch size: 263, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:33:38,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1868076.0, ans=0.125 2023-06-25 00:33:47,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1868076.0, ans=0.0 2023-06-25 00:34:17,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1868196.0, ans=0.125 2023-06-25 00:35:26,294 INFO [train.py:996] (2/4) Epoch 11, batch 6450, loss[loss=0.2415, simple_loss=0.3296, pruned_loss=0.07671, over 21577.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3182, pruned_loss=0.07847, over 4283802.11 frames. ], batch size: 389, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:35:38,718 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-25 00:35:51,636 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 9.176e+02 1.134e+03 1.706e+03 4.418e+03, threshold=2.268e+03, percent-clipped=27.0 2023-06-25 00:36:29,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.33 vs. limit=8.0 2023-06-25 00:36:30,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1868556.0, ans=0.0 2023-06-25 00:37:13,875 INFO [train.py:996] (2/4) Epoch 11, batch 6500, loss[loss=0.272, simple_loss=0.3248, pruned_loss=0.1096, over 21347.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3134, pruned_loss=0.07788, over 4283882.25 frames. ], batch size: 507, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:37:17,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1868676.0, ans=0.0 2023-06-25 00:38:59,840 INFO [train.py:996] (2/4) Epoch 11, batch 6550, loss[loss=0.2116, simple_loss=0.2898, pruned_loss=0.06672, over 21775.00 frames. 
], tot_loss[loss=0.2332, simple_loss=0.3129, pruned_loss=0.07677, over 4285202.10 frames. ], batch size: 247, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:39:05,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1868976.0, ans=0.2 2023-06-25 00:39:19,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.39 vs. limit=10.0 2023-06-25 00:39:24,195 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 9.229e+02 1.425e+03 2.181e+03 3.625e+03, threshold=2.850e+03, percent-clipped=21.0 2023-06-25 00:39:25,192 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.05 vs. limit=15.0 2023-06-25 00:39:27,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1869036.0, ans=0.125 2023-06-25 00:40:15,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-25 00:40:47,110 INFO [train.py:996] (2/4) Epoch 11, batch 6600, loss[loss=0.2047, simple_loss=0.2698, pruned_loss=0.06977, over 21757.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.306, pruned_loss=0.07647, over 4265984.22 frames. ], batch size: 300, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:42:28,230 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-25 00:42:36,104 INFO [train.py:996] (2/4) Epoch 11, batch 6650, loss[loss=0.2392, simple_loss=0.3076, pruned_loss=0.08543, over 21557.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2982, pruned_loss=0.0735, over 4266074.61 frames. ], batch size: 442, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:43:06,914 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.219e+02 5.753e+02 7.174e+02 1.040e+03 2.181e+03, threshold=1.435e+03, percent-clipped=0.0 2023-06-25 00:43:23,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1869696.0, ans=0.0 2023-06-25 00:44:32,439 INFO [train.py:996] (2/4) Epoch 11, batch 6700, loss[loss=0.2099, simple_loss=0.2822, pruned_loss=0.06874, over 21655.00 frames. ], tot_loss[loss=0.219, simple_loss=0.2926, pruned_loss=0.07271, over 4257565.43 frames. 
], batch size: 247, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:45:07,243 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1869936.0, ans=0.0 2023-06-25 00:45:27,316 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 00:45:31,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1869996.0, ans=0.2 2023-06-25 00:45:34,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1870056.0, ans=0.0 2023-06-25 00:45:51,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1870056.0, ans=0.0 2023-06-25 00:46:05,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1870116.0, ans=0.125 2023-06-25 00:46:14,642 INFO [train.py:996] (2/4) Epoch 11, batch 6750, loss[loss=0.2208, simple_loss=0.2872, pruned_loss=0.07716, over 21820.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2914, pruned_loss=0.0732, over 4258724.26 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 8.0 2023-06-25 00:46:46,792 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.712e+02 8.208e+02 1.148e+03 1.600e+03 3.333e+03, threshold=2.296e+03, percent-clipped=33.0 2023-06-25 00:46:53,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1870236.0, ans=0.09899494936611666 2023-06-25 00:46:55,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1870236.0, ans=0.0 2023-06-25 00:46:57,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.10 vs. limit=15.0 2023-06-25 00:47:59,228 INFO [train.py:996] (2/4) Epoch 11, batch 6800, loss[loss=0.2284, simple_loss=0.2911, pruned_loss=0.08289, over 21578.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2932, pruned_loss=0.07538, over 4268640.07 frames. ], batch size: 414, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:48:01,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1870476.0, ans=0.125 2023-06-25 00:48:03,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.97 vs. limit=15.0 2023-06-25 00:48:43,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1870596.0, ans=0.04949747468305833 2023-06-25 00:48:58,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1870596.0, ans=0.125 2023-06-25 00:49:44,480 INFO [train.py:996] (2/4) Epoch 11, batch 6850, loss[loss=0.2303, simple_loss=0.2934, pruned_loss=0.08353, over 21216.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2929, pruned_loss=0.07603, over 4264863.97 frames. 
], batch size: 607, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:49:51,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1870776.0, ans=0.1 2023-06-25 00:50:16,079 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.911e+02 8.303e+02 1.235e+03 2.153e+03 3.729e+03, threshold=2.471e+03, percent-clipped=22.0 2023-06-25 00:51:07,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=22.5 2023-06-25 00:51:15,207 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.26 vs. limit=15.0 2023-06-25 00:51:31,183 INFO [train.py:996] (2/4) Epoch 11, batch 6900, loss[loss=0.2399, simple_loss=0.3005, pruned_loss=0.08961, over 21534.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2938, pruned_loss=0.0767, over 4272807.39 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:52:57,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1871256.0, ans=0.125 2023-06-25 00:52:59,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1871256.0, ans=0.125 2023-06-25 00:53:00,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1871256.0, ans=0.125 2023-06-25 00:53:27,049 INFO [train.py:996] (2/4) Epoch 11, batch 6950, loss[loss=0.1988, simple_loss=0.3113, pruned_loss=0.0432, over 21279.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2978, pruned_loss=0.07464, over 4278655.72 frames. ], batch size: 548, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:53:53,757 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.630e+02 7.235e+02 1.015e+03 1.522e+03 6.325e+03, threshold=2.030e+03, percent-clipped=9.0 2023-06-25 00:55:15,820 INFO [train.py:996] (2/4) Epoch 11, batch 7000, loss[loss=0.2419, simple_loss=0.3077, pruned_loss=0.08806, over 21698.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2994, pruned_loss=0.07672, over 4284256.27 frames. ], batch size: 441, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:55:55,314 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=15.75 vs. limit=15.0 2023-06-25 00:56:02,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.12 vs. limit=6.0 2023-06-25 00:56:13,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1871796.0, ans=0.1 2023-06-25 00:57:10,148 INFO [train.py:996] (2/4) Epoch 11, batch 7050, loss[loss=0.1739, simple_loss=0.2385, pruned_loss=0.05462, over 15947.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.2966, pruned_loss=0.07533, over 4275880.41 frames. 
], batch size: 60, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:57:37,678 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.311e+02 8.822e+02 1.310e+03 1.745e+03 4.662e+03, threshold=2.619e+03, percent-clipped=19.0 2023-06-25 00:57:56,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1872096.0, ans=0.0 2023-06-25 00:58:41,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1872216.0, ans=0.0 2023-06-25 00:58:49,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_na.min_abs, batch_count=1872216.0, ans=0.02 2023-06-25 00:59:02,521 INFO [train.py:996] (2/4) Epoch 11, batch 7100, loss[loss=0.2174, simple_loss=0.2937, pruned_loss=0.07052, over 20727.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3001, pruned_loss=0.07616, over 4277099.33 frames. ], batch size: 607, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 00:59:51,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1872396.0, ans=0.0 2023-06-25 00:59:52,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1872396.0, ans=0.5 2023-06-25 00:59:55,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1872396.0, ans=0.0 2023-06-25 01:00:50,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-25 01:00:51,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1872576.0, ans=0.125 2023-06-25 01:00:53,312 INFO [train.py:996] (2/4) Epoch 11, batch 7150, loss[loss=0.2751, simple_loss=0.346, pruned_loss=0.1021, over 21659.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2994, pruned_loss=0.07526, over 4275021.43 frames. ], batch size: 351, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:00:58,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1872576.0, ans=0.125 2023-06-25 01:01:01,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1872576.0, ans=0.125 2023-06-25 01:01:25,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.968e+02 7.662e+02 1.147e+03 1.671e+03 2.803e+03, threshold=2.294e+03, percent-clipped=2.0 2023-06-25 01:01:28,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1872636.0, ans=0.125 2023-06-25 01:01:35,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-25 01:02:51,308 INFO [train.py:996] (2/4) Epoch 11, batch 7200, loss[loss=0.2162, simple_loss=0.2822, pruned_loss=0.07515, over 21652.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3021, pruned_loss=0.07832, over 4282658.79 frames. 
], batch size: 298, lr: 2.69e-03, grad_scale: 32.0 2023-06-25 01:02:51,898 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1872876.0, ans=0.1 2023-06-25 01:03:15,031 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1872936.0, ans=0.5 2023-06-25 01:03:15,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1872936.0, ans=0.2 2023-06-25 01:04:09,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1873056.0, ans=0.1 2023-06-25 01:04:17,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1873116.0, ans=0.0 2023-06-25 01:04:39,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.06 vs. limit=15.0 2023-06-25 01:04:40,393 INFO [train.py:996] (2/4) Epoch 11, batch 7250, loss[loss=0.1772, simple_loss=0.244, pruned_loss=0.05519, over 21423.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.299, pruned_loss=0.07707, over 4283818.50 frames. ], batch size: 195, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:05:04,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1873236.0, ans=0.05 2023-06-25 01:05:04,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1873236.0, ans=0.2 2023-06-25 01:05:06,929 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.525e+02 1.021e+03 1.447e+03 2.035e+03 4.041e+03, threshold=2.893e+03, percent-clipped=18.0 2023-06-25 01:06:06,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1873416.0, ans=0.125 2023-06-25 01:06:08,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1873416.0, ans=0.025 2023-06-25 01:06:24,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1873416.0, ans=0.05 2023-06-25 01:06:27,155 INFO [train.py:996] (2/4) Epoch 11, batch 7300, loss[loss=0.2309, simple_loss=0.2903, pruned_loss=0.08571, over 21829.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2933, pruned_loss=0.0759, over 4285552.44 frames. 
], batch size: 107, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:06:32,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1873476.0, ans=0.125 2023-06-25 01:06:54,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1873536.0, ans=0.125 2023-06-25 01:07:43,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1873656.0, ans=0.125 2023-06-25 01:07:59,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1873716.0, ans=0.125 2023-06-25 01:08:02,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1873716.0, ans=0.1 2023-06-25 01:08:16,329 INFO [train.py:996] (2/4) Epoch 11, batch 7350, loss[loss=0.2699, simple_loss=0.3479, pruned_loss=0.09598, over 21745.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.292, pruned_loss=0.07623, over 4274793.38 frames. ], batch size: 124, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:08:43,117 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.949e+02 8.143e+02 1.181e+03 1.694e+03 4.027e+03, threshold=2.361e+03, percent-clipped=4.0 2023-06-25 01:09:45,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1874016.0, ans=0.125 2023-06-25 01:09:46,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.83 vs. limit=22.5 2023-06-25 01:10:11,695 INFO [train.py:996] (2/4) Epoch 11, batch 7400, loss[loss=0.2164, simple_loss=0.3099, pruned_loss=0.06145, over 21725.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.2962, pruned_loss=0.07745, over 4274783.78 frames. ], batch size: 332, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:10:27,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1874136.0, ans=0.125 2023-06-25 01:10:56,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1874136.0, ans=0.1 2023-06-25 01:11:09,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.77 vs. limit=15.0 2023-06-25 01:11:14,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1874196.0, ans=0.0 2023-06-25 01:11:31,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1874256.0, ans=0.125 2023-06-25 01:11:44,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1874316.0, ans=0.125 2023-06-25 01:12:03,183 INFO [train.py:996] (2/4) Epoch 11, batch 7450, loss[loss=0.2236, simple_loss=0.2869, pruned_loss=0.08016, over 21570.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.2945, pruned_loss=0.07646, over 4273135.83 frames. 
], batch size: 247, lr: 2.69e-03, grad_scale: 16.0 2023-06-25 01:12:06,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1874376.0, ans=0.125 2023-06-25 01:12:08,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1874376.0, ans=0.1 2023-06-25 01:12:22,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1874436.0, ans=0.0 2023-06-25 01:12:33,131 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.846e+02 7.768e+02 1.010e+03 1.629e+03 4.953e+03, threshold=2.020e+03, percent-clipped=6.0 2023-06-25 01:13:05,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1874496.0, ans=0.2 2023-06-25 01:13:44,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=1874616.0, ans=0.025 2023-06-25 01:13:54,443 INFO [train.py:996] (2/4) Epoch 11, batch 7500, loss[loss=0.2097, simple_loss=0.2639, pruned_loss=0.07773, over 20870.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2994, pruned_loss=0.07853, over 4273866.31 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:13:56,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1874676.0, ans=0.125 2023-06-25 01:14:29,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1874736.0, ans=0.0 2023-06-25 01:14:58,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1874856.0, ans=0.125 2023-06-25 01:15:25,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1874916.0, ans=0.125 2023-06-25 01:15:43,438 INFO [train.py:996] (2/4) Epoch 11, batch 7550, loss[loss=0.2186, simple_loss=0.3103, pruned_loss=0.06347, over 21690.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3077, pruned_loss=0.07817, over 4274758.69 frames. ], batch size: 298, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:15:53,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.min_positive, batch_count=1874976.0, ans=0.05 2023-06-25 01:16:14,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1875036.0, ans=0.125 2023-06-25 01:16:17,093 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.793e+02 9.851e+02 1.650e+03 2.404e+03 5.031e+03, threshold=3.301e+03, percent-clipped=35.0 2023-06-25 01:16:26,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1875036.0, ans=0.0 2023-06-25 01:16:26,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1875036.0, ans=0.07 2023-06-25 01:17:29,902 INFO [train.py:996] (2/4) Epoch 11, batch 7600, loss[loss=0.2277, simple_loss=0.3059, pruned_loss=0.07477, over 21758.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3064, pruned_loss=0.07643, over 4278253.44 frames. 
], batch size: 112, lr: 2.68e-03, grad_scale: 32.0 2023-06-25 01:18:16,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1875396.0, ans=0.125 2023-06-25 01:18:28,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1875396.0, ans=0.125 2023-06-25 01:18:39,464 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:18:54,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1875456.0, ans=0.2 2023-06-25 01:18:59,008 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.93 vs. limit=15.0 2023-06-25 01:19:04,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1875516.0, ans=0.1 2023-06-25 01:19:14,308 INFO [train.py:996] (2/4) Epoch 11, batch 7650, loss[loss=0.2514, simple_loss=0.3119, pruned_loss=0.09545, over 21952.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3068, pruned_loss=0.07886, over 4279542.10 frames. ], batch size: 316, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:19:16,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1875576.0, ans=0.2 2023-06-25 01:19:44,575 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.184e+02 7.567e+02 1.161e+03 1.543e+03 3.222e+03, threshold=2.322e+03, percent-clipped=0.0 2023-06-25 01:20:00,240 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:20:42,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1875816.0, ans=0.125 2023-06-25 01:20:50,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1875816.0, ans=0.125 2023-06-25 01:20:56,015 INFO [train.py:996] (2/4) Epoch 11, batch 7700, loss[loss=0.2038, simple_loss=0.2586, pruned_loss=0.0745, over 20819.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.31, pruned_loss=0.08173, over 4281233.37 frames. 
], batch size: 609, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:21:24,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1875936.0, ans=0.125 2023-06-25 01:21:33,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1875936.0, ans=0.125 2023-06-25 01:21:33,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=1875936.0, ans=0.05 2023-06-25 01:21:38,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1875996.0, ans=0.1 2023-06-25 01:22:13,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1876056.0, ans=0.0 2023-06-25 01:22:44,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1876176.0, ans=0.125 2023-06-25 01:22:45,823 INFO [train.py:996] (2/4) Epoch 11, batch 7750, loss[loss=0.2034, simple_loss=0.2995, pruned_loss=0.0537, over 20705.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3154, pruned_loss=0.08178, over 4279498.84 frames. ], batch size: 607, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:22:58,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1876176.0, ans=0.125 2023-06-25 01:23:03,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1876236.0, ans=0.125 2023-06-25 01:23:03,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1876236.0, ans=0.125 2023-06-25 01:23:10,555 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.928e+02 1.247e+03 1.821e+03 3.792e+03, threshold=2.494e+03, percent-clipped=12.0 2023-06-25 01:23:44,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1876296.0, ans=0.2 2023-06-25 01:24:30,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1876476.0, ans=0.125 2023-06-25 01:24:31,881 INFO [train.py:996] (2/4) Epoch 11, batch 7800, loss[loss=0.2031, simple_loss=0.2314, pruned_loss=0.0874, over 16599.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3167, pruned_loss=0.0819, over 4272630.27 frames. ], batch size: 60, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:24:42,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=1876476.0, ans=0.07 2023-06-25 01:24:53,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.06 vs. limit=15.0 2023-06-25 01:25:16,610 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.46 vs. 
limit=15.0 2023-06-25 01:26:04,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1876716.0, ans=0.125 2023-06-25 01:26:13,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=1876716.0, ans=6.0 2023-06-25 01:26:15,676 INFO [train.py:996] (2/4) Epoch 11, batch 7850, loss[loss=0.1857, simple_loss=0.2492, pruned_loss=0.06115, over 21150.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3101, pruned_loss=0.08143, over 4269675.29 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:26:27,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=1876776.0, ans=0.125 2023-06-25 01:26:46,361 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.852e+02 8.105e+02 1.212e+03 1.898e+03 4.667e+03, threshold=2.425e+03, percent-clipped=9.0 2023-06-25 01:27:02,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1876896.0, ans=0.125 2023-06-25 01:28:06,255 INFO [train.py:996] (2/4) Epoch 11, batch 7900, loss[loss=0.188, simple_loss=0.2493, pruned_loss=0.06334, over 21422.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3057, pruned_loss=0.08101, over 4264332.31 frames. ], batch size: 131, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:28:14,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1877076.0, ans=0.05 2023-06-25 01:28:22,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1877136.0, ans=0.125 2023-06-25 01:28:54,065 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-25 01:28:58,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1877196.0, ans=0.125 2023-06-25 01:29:45,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1877316.0, ans=0.0 2023-06-25 01:29:57,769 INFO [train.py:996] (2/4) Epoch 11, batch 7950, loss[loss=0.2407, simple_loss=0.326, pruned_loss=0.07772, over 21784.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3092, pruned_loss=0.0791, over 4258552.41 frames. ], batch size: 332, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:29:58,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1877376.0, ans=0.1 2023-06-25 01:30:35,719 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.494e+02 9.486e+02 1.599e+03 2.410e+03 5.026e+03, threshold=3.197e+03, percent-clipped=23.0 2023-06-25 01:32:03,926 INFO [train.py:996] (2/4) Epoch 11, batch 8000, loss[loss=0.3127, simple_loss=0.3895, pruned_loss=0.1179, over 21371.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3132, pruned_loss=0.08097, over 4261055.85 frames. ], batch size: 507, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:32:06,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.19 vs. 
limit=15.0 2023-06-25 01:32:19,261 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-25 01:32:21,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1877736.0, ans=0.125 2023-06-25 01:33:03,499 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 01:33:26,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1877856.0, ans=0.125 2023-06-25 01:33:56,979 INFO [train.py:996] (2/4) Epoch 11, batch 8050, loss[loss=0.3822, simple_loss=0.4449, pruned_loss=0.1597, over 21445.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3194, pruned_loss=0.08234, over 4265277.49 frames. ], batch size: 507, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:34:34,626 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 8.572e+02 1.267e+03 1.861e+03 4.173e+03, threshold=2.534e+03, percent-clipped=4.0 2023-06-25 01:34:40,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1878036.0, ans=0.125 2023-06-25 01:35:36,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.30 vs. limit=15.0 2023-06-25 01:35:45,695 INFO [train.py:996] (2/4) Epoch 11, batch 8100, loss[loss=0.2279, simple_loss=0.2902, pruned_loss=0.0828, over 21361.00 frames. ], tot_loss[loss=0.2422, simple_loss=0.3192, pruned_loss=0.08263, over 4275942.99 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:35:52,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1878276.0, ans=0.0 2023-06-25 01:36:31,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1878396.0, ans=0.125 2023-06-25 01:36:33,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1878396.0, ans=10.0 2023-06-25 01:37:22,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1878456.0, ans=0.1 2023-06-25 01:37:48,328 INFO [train.py:996] (2/4) Epoch 11, batch 8150, loss[loss=0.2031, simple_loss=0.2666, pruned_loss=0.06978, over 21276.00 frames. ], tot_loss[loss=0.2454, simple_loss=0.3241, pruned_loss=0.08337, over 4275911.44 frames. ], batch size: 159, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:38:17,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.023e+02 7.751e+02 1.218e+03 2.122e+03 5.445e+03, threshold=2.437e+03, percent-clipped=16.0 2023-06-25 01:38:31,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1878636.0, ans=0.0 2023-06-25 01:39:27,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1878816.0, ans=0.0 2023-06-25 01:39:39,813 INFO [train.py:996] (2/4) Epoch 11, batch 8200, loss[loss=0.2232, simple_loss=0.2837, pruned_loss=0.08139, over 21583.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3162, pruned_loss=0.08067, over 4273176.11 frames. 
], batch size: 415, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:39:41,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1878876.0, ans=0.125 2023-06-25 01:39:43,894 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-25 01:39:57,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1878936.0, ans=0.1 2023-06-25 01:40:36,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1878996.0, ans=0.1 2023-06-25 01:41:19,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1879116.0, ans=0.125 2023-06-25 01:41:28,944 INFO [train.py:996] (2/4) Epoch 11, batch 8250, loss[loss=0.3701, simple_loss=0.4186, pruned_loss=0.1608, over 21484.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3159, pruned_loss=0.08159, over 4271806.36 frames. ], batch size: 508, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:42:00,019 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.306e+02 1.035e+03 1.633e+03 3.565e+03, threshold=2.069e+03, percent-clipped=11.0 2023-06-25 01:42:02,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1879236.0, ans=0.0 2023-06-25 01:42:20,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.75 vs. limit=10.0 2023-06-25 01:42:26,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1879296.0, ans=0.035 2023-06-25 01:42:36,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1879356.0, ans=0.125 2023-06-25 01:42:47,069 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.40 vs. limit=15.0 2023-06-25 01:42:53,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1879416.0, ans=0.125 2023-06-25 01:43:04,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1879416.0, ans=0.125 2023-06-25 01:43:17,158 INFO [train.py:996] (2/4) Epoch 11, batch 8300, loss[loss=0.2506, simple_loss=0.3377, pruned_loss=0.08176, over 21617.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3146, pruned_loss=0.07857, over 4260550.49 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:43:28,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1879476.0, ans=0.125 2023-06-25 01:43:41,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1879536.0, ans=0.2 2023-06-25 01:45:04,800 INFO [train.py:996] (2/4) Epoch 11, batch 8350, loss[loss=0.211, simple_loss=0.2983, pruned_loss=0.06185, over 19896.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3145, pruned_loss=0.07731, over 4266097.54 frames. 
], batch size: 703, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:45:13,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1879776.0, ans=0.0 2023-06-25 01:45:44,600 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.048e+02 7.785e+02 1.165e+03 1.706e+03 3.630e+03, threshold=2.331e+03, percent-clipped=15.0 2023-06-25 01:46:33,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1879956.0, ans=0.125 2023-06-25 01:46:52,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1880076.0, ans=15.0 2023-06-25 01:46:53,163 INFO [train.py:996] (2/4) Epoch 11, batch 8400, loss[loss=0.2333, simple_loss=0.3283, pruned_loss=0.06918, over 21615.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3111, pruned_loss=0.0742, over 4260913.95 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:46:55,611 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.96 vs. limit=15.0 2023-06-25 01:47:13,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1880136.0, ans=0.0 2023-06-25 01:47:30,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1880136.0, ans=0.125 2023-06-25 01:48:25,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1880316.0, ans=0.0 2023-06-25 01:48:41,830 INFO [train.py:996] (2/4) Epoch 11, batch 8450, loss[loss=0.2011, simple_loss=0.2833, pruned_loss=0.05946, over 21510.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3093, pruned_loss=0.07448, over 4265988.53 frames. ], batch size: 212, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:48:47,409 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1880376.0, ans=0.125 2023-06-25 01:49:20,510 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.263e+02 6.433e+02 1.170e+03 1.916e+03 4.574e+03, threshold=2.341e+03, percent-clipped=17.0 2023-06-25 01:49:45,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1880496.0, ans=0.0 2023-06-25 01:49:56,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1880556.0, ans=0.1 2023-06-25 01:49:58,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1880556.0, ans=0.1 2023-06-25 01:50:14,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1880616.0, ans=0.0 2023-06-25 01:50:18,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1880616.0, ans=0.125 2023-06-25 01:50:30,038 INFO [train.py:996] (2/4) Epoch 11, batch 8500, loss[loss=0.2411, simple_loss=0.306, pruned_loss=0.08805, over 21686.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3042, pruned_loss=0.07542, over 4261937.26 frames. 
], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:50:41,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1880676.0, ans=0.125 2023-06-25 01:51:31,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1880796.0, ans=0.0 2023-06-25 01:52:18,588 INFO [train.py:996] (2/4) Epoch 11, batch 8550, loss[loss=0.2547, simple_loss=0.3331, pruned_loss=0.08815, over 21748.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3104, pruned_loss=0.07919, over 4265536.35 frames. ], batch size: 247, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 01:52:29,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1880976.0, ans=0.2 2023-06-25 01:52:51,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1881036.0, ans=0.125 2023-06-25 01:52:56,693 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.844e+02 6.860e+02 9.469e+02 1.395e+03 3.551e+03, threshold=1.894e+03, percent-clipped=10.0 2023-06-25 01:54:20,717 INFO [train.py:996] (2/4) Epoch 11, batch 8600, loss[loss=0.2935, simple_loss=0.3646, pruned_loss=0.1112, over 21361.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3187, pruned_loss=0.08158, over 4268382.04 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:54:21,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1881276.0, ans=0.2 2023-06-25 01:55:03,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1881396.0, ans=0.125 2023-06-25 01:55:12,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1881396.0, ans=0.2 2023-06-25 01:55:21,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1881396.0, ans=0.2 2023-06-25 01:55:29,380 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-25 01:56:09,632 INFO [train.py:996] (2/4) Epoch 11, batch 8650, loss[loss=0.2377, simple_loss=0.3418, pruned_loss=0.06674, over 21768.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3236, pruned_loss=0.08181, over 4264177.65 frames. ], batch size: 282, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:56:32,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1881636.0, ans=0.2 2023-06-25 01:56:39,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1881636.0, ans=0.0 2023-06-25 01:56:43,773 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.081e+02 8.530e+02 1.308e+03 2.199e+03 5.345e+03, threshold=2.615e+03, percent-clipped=30.0 2023-06-25 01:57:13,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1881756.0, ans=0.125 2023-06-25 01:57:52,019 INFO [train.py:996] (2/4) Epoch 11, batch 8700, loss[loss=0.2045, simple_loss=0.274, pruned_loss=0.06747, over 21371.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3145, pruned_loss=0.07763, over 4263706.36 frames. 
], batch size: 131, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:58:01,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.80 vs. limit=22.5 2023-06-25 01:58:22,155 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1881936.0, ans=0.125 2023-06-25 01:58:30,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1881936.0, ans=0.125 2023-06-25 01:58:35,889 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1881996.0, ans=0.125 2023-06-25 01:58:38,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1881996.0, ans=10.0 2023-06-25 01:58:58,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1882056.0, ans=0.0 2023-06-25 01:59:02,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.14 vs. limit=22.5 2023-06-25 01:59:03,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1882056.0, ans=0.125 2023-06-25 01:59:03,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1882056.0, ans=0.1 2023-06-25 01:59:07,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1882056.0, ans=0.125 2023-06-25 01:59:38,922 INFO [train.py:996] (2/4) Epoch 11, batch 8750, loss[loss=0.2213, simple_loss=0.2877, pruned_loss=0.07747, over 21467.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3119, pruned_loss=0.07819, over 4261702.75 frames. ], batch size: 144, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 01:59:40,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=13.55 vs. limit=22.5 2023-06-25 01:59:50,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1882176.0, ans=0.125 2023-06-25 02:00:25,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.601e+02 8.552e+02 1.572e+03 2.395e+03 4.841e+03, threshold=3.145e+03, percent-clipped=19.0 2023-06-25 02:00:29,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.32 vs. limit=15.0 2023-06-25 02:00:46,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.40 vs. limit=15.0 2023-06-25 02:00:57,215 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.33 vs. limit=12.0 2023-06-25 02:01:17,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1882416.0, ans=0.2 2023-06-25 02:01:32,845 INFO [train.py:996] (2/4) Epoch 11, batch 8800, loss[loss=0.2132, simple_loss=0.2899, pruned_loss=0.06824, over 20799.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3212, pruned_loss=0.08003, over 4257584.33 frames. 
], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:02:26,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1882596.0, ans=0.0 2023-06-25 02:02:30,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1882596.0, ans=0.125 2023-06-25 02:02:49,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1882656.0, ans=0.125 2023-06-25 02:03:26,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1882776.0, ans=0.0 2023-06-25 02:03:27,975 INFO [train.py:996] (2/4) Epoch 11, batch 8850, loss[loss=0.271, simple_loss=0.3367, pruned_loss=0.1026, over 21557.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3262, pruned_loss=0.08211, over 4253611.02 frames. ], batch size: 441, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:04:04,660 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 8.533e+02 1.157e+03 2.147e+03 4.267e+03, threshold=2.313e+03, percent-clipped=8.0 2023-06-25 02:04:20,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.62 vs. limit=6.0 2023-06-25 02:04:41,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=1882956.0, ans=0.125 2023-06-25 02:05:17,223 INFO [train.py:996] (2/4) Epoch 11, batch 8900, loss[loss=0.2051, simple_loss=0.2714, pruned_loss=0.06945, over 21432.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3183, pruned_loss=0.08036, over 4259035.50 frames. ], batch size: 194, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:05:19,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1883076.0, ans=0.125 2023-06-25 02:05:37,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1883076.0, ans=0.0 2023-06-25 02:06:47,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1883256.0, ans=0.125 2023-06-25 02:06:54,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=1883316.0, ans=0.05 2023-06-25 02:07:13,526 INFO [train.py:996] (2/4) Epoch 11, batch 8950, loss[loss=0.2722, simple_loss=0.3594, pruned_loss=0.09248, over 21614.00 frames. ], tot_loss[loss=0.2395, simple_loss=0.3194, pruned_loss=0.07985, over 4261987.05 frames. 
], batch size: 414, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:07:48,284 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.752e+02 7.956e+02 1.198e+03 2.154e+03 4.592e+03, threshold=2.397e+03, percent-clipped=22.0 2023-06-25 02:07:55,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1883496.0, ans=0.0 2023-06-25 02:08:16,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1883556.0, ans=0.0 2023-06-25 02:08:36,335 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:08:39,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1883616.0, ans=0.0 2023-06-25 02:08:53,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1883676.0, ans=0.0 2023-06-25 02:08:55,069 INFO [train.py:996] (2/4) Epoch 11, batch 9000, loss[loss=0.2284, simple_loss=0.2983, pruned_loss=0.07923, over 21535.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3124, pruned_loss=0.07991, over 4265240.91 frames. ], batch size: 195, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:08:55,070 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 02:09:07,967 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([5.5419, 4.8202, 5.0296, 5.3254], device='cuda:2') 2023-06-25 02:09:12,571 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2589, simple_loss=0.3526, pruned_loss=0.08262, over 1796401.00 frames. 2023-06-25 02:09:12,571 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 02:09:42,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1883736.0, ans=0.04949747468305833 2023-06-25 02:10:13,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1883796.0, ans=0.2 2023-06-25 02:10:20,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1883856.0, ans=0.05 2023-06-25 02:11:00,336 INFO [train.py:996] (2/4) Epoch 11, batch 9050, loss[loss=0.2004, simple_loss=0.2824, pruned_loss=0.05919, over 21536.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3065, pruned_loss=0.07623, over 4262564.04 frames. ], batch size: 230, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:11:09,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1883976.0, ans=0.1 2023-06-25 02:11:36,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1884036.0, ans=0.0 2023-06-25 02:11:44,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.917e+02 7.004e+02 1.025e+03 1.804e+03 4.936e+03, threshold=2.049e+03, percent-clipped=10.0 2023-06-25 02:12:50,832 INFO [train.py:996] (2/4) Epoch 11, batch 9100, loss[loss=0.2695, simple_loss=0.3438, pruned_loss=0.09756, over 21710.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3143, pruned_loss=0.07991, over 4255315.72 frames. 
], batch size: 332, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:13:31,253 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=22.5 2023-06-25 02:14:11,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1884456.0, ans=0.2 2023-06-25 02:14:37,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.98 vs. limit=8.0 2023-06-25 02:14:40,242 INFO [train.py:996] (2/4) Epoch 11, batch 9150, loss[loss=0.2586, simple_loss=0.382, pruned_loss=0.06756, over 19699.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3176, pruned_loss=0.07785, over 4261209.43 frames. ], batch size: 702, lr: 2.68e-03, grad_scale: 8.0 2023-06-25 02:15:09,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1884636.0, ans=0.2 2023-06-25 02:15:21,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 1.034e+03 1.434e+03 2.123e+03 3.847e+03, threshold=2.868e+03, percent-clipped=26.0 2023-06-25 02:15:41,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-25 02:15:49,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1884756.0, ans=0.0 2023-06-25 02:16:33,367 INFO [train.py:996] (2/4) Epoch 11, batch 9200, loss[loss=0.2372, simple_loss=0.3359, pruned_loss=0.06928, over 21048.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3192, pruned_loss=0.07654, over 4265234.53 frames. ], batch size: 608, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:17:01,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1884936.0, ans=0.0 2023-06-25 02:18:20,287 INFO [train.py:996] (2/4) Epoch 11, batch 9250, loss[loss=0.2085, simple_loss=0.2708, pruned_loss=0.07307, over 21218.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3231, pruned_loss=0.07968, over 4266022.77 frames. ], batch size: 176, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:18:35,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1885176.0, ans=0.125 2023-06-25 02:18:50,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1885236.0, ans=0.125 2023-06-25 02:18:51,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1885236.0, ans=0.0 2023-06-25 02:18:56,156 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.130e+02 8.312e+02 1.043e+03 1.613e+03 4.110e+03, threshold=2.085e+03, percent-clipped=7.0 2023-06-25 02:20:14,359 INFO [train.py:996] (2/4) Epoch 11, batch 9300, loss[loss=0.2226, simple_loss=0.2982, pruned_loss=0.07347, over 21326.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3168, pruned_loss=0.079, over 4265788.81 frames. ], batch size: 131, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:20:30,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.71 vs. 
limit=15.0 2023-06-25 02:21:02,914 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=22.5 2023-06-25 02:21:07,126 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 02:21:22,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1885656.0, ans=0.0 2023-06-25 02:21:37,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1885716.0, ans=0.125 2023-06-25 02:21:53,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1885716.0, ans=0.125 2023-06-25 02:22:02,525 INFO [train.py:996] (2/4) Epoch 11, batch 9350, loss[loss=0.3185, simple_loss=0.3881, pruned_loss=0.1245, over 21434.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3204, pruned_loss=0.07983, over 4262932.18 frames. ], batch size: 471, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:22:07,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1885776.0, ans=0.125 2023-06-25 02:22:34,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1885836.0, ans=0.1 2023-06-25 02:22:41,003 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.377e+02 8.449e+02 1.377e+03 2.044e+03 3.190e+03, threshold=2.753e+03, percent-clipped=23.0 2023-06-25 02:22:58,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.35 vs. limit=10.0 2023-06-25 02:22:58,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.14 vs. limit=8.0 2023-06-25 02:23:01,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1885896.0, ans=0.0 2023-06-25 02:23:02,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1885896.0, ans=0.07 2023-06-25 02:23:52,734 INFO [train.py:996] (2/4) Epoch 11, batch 9400, loss[loss=0.2342, simple_loss=0.294, pruned_loss=0.0872, over 21533.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3227, pruned_loss=0.08097, over 4268011.70 frames. ], batch size: 230, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:24:20,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1886136.0, ans=0.1 2023-06-25 02:24:32,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1886136.0, ans=0.125 2023-06-25 02:25:16,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1886256.0, ans=0.2 2023-06-25 02:25:23,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=6.88 vs. 
limit=22.5 2023-06-25 02:25:29,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1886316.0, ans=0.125 2023-06-25 02:25:30,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1886316.0, ans=0.0 2023-06-25 02:25:44,604 INFO [train.py:996] (2/4) Epoch 11, batch 9450, loss[loss=0.2149, simple_loss=0.2865, pruned_loss=0.07164, over 21342.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3145, pruned_loss=0.07919, over 4270589.50 frames. ], batch size: 131, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:26:20,762 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 9.191e+02 1.408e+03 2.175e+03 4.648e+03, threshold=2.816e+03, percent-clipped=10.0 2023-06-25 02:26:42,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1886496.0, ans=0.035 2023-06-25 02:27:33,356 INFO [train.py:996] (2/4) Epoch 11, batch 9500, loss[loss=0.2164, simple_loss=0.2948, pruned_loss=0.06902, over 21841.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3072, pruned_loss=0.07757, over 4268441.97 frames. ], batch size: 124, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:28:07,219 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-25 02:28:54,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1886856.0, ans=0.125 2023-06-25 02:28:56,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1886856.0, ans=0.1 2023-06-25 02:29:13,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.06 vs. limit=22.5 2023-06-25 02:29:22,384 INFO [train.py:996] (2/4) Epoch 11, batch 9550, loss[loss=0.2465, simple_loss=0.3196, pruned_loss=0.08667, over 21381.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3113, pruned_loss=0.07909, over 4256878.55 frames. ], batch size: 131, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:29:24,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1886976.0, ans=0.125 2023-06-25 02:29:57,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.824e+02 8.918e+02 1.397e+03 2.020e+03 4.656e+03, threshold=2.794e+03, percent-clipped=11.0 2023-06-25 02:30:01,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1887096.0, ans=0.125 2023-06-25 02:30:30,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1887156.0, ans=0.125 2023-06-25 02:31:08,639 INFO [train.py:996] (2/4) Epoch 11, batch 9600, loss[loss=0.2312, simple_loss=0.3091, pruned_loss=0.07669, over 21422.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3163, pruned_loss=0.08122, over 4263994.16 frames. 
], batch size: 131, lr: 2.68e-03, grad_scale: 32.0 2023-06-25 02:31:17,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1887276.0, ans=0.125 2023-06-25 02:31:22,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1887276.0, ans=0.2 2023-06-25 02:31:24,557 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-25 02:31:40,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1887336.0, ans=0.0 2023-06-25 02:31:46,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.49 vs. limit=5.0 2023-06-25 02:32:04,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1887396.0, ans=0.035 2023-06-25 02:32:56,738 INFO [train.py:996] (2/4) Epoch 11, batch 9650, loss[loss=0.2505, simple_loss=0.3322, pruned_loss=0.08441, over 21583.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3162, pruned_loss=0.08133, over 4272326.82 frames. ], batch size: 389, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:32:57,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1887576.0, ans=0.0 2023-06-25 02:33:06,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1887576.0, ans=0.0 2023-06-25 02:33:19,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1887636.0, ans=0.125 2023-06-25 02:33:34,592 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.507e+02 8.589e+02 1.260e+03 1.923e+03 2.986e+03, threshold=2.520e+03, percent-clipped=3.0 2023-06-25 02:34:17,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1887756.0, ans=0.1 2023-06-25 02:34:45,488 INFO [train.py:996] (2/4) Epoch 11, batch 9700, loss[loss=0.296, simple_loss=0.3536, pruned_loss=0.1192, over 21637.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3187, pruned_loss=0.08186, over 4279880.51 frames. ], batch size: 508, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:35:04,182 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.72 vs. limit=22.5 2023-06-25 02:35:44,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1887996.0, ans=0.0 2023-06-25 02:35:55,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1888056.0, ans=0.125 2023-06-25 02:36:31,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.10 vs. limit=15.0 2023-06-25 02:36:32,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1888176.0, ans=0.125 2023-06-25 02:36:34,150 INFO [train.py:996] (2/4) Epoch 11, batch 9750, loss[loss=0.2124, simple_loss=0.2815, pruned_loss=0.07168, over 21513.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3114, pruned_loss=0.08018, over 4271963.36 frames. 
], batch size: 391, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:37:09,419 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.320e+02 8.182e+02 1.091e+03 1.675e+03 6.818e+03, threshold=2.183e+03, percent-clipped=8.0 2023-06-25 02:37:39,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1888356.0, ans=0.125 2023-06-25 02:38:04,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.87 vs. limit=22.5 2023-06-25 02:38:19,327 INFO [train.py:996] (2/4) Epoch 11, batch 9800, loss[loss=0.2293, simple_loss=0.2917, pruned_loss=0.08346, over 21573.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3132, pruned_loss=0.08081, over 4263051.42 frames. ], batch size: 263, lr: 2.68e-03, grad_scale: 16.0 2023-06-25 02:38:23,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1888476.0, ans=0.0 2023-06-25 02:38:26,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1888476.0, ans=0.0 2023-06-25 02:39:06,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1888596.0, ans=0.125 2023-06-25 02:39:48,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1888716.0, ans=0.125 2023-06-25 02:40:05,028 INFO [train.py:996] (2/4) Epoch 11, batch 9850, loss[loss=0.2079, simple_loss=0.2638, pruned_loss=0.07602, over 21589.00 frames. ], tot_loss[loss=0.236, simple_loss=0.31, pruned_loss=0.08097, over 4263882.73 frames. ], batch size: 195, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:40:35,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1888836.0, ans=0.125 2023-06-25 02:40:41,991 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 6.727e+02 9.053e+02 1.353e+03 2.861e+03, threshold=1.811e+03, percent-clipped=2.0 2023-06-25 02:41:33,391 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889016.0, ans=0.1 2023-06-25 02:41:53,309 INFO [train.py:996] (2/4) Epoch 11, batch 9900, loss[loss=0.1973, simple_loss=0.2596, pruned_loss=0.06755, over 20638.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3055, pruned_loss=0.08023, over 4258308.20 frames. ], batch size: 607, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:42:02,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1889076.0, ans=0.07 2023-06-25 02:43:05,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1889256.0, ans=0.125 2023-06-25 02:43:40,110 INFO [train.py:996] (2/4) Epoch 11, batch 9950, loss[loss=0.2265, simple_loss=0.2984, pruned_loss=0.07728, over 21342.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3061, pruned_loss=0.08187, over 4247976.37 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:43:57,679 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.37 vs. 
limit=15.0 2023-06-25 02:44:23,420 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.842e+02 7.849e+02 1.088e+03 1.572e+03 3.841e+03, threshold=2.175e+03, percent-clipped=17.0 2023-06-25 02:45:19,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1889616.0, ans=0.1 2023-06-25 02:45:27,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1889616.0, ans=0.125 2023-06-25 02:45:36,539 INFO [train.py:996] (2/4) Epoch 11, batch 10000, loss[loss=0.2167, simple_loss=0.2899, pruned_loss=0.07169, over 21105.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3031, pruned_loss=0.08078, over 4247255.19 frames. ], batch size: 607, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 02:46:23,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1889796.0, ans=0.125 2023-06-25 02:47:25,690 INFO [train.py:996] (2/4) Epoch 11, batch 10050, loss[loss=0.1978, simple_loss=0.2699, pruned_loss=0.06286, over 21606.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3039, pruned_loss=0.08027, over 4258618.27 frames. ], batch size: 112, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:47:45,676 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.43 vs. limit=22.5 2023-06-25 02:48:13,114 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.178e+02 7.347e+02 1.195e+03 1.566e+03 3.839e+03, threshold=2.391e+03, percent-clipped=10.0 2023-06-25 02:48:15,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1890096.0, ans=0.125 2023-06-25 02:48:44,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1890156.0, ans=0.125 2023-06-25 02:49:16,346 INFO [train.py:996] (2/4) Epoch 11, batch 10100, loss[loss=0.2044, simple_loss=0.2641, pruned_loss=0.07238, over 21259.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.301, pruned_loss=0.07796, over 4264979.79 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:49:45,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1890336.0, ans=0.125 2023-06-25 02:50:10,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1890396.0, ans=0.125 2023-06-25 02:50:24,141 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1890396.0, ans=0.04949747468305833 2023-06-25 02:50:24,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1890396.0, ans=0.1 2023-06-25 02:50:48,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1890516.0, ans=0.125 2023-06-25 02:50:54,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1890516.0, ans=0.0 2023-06-25 02:51:12,551 INFO [train.py:996] (2/4) Epoch 11, batch 10150, loss[loss=0.2384, simple_loss=0.295, pruned_loss=0.09089, over 21807.00 frames. 
], tot_loss[loss=0.2328, simple_loss=0.3057, pruned_loss=0.07995, over 4268554.89 frames. ], batch size: 98, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:51:41,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1890636.0, ans=0.1 2023-06-25 02:51:59,199 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.141e+02 7.484e+02 1.008e+03 1.435e+03 3.139e+03, threshold=2.017e+03, percent-clipped=8.0 2023-06-25 02:51:59,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1890696.0, ans=0.125 2023-06-25 02:52:03,002 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1890696.0, ans=0.0 2023-06-25 02:52:18,409 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.14 vs. limit=15.0 2023-06-25 02:52:44,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1890816.0, ans=0.1 2023-06-25 02:52:54,884 INFO [train.py:996] (2/4) Epoch 11, batch 10200, loss[loss=0.2249, simple_loss=0.2924, pruned_loss=0.07868, over 21751.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3043, pruned_loss=0.07765, over 4271325.34 frames. ], batch size: 124, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:53:28,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1890936.0, ans=0.0 2023-06-25 02:53:44,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1890996.0, ans=0.125 2023-06-25 02:53:52,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.38 vs. limit=15.0 2023-06-25 02:54:18,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1891056.0, ans=0.2 2023-06-25 02:54:47,842 INFO [train.py:996] (2/4) Epoch 11, batch 10250, loss[loss=0.2401, simple_loss=0.3247, pruned_loss=0.0777, over 21516.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2983, pruned_loss=0.07272, over 4258480.23 frames. ], batch size: 509, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:55:38,019 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.531e+02 8.201e+02 1.215e+03 1.712e+03 3.588e+03, threshold=2.431e+03, percent-clipped=17.0 2023-06-25 02:56:46,822 INFO [train.py:996] (2/4) Epoch 11, batch 10300, loss[loss=0.2485, simple_loss=0.3536, pruned_loss=0.07166, over 21637.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3026, pruned_loss=0.07402, over 4267541.44 frames. ], batch size: 414, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 02:57:02,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1891536.0, ans=0.5 2023-06-25 02:57:02,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.67 vs. 
limit=15.0 2023-06-25 02:57:10,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1891536.0, ans=0.035 2023-06-25 02:57:28,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1891596.0, ans=0.1 2023-06-25 02:57:40,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.82 vs. limit=22.5 2023-06-25 02:58:38,050 INFO [train.py:996] (2/4) Epoch 11, batch 10350, loss[loss=0.2317, simple_loss=0.3167, pruned_loss=0.07338, over 21714.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3041, pruned_loss=0.07413, over 4260577.76 frames. ], batch size: 415, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 02:59:25,192 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.545e+02 9.168e+02 1.323e+03 1.995e+03 3.228e+03, threshold=2.646e+03, percent-clipped=12.0 2023-06-25 03:00:32,824 INFO [train.py:996] (2/4) Epoch 11, batch 10400, loss[loss=0.243, simple_loss=0.3189, pruned_loss=0.08349, over 21740.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.2995, pruned_loss=0.07351, over 4257743.16 frames. ], batch size: 391, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:00:40,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=1892076.0, ans=0.0 2023-06-25 03:01:09,818 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.51 vs. limit=15.0 2023-06-25 03:01:14,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1892196.0, ans=0.125 2023-06-25 03:01:33,962 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=15.0 2023-06-25 03:01:57,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1892256.0, ans=0.0 2023-06-25 03:02:15,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1892316.0, ans=0.125 2023-06-25 03:02:18,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1892316.0, ans=0.0 2023-06-25 03:02:22,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1892376.0, ans=0.2 2023-06-25 03:02:23,358 INFO [train.py:996] (2/4) Epoch 11, batch 10450, loss[loss=0.2669, simple_loss=0.3392, pruned_loss=0.0973, over 21630.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3037, pruned_loss=0.07623, over 4263499.86 frames. 
], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:02:48,076 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1892436.0, ans=0.1 2023-06-25 03:03:04,280 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.499e+02 9.525e+02 1.455e+03 2.411e+03 5.571e+03, threshold=2.910e+03, percent-clipped=19.0 2023-06-25 03:03:22,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1892496.0, ans=0.125 2023-06-25 03:03:53,007 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.88 vs. limit=12.0 2023-06-25 03:03:59,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1892616.0, ans=0.125 2023-06-25 03:04:11,705 INFO [train.py:996] (2/4) Epoch 11, batch 10500, loss[loss=0.243, simple_loss=0.3406, pruned_loss=0.07266, over 20814.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3034, pruned_loss=0.07528, over 4264266.37 frames. ], batch size: 608, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:04:45,268 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:05:01,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1892796.0, ans=0.125 2023-06-25 03:05:32,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1892916.0, ans=0.0 2023-06-25 03:05:46,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1892916.0, ans=15.0 2023-06-25 03:05:47,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1892916.0, ans=0.95 2023-06-25 03:05:47,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1892916.0, ans=0.125 2023-06-25 03:05:57,583 INFO [train.py:996] (2/4) Epoch 11, batch 10550, loss[loss=0.2032, simple_loss=0.2704, pruned_loss=0.06797, over 21662.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2982, pruned_loss=0.07533, over 4266442.52 frames. ], batch size: 333, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:05:58,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.20 vs. 
limit=10.0 2023-06-25 03:06:23,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1893036.0, ans=0.125 2023-06-25 03:06:39,254 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 7.309e+02 9.989e+02 1.510e+03 3.276e+03, threshold=1.998e+03, percent-clipped=4.0 2023-06-25 03:06:43,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1893096.0, ans=0.125 2023-06-25 03:06:54,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1893096.0, ans=0.1 2023-06-25 03:07:14,067 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 03:07:50,987 INFO [train.py:996] (2/4) Epoch 11, batch 10600, loss[loss=0.2338, simple_loss=0.3298, pruned_loss=0.06889, over 21623.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.294, pruned_loss=0.07424, over 4265558.93 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:07:53,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.14 vs. limit=15.0 2023-06-25 03:09:02,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1893456.0, ans=0.2 2023-06-25 03:09:39,365 INFO [train.py:996] (2/4) Epoch 11, batch 10650, loss[loss=0.202, simple_loss=0.2969, pruned_loss=0.05358, over 21666.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2978, pruned_loss=0.07285, over 4262307.02 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:09:39,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=1893576.0, ans=0.025 2023-06-25 03:09:49,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1893576.0, ans=0.125 2023-06-25 03:10:09,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1893636.0, ans=0.125 2023-06-25 03:10:11,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1893636.0, ans=10.0 2023-06-25 03:10:23,986 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.435e+02 7.785e+02 1.184e+03 1.890e+03 4.480e+03, threshold=2.368e+03, percent-clipped=23.0 2023-06-25 03:10:28,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.82 vs. limit=15.0 2023-06-25 03:11:22,672 INFO [train.py:996] (2/4) Epoch 11, batch 10700, loss[loss=0.2809, simple_loss=0.3547, pruned_loss=0.1036, over 21423.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.299, pruned_loss=0.07407, over 4263845.82 frames. 
], batch size: 131, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:11:23,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=1893876.0, ans=0.125 2023-06-25 03:11:26,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1893876.0, ans=0.125 2023-06-25 03:11:34,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1893876.0, ans=0.125 2023-06-25 03:11:56,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1893936.0, ans=0.2 2023-06-25 03:12:00,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1893936.0, ans=0.035 2023-06-25 03:12:09,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.97 vs. limit=6.0 2023-06-25 03:13:05,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1894116.0, ans=0.125 2023-06-25 03:13:09,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=13.03 vs. limit=15.0 2023-06-25 03:13:09,987 INFO [train.py:996] (2/4) Epoch 11, batch 10750, loss[loss=0.2426, simple_loss=0.3279, pruned_loss=0.07866, over 21808.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3099, pruned_loss=0.07843, over 4267996.54 frames. ], batch size: 124, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:13:10,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1894176.0, ans=0.2 2023-06-25 03:13:33,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1894176.0, ans=0.2 2023-06-25 03:13:59,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=4.03 vs. limit=12.0 2023-06-25 03:14:03,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1894296.0, ans=0.125 2023-06-25 03:14:06,440 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.572e+02 8.726e+02 1.242e+03 1.937e+03 5.296e+03, threshold=2.484e+03, percent-clipped=18.0 2023-06-25 03:15:11,145 INFO [train.py:996] (2/4) Epoch 11, batch 10800, loss[loss=0.265, simple_loss=0.3392, pruned_loss=0.09536, over 20642.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3127, pruned_loss=0.078, over 4271576.58 frames. ], batch size: 607, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:15:17,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1894476.0, ans=0.0 2023-06-25 03:16:17,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1894656.0, ans=0.125 2023-06-25 03:16:27,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1894656.0, ans=0.1 2023-06-25 03:16:52,709 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.56 vs. 
limit=15.0 2023-06-25 03:16:59,869 INFO [train.py:996] (2/4) Epoch 11, batch 10850, loss[loss=0.2029, simple_loss=0.2902, pruned_loss=0.05778, over 21210.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3129, pruned_loss=0.07793, over 4276694.95 frames. ], batch size: 176, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:17:07,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1894776.0, ans=0.0 2023-06-25 03:17:48,726 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.535e+02 7.569e+02 9.387e+02 1.863e+03 6.222e+03, threshold=1.877e+03, percent-clipped=9.0 2023-06-25 03:17:49,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1894896.0, ans=0.0 2023-06-25 03:18:08,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1894956.0, ans=0.125 2023-06-25 03:18:42,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1895016.0, ans=0.1 2023-06-25 03:18:46,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1895016.0, ans=0.125 2023-06-25 03:18:50,659 INFO [train.py:996] (2/4) Epoch 11, batch 10900, loss[loss=0.2177, simple_loss=0.3263, pruned_loss=0.05448, over 19997.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3057, pruned_loss=0.0762, over 4275755.05 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:18:51,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1895076.0, ans=0.2 2023-06-25 03:19:03,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1895076.0, ans=0.0 2023-06-25 03:19:07,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1895076.0, ans=0.125 2023-06-25 03:19:32,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1895136.0, ans=0.125 2023-06-25 03:19:50,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1895196.0, ans=0.125 2023-06-25 03:20:37,854 INFO [train.py:996] (2/4) Epoch 11, batch 10950, loss[loss=0.1926, simple_loss=0.2603, pruned_loss=0.06248, over 21825.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3015, pruned_loss=0.07383, over 4265794.14 frames. ], batch size: 98, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:21:00,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.out_whiten.whitening_limit, batch_count=1895436.0, ans=8.0 2023-06-25 03:21:18,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1895436.0, ans=0.2 2023-06-25 03:21:26,496 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.792e+02 7.087e+02 9.989e+02 1.560e+03 2.958e+03, threshold=1.998e+03, percent-clipped=15.0 2023-06-25 03:22:21,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1895616.0, ans=0.125 2023-06-25 03:22:25,899 INFO [train.py:996] (2/4) Epoch 11, batch 11000, loss[loss=0.2629, simple_loss=0.3231, pruned_loss=0.1013, over 21570.00 frames. 
], tot_loss[loss=0.225, simple_loss=0.3008, pruned_loss=0.07456, over 4260972.29 frames. ], batch size: 471, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:22:42,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1895676.0, ans=0.0 2023-06-25 03:22:51,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1895676.0, ans=0.07 2023-06-25 03:23:23,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1895796.0, ans=0.0 2023-06-25 03:23:48,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1895916.0, ans=0.05 2023-06-25 03:24:12,458 INFO [train.py:996] (2/4) Epoch 11, batch 11050, loss[loss=0.2243, simple_loss=0.2872, pruned_loss=0.08075, over 22007.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.2984, pruned_loss=0.07591, over 4263520.44 frames. ], batch size: 103, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:24:45,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1896036.0, ans=0.0 2023-06-25 03:24:57,010 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.157e+02 7.118e+02 9.875e+02 1.339e+03 2.675e+03, threshold=1.975e+03, percent-clipped=6.0 2023-06-25 03:25:42,657 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1896216.0, ans=0.125 2023-06-25 03:25:48,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896216.0, ans=0.1 2023-06-25 03:25:48,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1896216.0, ans=0.125 2023-06-25 03:25:53,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1896276.0, ans=0.125 2023-06-25 03:25:54,920 INFO [train.py:996] (2/4) Epoch 11, batch 11100, loss[loss=0.2268, simple_loss=0.2813, pruned_loss=0.08613, over 21776.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2976, pruned_loss=0.07611, over 4259649.90 frames. ], batch size: 112, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:26:28,270 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.43 vs. limit=22.5 2023-06-25 03:26:49,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1896396.0, ans=0.07 2023-06-25 03:27:34,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1896516.0, ans=0.125 2023-06-25 03:27:38,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1896516.0, ans=10.0 2023-06-25 03:27:41,671 INFO [train.py:996] (2/4) Epoch 11, batch 11150, loss[loss=0.227, simple_loss=0.3244, pruned_loss=0.0648, over 21597.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2953, pruned_loss=0.07519, over 4265806.48 frames. 
], batch size: 230, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:28:14,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1896636.0, ans=0.2 2023-06-25 03:28:25,708 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.74 vs. limit=15.0 2023-06-25 03:28:31,505 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 7.186e+02 9.135e+02 1.372e+03 3.865e+03, threshold=1.827e+03, percent-clipped=12.0 2023-06-25 03:29:00,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1896756.0, ans=0.0 2023-06-25 03:29:00,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1896756.0, ans=0.125 2023-06-25 03:29:31,020 INFO [train.py:996] (2/4) Epoch 11, batch 11200, loss[loss=0.2271, simple_loss=0.2893, pruned_loss=0.08247, over 21824.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2948, pruned_loss=0.07471, over 4258433.91 frames. ], batch size: 102, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:29:36,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1896876.0, ans=0.1 2023-06-25 03:30:05,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=22.5 2023-06-25 03:30:18,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1896996.0, ans=0.025 2023-06-25 03:30:32,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.69 vs. limit=15.0 2023-06-25 03:30:48,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.06 vs. limit=15.0 2023-06-25 03:31:19,596 INFO [train.py:996] (2/4) Epoch 11, batch 11250, loss[loss=0.2355, simple_loss=0.3084, pruned_loss=0.08129, over 21466.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2945, pruned_loss=0.07464, over 4256482.40 frames. ], batch size: 194, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:31:37,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.22 vs. limit=12.0 2023-06-25 03:31:55,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1897236.0, ans=0.0 2023-06-25 03:32:04,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1897296.0, ans=0.125 2023-06-25 03:32:07,773 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.121e+02 7.485e+02 1.049e+03 1.491e+03 3.670e+03, threshold=2.098e+03, percent-clipped=11.0 2023-06-25 03:33:07,305 INFO [train.py:996] (2/4) Epoch 11, batch 11300, loss[loss=0.2018, simple_loss=0.2791, pruned_loss=0.06221, over 21314.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2952, pruned_loss=0.07477, over 4268837.40 frames. ], batch size: 159, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:34:54,360 INFO [train.py:996] (2/4) Epoch 11, batch 11350, loss[loss=0.1886, simple_loss=0.274, pruned_loss=0.05157, over 21618.00 frames. 
], tot_loss[loss=0.2227, simple_loss=0.2965, pruned_loss=0.07442, over 4267025.62 frames. ], batch size: 263, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:35:25,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1897836.0, ans=0.125 2023-06-25 03:35:37,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1897836.0, ans=0.125 2023-06-25 03:35:47,356 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.585e+02 7.865e+02 1.156e+03 1.769e+03 3.739e+03, threshold=2.312e+03, percent-clipped=14.0 2023-06-25 03:36:04,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1897956.0, ans=0.2 2023-06-25 03:36:09,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1897956.0, ans=0.0 2023-06-25 03:36:51,931 INFO [train.py:996] (2/4) Epoch 11, batch 11400, loss[loss=0.242, simple_loss=0.328, pruned_loss=0.07804, over 21708.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.303, pruned_loss=0.07724, over 4270335.06 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:37:10,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1898076.0, ans=0.0 2023-06-25 03:37:25,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1898136.0, ans=0.125 2023-06-25 03:37:29,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1898196.0, ans=0.125 2023-06-25 03:37:36,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1898196.0, ans=0.09899494936611666 2023-06-25 03:38:05,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1898256.0, ans=0.0 2023-06-25 03:38:16,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1898316.0, ans=0.0 2023-06-25 03:38:31,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1898316.0, ans=0.0 2023-06-25 03:38:38,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1898376.0, ans=0.05 2023-06-25 03:38:39,472 INFO [train.py:996] (2/4) Epoch 11, batch 11450, loss[loss=0.2324, simple_loss=0.323, pruned_loss=0.07091, over 21587.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3045, pruned_loss=0.07683, over 4271165.34 frames. ], batch size: 414, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:39:09,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1898436.0, ans=0.0 2023-06-25 03:39:33,540 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.188e+02 7.976e+02 1.094e+03 1.671e+03 3.367e+03, threshold=2.188e+03, percent-clipped=9.0 2023-06-25 03:39:45,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1898496.0, ans=0.2 2023-06-25 03:40:29,735 INFO [train.py:996] (2/4) Epoch 11, batch 11500, loss[loss=0.2178, simple_loss=0.3191, pruned_loss=0.05826, over 21857.00 frames. 
], tot_loss[loss=0.2309, simple_loss=0.3067, pruned_loss=0.07755, over 4268358.66 frames. ], batch size: 371, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:42:25,767 INFO [train.py:996] (2/4) Epoch 11, batch 11550, loss[loss=0.1957, simple_loss=0.2554, pruned_loss=0.06803, over 20739.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3108, pruned_loss=0.0773, over 4261951.60 frames. ], batch size: 608, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 03:43:21,050 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.334e+02 7.903e+02 1.066e+03 1.850e+03 4.952e+03, threshold=2.132e+03, percent-clipped=19.0 2023-06-25 03:43:28,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1899096.0, ans=0.125 2023-06-25 03:43:32,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1899096.0, ans=0.2 2023-06-25 03:44:16,540 INFO [train.py:996] (2/4) Epoch 11, batch 11600, loss[loss=0.2609, simple_loss=0.3488, pruned_loss=0.08649, over 21422.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3259, pruned_loss=0.07918, over 4255797.86 frames. ], batch size: 131, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:45:14,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=1899396.0, ans=6.0 2023-06-25 03:45:48,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1899516.0, ans=0.1 2023-06-25 03:46:03,266 INFO [train.py:996] (2/4) Epoch 11, batch 11650, loss[loss=0.2356, simple_loss=0.324, pruned_loss=0.07365, over 21265.00 frames. ], tot_loss[loss=0.2467, simple_loss=0.3327, pruned_loss=0.08032, over 4257642.68 frames. ], batch size: 549, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:46:13,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-25 03:46:26,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1899576.0, ans=0.125 2023-06-25 03:46:43,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1899636.0, ans=0.0 2023-06-25 03:47:01,627 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.300e+02 9.276e+02 1.301e+03 2.293e+03 3.963e+03, threshold=2.603e+03, percent-clipped=26.0 2023-06-25 03:47:02,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1899696.0, ans=0.0 2023-06-25 03:47:28,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1899756.0, ans=0.2 2023-06-25 03:47:47,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1899816.0, ans=0.125 2023-06-25 03:47:55,803 INFO [train.py:996] (2/4) Epoch 11, batch 11700, loss[loss=0.206, simple_loss=0.2697, pruned_loss=0.07113, over 21842.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3228, pruned_loss=0.07973, over 4252917.43 frames. 
], batch size: 373, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:47:58,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1899876.0, ans=0.2 2023-06-25 03:48:06,506 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-25 03:49:34,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1900176.0, ans=0.0 2023-06-25 03:49:42,866 INFO [train.py:996] (2/4) Epoch 11, batch 11750, loss[loss=0.2219, simple_loss=0.2842, pruned_loss=0.07976, over 21827.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3142, pruned_loss=0.07952, over 4249423.35 frames. ], batch size: 98, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:49:54,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1900176.0, ans=0.0 2023-06-25 03:50:35,178 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.87 vs. limit=15.0 2023-06-25 03:50:36,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.631e+02 8.041e+02 1.029e+03 1.302e+03 3.025e+03, threshold=2.058e+03, percent-clipped=2.0 2023-06-25 03:50:52,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1900356.0, ans=0.1 2023-06-25 03:50:52,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1900356.0, ans=0.125 2023-06-25 03:51:03,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1900356.0, ans=0.1 2023-06-25 03:51:31,735 INFO [train.py:996] (2/4) Epoch 11, batch 11800, loss[loss=0.2292, simple_loss=0.3113, pruned_loss=0.0735, over 21701.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3161, pruned_loss=0.08154, over 4255228.88 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:52:41,686 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1900656.0, ans=0.125 2023-06-25 03:53:17,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1900716.0, ans=0.125 2023-06-25 03:53:18,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1900776.0, ans=0.125 2023-06-25 03:53:19,741 INFO [train.py:996] (2/4) Epoch 11, batch 11850, loss[loss=0.2623, simple_loss=0.3537, pruned_loss=0.08541, over 21700.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3161, pruned_loss=0.08003, over 4262845.58 frames. ], batch size: 389, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:53:34,291 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.70 vs. 
limit=22.5 2023-06-25 03:53:53,240 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1900836.0, ans=0.0 2023-06-25 03:53:58,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1900836.0, ans=0.125 2023-06-25 03:54:05,588 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.44 vs. limit=15.0 2023-06-25 03:54:17,574 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.516e+02 7.061e+02 9.969e+02 1.583e+03 3.889e+03, threshold=1.994e+03, percent-clipped=10.0 2023-06-25 03:54:25,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.55 vs. limit=15.0 2023-06-25 03:55:15,577 INFO [train.py:996] (2/4) Epoch 11, batch 11900, loss[loss=0.2129, simple_loss=0.2954, pruned_loss=0.06521, over 21662.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3172, pruned_loss=0.07788, over 4267317.24 frames. ], batch size: 298, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:55:33,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=1901076.0, ans=10.0 2023-06-25 03:56:11,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1901196.0, ans=0.1 2023-06-25 03:56:19,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1901256.0, ans=0.125 2023-06-25 03:56:40,172 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.76 vs. limit=5.0 2023-06-25 03:56:55,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1901316.0, ans=0.125 2023-06-25 03:56:58,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1901316.0, ans=0.2 2023-06-25 03:57:09,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1901376.0, ans=0.0 2023-06-25 03:57:11,037 INFO [train.py:996] (2/4) Epoch 11, batch 11950, loss[loss=0.2609, simple_loss=0.3636, pruned_loss=0.07913, over 21598.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3168, pruned_loss=0.07438, over 4258447.62 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 16.0 2023-06-25 03:57:23,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1901376.0, ans=0.125 2023-06-25 03:57:48,593 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.18 vs. 
limit=15.0 2023-06-25 03:57:56,077 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.868e+02 8.366e+02 1.305e+03 1.891e+03 4.761e+03, threshold=2.610e+03, percent-clipped=24.0 2023-06-25 03:58:15,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1901556.0, ans=0.0 2023-06-25 03:58:20,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1901556.0, ans=0.1 2023-06-25 03:58:53,116 INFO [train.py:996] (2/4) Epoch 11, batch 12000, loss[loss=0.2319, simple_loss=0.2983, pruned_loss=0.08273, over 21800.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3109, pruned_loss=0.07258, over 4264063.62 frames. ], batch size: 352, lr: 2.67e-03, grad_scale: 32.0 2023-06-25 03:58:53,116 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 03:59:03,805 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.0.self_attn_weights, attn_weights_entropy = tensor([4.6847, 4.6729, 4.3596, 4.3013], device='cuda:2') 2023-06-25 03:59:11,386 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2587, simple_loss=0.3514, pruned_loss=0.08303, over 1796401.00 frames. 2023-06-25 03:59:11,387 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 03:59:30,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1901676.0, ans=0.125 2023-06-25 03:59:46,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1901736.0, ans=0.1 2023-06-25 03:59:53,740 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.63 vs. limit=6.0 2023-06-25 04:00:50,830 INFO [train.py:996] (2/4) Epoch 11, batch 12050, loss[loss=0.2619, simple_loss=0.327, pruned_loss=0.09838, over 21812.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3099, pruned_loss=0.07527, over 4266944.84 frames. 
], batch size: 391, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 04:00:52,673 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:00:54,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1901976.0, ans=0.125 2023-06-25 04:01:12,938 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1902036.0, ans=0.125 2023-06-25 04:01:19,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1902036.0, ans=0.125 2023-06-25 04:01:21,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1902036.0, ans=0.0 2023-06-25 04:01:44,069 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.395e+02 7.721e+02 1.099e+03 1.708e+03 2.830e+03, threshold=2.199e+03, percent-clipped=2.0 2023-06-25 04:02:05,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1902156.0, ans=0.1 2023-06-25 04:02:13,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1902156.0, ans=0.125 2023-06-25 04:02:16,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1902156.0, ans=0.125 2023-06-25 04:02:41,821 INFO [train.py:996] (2/4) Epoch 11, batch 12100, loss[loss=0.2727, simple_loss=0.3392, pruned_loss=0.1031, over 21820.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3146, pruned_loss=0.07981, over 4272316.23 frames. ], batch size: 441, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 04:02:44,394 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.38 vs. limit=10.0 2023-06-25 04:03:00,115 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1902336.0, ans=0.1 2023-06-25 04:03:14,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1902336.0, ans=0.125 2023-06-25 04:03:40,376 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:03:40,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1902396.0, ans=0.2 2023-06-25 04:04:25,625 INFO [train.py:996] (2/4) Epoch 11, batch 12150, loss[loss=0.2555, simple_loss=0.3798, pruned_loss=0.0656, over 19767.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3184, pruned_loss=0.07966, over 4269933.41 frames. ], batch size: 702, lr: 2.67e-03, grad_scale: 8.0 2023-06-25 04:04:57,994 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=20.09 vs. 
limit=22.5 2023-06-25 04:05:28,513 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.672e+02 1.025e+03 1.712e+03 2.364e+03 4.484e+03, threshold=3.424e+03, percent-clipped=30.0 2023-06-25 04:05:40,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1902756.0, ans=0.125 2023-06-25 04:06:12,842 INFO [train.py:996] (2/4) Epoch 11, batch 12200, loss[loss=0.2313, simple_loss=0.2941, pruned_loss=0.08425, over 21803.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3163, pruned_loss=0.07892, over 4268620.39 frames. ], batch size: 352, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:06:30,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1902876.0, ans=0.2 2023-06-25 04:06:38,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=15.0 2023-06-25 04:07:03,302 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-25 04:07:18,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1903056.0, ans=0.0 2023-06-25 04:07:58,635 INFO [train.py:996] (2/4) Epoch 11, batch 12250, loss[loss=0.1716, simple_loss=0.2617, pruned_loss=0.04069, over 21701.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3076, pruned_loss=0.07569, over 4268123.49 frames. ], batch size: 332, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:08:30,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=1903236.0, ans=15.0 2023-06-25 04:08:58,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.12 vs. limit=6.0 2023-06-25 04:08:59,363 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.457e+02 7.371e+02 1.190e+03 1.577e+03 4.141e+03, threshold=2.380e+03, percent-clipped=2.0 2023-06-25 04:09:04,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1903356.0, ans=0.125 2023-06-25 04:09:11,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1903356.0, ans=0.125 2023-06-25 04:09:31,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1903416.0, ans=0.0 2023-06-25 04:09:44,645 INFO [train.py:996] (2/4) Epoch 11, batch 12300, loss[loss=0.2569, simple_loss=0.3494, pruned_loss=0.08224, over 21672.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.3003, pruned_loss=0.0703, over 4270392.10 frames. ], batch size: 414, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:10:40,271 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.36 vs. 
limit=15.0 2023-06-25 04:11:04,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1903656.0, ans=0.125 2023-06-25 04:11:23,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1903716.0, ans=0.125 2023-06-25 04:11:30,622 INFO [train.py:996] (2/4) Epoch 11, batch 12350, loss[loss=0.2688, simple_loss=0.3366, pruned_loss=0.1005, over 21919.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3035, pruned_loss=0.07035, over 4271049.70 frames. ], batch size: 316, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:11:39,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1903776.0, ans=0.0 2023-06-25 04:12:29,919 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:12:31,423 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.609e+02 8.670e+02 1.217e+03 1.964e+03 4.834e+03, threshold=2.433e+03, percent-clipped=16.0 2023-06-25 04:13:16,310 INFO [train.py:996] (2/4) Epoch 11, batch 12400, loss[loss=0.2274, simple_loss=0.295, pruned_loss=0.07995, over 21411.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.3053, pruned_loss=0.07394, over 4277325.32 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:13:42,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1904136.0, ans=0.125 2023-06-25 04:14:21,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1904256.0, ans=0.125 2023-06-25 04:15:07,762 INFO [train.py:996] (2/4) Epoch 11, batch 12450, loss[loss=0.2746, simple_loss=0.3437, pruned_loss=0.1028, over 21608.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.31, pruned_loss=0.0772, over 4281798.26 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:15:25,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1904376.0, ans=0.125 2023-06-25 04:15:35,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-25 04:16:10,960 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.996e+02 6.913e+02 8.483e+02 1.165e+03 2.704e+03, threshold=1.697e+03, percent-clipped=3.0 2023-06-25 04:16:17,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1904556.0, ans=0.125 2023-06-25 04:16:34,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1904616.0, ans=0.1 2023-06-25 04:17:03,373 INFO [train.py:996] (2/4) Epoch 11, batch 12500, loss[loss=0.2523, simple_loss=0.3546, pruned_loss=0.07497, over 21594.00 frames. ], tot_loss[loss=0.2407, simple_loss=0.3216, pruned_loss=0.07993, over 4281186.16 frames. 
], batch size: 389, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:17:32,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1904736.0, ans=0.2 2023-06-25 04:18:02,276 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-25 04:18:32,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1904856.0, ans=0.0 2023-06-25 04:19:02,287 INFO [train.py:996] (2/4) Epoch 11, batch 12550, loss[loss=0.2148, simple_loss=0.3084, pruned_loss=0.06062, over 21731.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3245, pruned_loss=0.08188, over 4275235.38 frames. ], batch size: 298, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:19:26,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1905036.0, ans=0.125 2023-06-25 04:20:07,416 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.343e+02 7.503e+02 1.080e+03 1.641e+03 3.839e+03, threshold=2.159e+03, percent-clipped=20.0 2023-06-25 04:20:52,624 INFO [train.py:996] (2/4) Epoch 11, batch 12600, loss[loss=0.202, simple_loss=0.2994, pruned_loss=0.05232, over 21784.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3245, pruned_loss=0.08033, over 4276148.34 frames. ], batch size: 352, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:21:01,676 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1905276.0, ans=0.125 2023-06-25 04:21:01,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1905276.0, ans=0.125 2023-06-25 04:21:33,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1905396.0, ans=0.125 2023-06-25 04:21:56,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1905456.0, ans=0.125 2023-06-25 04:21:56,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1905456.0, ans=0.2 2023-06-25 04:22:06,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1905456.0, ans=0.2 2023-06-25 04:22:33,074 INFO [train.py:996] (2/4) Epoch 11, batch 12650, loss[loss=0.2494, simple_loss=0.3396, pruned_loss=0.07959, over 19880.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3163, pruned_loss=0.07681, over 4272537.04 frames. ], batch size: 702, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:22:44,047 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.12 vs. limit=15.0 2023-06-25 04:23:37,018 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.529e+02 6.454e+02 1.042e+03 1.689e+03 3.142e+03, threshold=2.085e+03, percent-clipped=12.0 2023-06-25 04:23:48,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1905756.0, ans=0.125 2023-06-25 04:24:28,073 INFO [train.py:996] (2/4) Epoch 11, batch 12700, loss[loss=0.2502, simple_loss=0.3186, pruned_loss=0.0909, over 21261.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3151, pruned_loss=0.07934, over 4281514.01 frames. 
], batch size: 176, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:24:49,663 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.46 vs. limit=5.0 2023-06-25 04:26:13,869 INFO [train.py:996] (2/4) Epoch 11, batch 12750, loss[loss=0.2275, simple_loss=0.3102, pruned_loss=0.07239, over 21753.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3161, pruned_loss=0.07983, over 4281954.91 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 04:26:14,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1906176.0, ans=0.125 2023-06-25 04:26:33,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1906236.0, ans=0.035 2023-06-25 04:26:35,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1906236.0, ans=0.125 2023-06-25 04:27:01,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1906296.0, ans=0.125 2023-06-25 04:27:09,556 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.849e+02 1.051e+03 1.343e+03 1.949e+03 4.528e+03, threshold=2.685e+03, percent-clipped=20.0 2023-06-25 04:27:43,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1906416.0, ans=0.0 2023-06-25 04:27:57,963 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.86 vs. limit=15.0 2023-06-25 04:28:00,574 INFO [train.py:996] (2/4) Epoch 11, batch 12800, loss[loss=0.2377, simple_loss=0.3296, pruned_loss=0.07285, over 20756.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3166, pruned_loss=0.08072, over 4279288.74 frames. ], batch size: 607, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:28:48,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1906596.0, ans=0.2 2023-06-25 04:29:50,499 INFO [train.py:996] (2/4) Epoch 11, batch 12850, loss[loss=0.1991, simple_loss=0.2926, pruned_loss=0.0528, over 21611.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3188, pruned_loss=0.08226, over 4277273.76 frames. ], batch size: 230, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:30:42,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1906896.0, ans=0.1 2023-06-25 04:30:53,495 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.548e+02 7.415e+02 1.066e+03 1.369e+03 3.330e+03, threshold=2.132e+03, percent-clipped=6.0 2023-06-25 04:31:26,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1907016.0, ans=0.125 2023-06-25 04:31:28,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1907016.0, ans=0.125 2023-06-25 04:31:43,315 INFO [train.py:996] (2/4) Epoch 11, batch 12900, loss[loss=0.2467, simple_loss=0.3329, pruned_loss=0.08027, over 21714.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3173, pruned_loss=0.07847, over 4280007.40 frames. 
], batch size: 391, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:32:04,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1907136.0, ans=0.2 2023-06-25 04:32:15,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1907136.0, ans=0.0 2023-06-25 04:33:33,457 INFO [train.py:996] (2/4) Epoch 11, batch 12950, loss[loss=0.2683, simple_loss=0.3407, pruned_loss=0.09791, over 21460.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3145, pruned_loss=0.0768, over 4277876.84 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:33:52,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1907436.0, ans=0.2 2023-06-25 04:33:54,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1907436.0, ans=0.2 2023-06-25 04:34:06,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1907436.0, ans=0.1 2023-06-25 04:34:31,948 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 6.927e+02 1.132e+03 1.522e+03 3.743e+03, threshold=2.263e+03, percent-clipped=8.0 2023-06-25 04:34:42,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1907556.0, ans=0.2 2023-06-25 04:34:42,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1907556.0, ans=0.2 2023-06-25 04:35:21,569 INFO [train.py:996] (2/4) Epoch 11, batch 13000, loss[loss=0.1981, simple_loss=0.2839, pruned_loss=0.05613, over 21795.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3137, pruned_loss=0.07741, over 4278216.13 frames. ], batch size: 372, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:35:28,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1907676.0, ans=0.125 2023-06-25 04:35:38,917 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1907736.0, ans=0.2 2023-06-25 04:35:45,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.27 vs. limit=12.0 2023-06-25 04:36:11,129 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1907796.0, ans=0.125 2023-06-25 04:36:15,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.28 vs. limit=22.5 2023-06-25 04:37:07,490 INFO [train.py:996] (2/4) Epoch 11, batch 13050, loss[loss=0.2649, simple_loss=0.322, pruned_loss=0.1039, over 21596.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3088, pruned_loss=0.0754, over 4271952.67 frames. 
], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:37:08,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1907976.0, ans=0.95 2023-06-25 04:37:39,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1908036.0, ans=0.125 2023-06-25 04:37:57,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1908096.0, ans=0.125 2023-06-25 04:38:04,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1.whitening_limit, batch_count=1908096.0, ans=10.0 2023-06-25 04:38:05,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.554e+02 7.302e+02 9.567e+02 1.329e+03 2.389e+03, threshold=1.913e+03, percent-clipped=1.0 2023-06-25 04:38:44,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1908216.0, ans=0.2 2023-06-25 04:38:55,871 INFO [train.py:996] (2/4) Epoch 11, batch 13100, loss[loss=0.2744, simple_loss=0.3524, pruned_loss=0.09823, over 21801.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3094, pruned_loss=0.07459, over 4279038.86 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:40:27,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1908516.0, ans=0.125 2023-06-25 04:40:30,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1908516.0, ans=0.125 2023-06-25 04:40:37,792 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1908516.0, ans=0.0 2023-06-25 04:40:45,494 INFO [train.py:996] (2/4) Epoch 11, batch 13150, loss[loss=0.2756, simple_loss=0.3373, pruned_loss=0.107, over 21391.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3138, pruned_loss=0.07724, over 4277000.61 frames. ], batch size: 471, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:41:06,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1908576.0, ans=0.125 2023-06-25 04:41:08,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1908576.0, ans=0.1 2023-06-25 04:41:14,587 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.96 vs. 
limit=15.0 2023-06-25 04:41:20,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1908636.0, ans=0.0 2023-06-25 04:41:26,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1908636.0, ans=0.2 2023-06-25 04:41:54,999 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.487e+02 7.569e+02 1.234e+03 1.722e+03 3.917e+03, threshold=2.467e+03, percent-clipped=21.0 2023-06-25 04:41:57,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1908696.0, ans=0.2 2023-06-25 04:42:12,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1908756.0, ans=0.125 2023-06-25 04:42:21,965 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1908816.0, ans=0.125 2023-06-25 04:42:34,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1908816.0, ans=0.125 2023-06-25 04:42:36,969 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-25 04:42:46,106 INFO [train.py:996] (2/4) Epoch 11, batch 13200, loss[loss=0.2153, simple_loss=0.2928, pruned_loss=0.06893, over 21814.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3154, pruned_loss=0.07707, over 4271810.25 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 04:42:50,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.48 vs. limit=12.0 2023-06-25 04:43:09,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1908936.0, ans=0.2 2023-06-25 04:43:16,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1908936.0, ans=0.125 2023-06-25 04:44:07,218 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2023-06-25 04:44:31,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1909116.0, ans=0.0 2023-06-25 04:44:34,058 INFO [train.py:996] (2/4) Epoch 11, batch 13250, loss[loss=0.2301, simple_loss=0.3061, pruned_loss=0.07707, over 21724.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3157, pruned_loss=0.07939, over 4272964.93 frames. ], batch size: 247, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:44:39,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=1909176.0, ans=0.125 2023-06-25 04:44:40,158 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.54 vs. 
limit=22.5 2023-06-25 04:45:16,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1909236.0, ans=0.1 2023-06-25 04:45:29,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1909296.0, ans=0.125 2023-06-25 04:45:32,773 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 1.027e+03 1.488e+03 2.200e+03 4.599e+03, threshold=2.975e+03, percent-clipped=16.0 2023-06-25 04:46:21,094 INFO [train.py:996] (2/4) Epoch 11, batch 13300, loss[loss=0.2494, simple_loss=0.3324, pruned_loss=0.08319, over 21780.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3167, pruned_loss=0.07867, over 4272117.42 frames. ], batch size: 332, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:46:59,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1909536.0, ans=0.0 2023-06-25 04:47:08,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.max_positive, batch_count=1909596.0, ans=0.95 2023-06-25 04:48:09,241 INFO [train.py:996] (2/4) Epoch 11, batch 13350, loss[loss=0.2209, simple_loss=0.32, pruned_loss=0.06087, over 20804.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3206, pruned_loss=0.08143, over 4268962.08 frames. ], batch size: 609, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:49:08,198 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.321e+02 8.273e+02 1.155e+03 1.760e+03 3.459e+03, threshold=2.310e+03, percent-clipped=3.0 2023-06-25 04:49:26,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1909956.0, ans=0.125 2023-06-25 04:49:27,392 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.90 vs. limit=10.0 2023-06-25 04:49:35,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1910016.0, ans=0.125 2023-06-25 04:49:52,119 INFO [train.py:996] (2/4) Epoch 11, batch 13400, loss[loss=0.2397, simple_loss=0.3125, pruned_loss=0.08344, over 21444.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3214, pruned_loss=0.08333, over 4276754.61 frames. 
], batch size: 194, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:50:09,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=1910076.0, ans=0.125 2023-06-25 04:50:14,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1910076.0, ans=0.0 2023-06-25 04:50:15,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1910076.0, ans=0.1 2023-06-25 04:50:59,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1910256.0, ans=0.125 2023-06-25 04:51:01,264 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=1910256.0, ans=0.035 2023-06-25 04:51:31,107 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 04:51:35,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.47 vs. limit=15.0 2023-06-25 04:51:39,491 INFO [train.py:996] (2/4) Epoch 11, batch 13450, loss[loss=0.234, simple_loss=0.2992, pruned_loss=0.08434, over 21600.00 frames. ], tot_loss[loss=0.2455, simple_loss=0.3224, pruned_loss=0.08432, over 4271352.54 frames. ], batch size: 263, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:51:55,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1910436.0, ans=0.125 2023-06-25 04:52:08,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1910436.0, ans=0.0 2023-06-25 04:52:36,800 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.628e+02 8.174e+02 1.187e+03 1.780e+03 3.541e+03, threshold=2.373e+03, percent-clipped=13.0 2023-06-25 04:53:14,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1910616.0, ans=0.125 2023-06-25 04:53:26,282 INFO [train.py:996] (2/4) Epoch 11, batch 13500, loss[loss=0.2174, simple_loss=0.2909, pruned_loss=0.072, over 21718.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3155, pruned_loss=0.08152, over 4258051.29 frames. ], batch size: 298, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:53:30,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1910676.0, ans=0.2 2023-06-25 04:53:38,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1910676.0, ans=0.0 2023-06-25 04:54:15,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1910796.0, ans=0.2 2023-06-25 04:55:13,727 INFO [train.py:996] (2/4) Epoch 11, batch 13550, loss[loss=0.255, simple_loss=0.3481, pruned_loss=0.08102, over 21423.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3179, pruned_loss=0.08014, over 4264748.37 frames. 
], batch size: 211, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:55:24,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1910976.0, ans=10.0 2023-06-25 04:55:27,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1910976.0, ans=0.07 2023-06-25 04:56:11,385 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.050e+02 7.777e+02 1.227e+03 1.710e+03 3.921e+03, threshold=2.454e+03, percent-clipped=8.0 2023-06-25 04:56:14,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-25 04:56:25,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.03 vs. limit=12.0 2023-06-25 04:56:34,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1911156.0, ans=0.125 2023-06-25 04:57:01,024 INFO [train.py:996] (2/4) Epoch 11, batch 13600, loss[loss=0.2603, simple_loss=0.3271, pruned_loss=0.09672, over 21792.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3177, pruned_loss=0.08159, over 4269208.70 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 04:57:32,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1911336.0, ans=0.2 2023-06-25 04:57:33,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911336.0, ans=0.1 2023-06-25 04:57:40,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1911396.0, ans=0.125 2023-06-25 04:58:00,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1911396.0, ans=0.1 2023-06-25 04:58:02,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1911456.0, ans=0.0 2023-06-25 04:58:16,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1911456.0, ans=0.125 2023-06-25 04:58:42,926 INFO [train.py:996] (2/4) Epoch 11, batch 13650, loss[loss=0.1979, simple_loss=0.2586, pruned_loss=0.06861, over 21629.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3133, pruned_loss=0.07892, over 4272303.17 frames. ], batch size: 231, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 04:58:45,527 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. limit=15.0 2023-06-25 04:58:46,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1911576.0, ans=0.0 2023-06-25 04:59:48,571 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.419e+02 6.510e+02 1.024e+03 1.563e+03 2.533e+03, threshold=2.048e+03, percent-clipped=2.0 2023-06-25 05:00:35,837 INFO [train.py:996] (2/4) Epoch 11, batch 13700, loss[loss=0.2201, simple_loss=0.2844, pruned_loss=0.07783, over 21449.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3089, pruned_loss=0.07816, over 4277147.71 frames. 
], batch size: 211, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:01:47,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1912056.0, ans=0.0 2023-06-25 05:02:18,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1912116.0, ans=0.125 2023-06-25 05:02:30,025 INFO [train.py:996] (2/4) Epoch 11, batch 13750, loss[loss=0.2455, simple_loss=0.3242, pruned_loss=0.0834, over 21626.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3075, pruned_loss=0.07762, over 4278888.72 frames. ], batch size: 414, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:02:42,622 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=1912176.0, ans=0.05 2023-06-25 05:03:05,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1912236.0, ans=0.1 2023-06-25 05:03:09,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1912296.0, ans=0.125 2023-06-25 05:03:33,617 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.440e+02 9.228e+02 1.294e+03 2.214e+03 4.699e+03, threshold=2.588e+03, percent-clipped=28.0 2023-06-25 05:03:49,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1912356.0, ans=0.0 2023-06-25 05:04:10,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=1912416.0, ans=0.2 2023-06-25 05:04:20,965 INFO [train.py:996] (2/4) Epoch 11, batch 13800, loss[loss=0.1938, simple_loss=0.3109, pruned_loss=0.0383, over 19773.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.31, pruned_loss=0.07568, over 4274050.50 frames. ], batch size: 703, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:04:30,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1912476.0, ans=0.0 2023-06-25 05:04:43,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1912536.0, ans=0.0 2023-06-25 05:05:24,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1912596.0, ans=0.2 2023-06-25 05:05:25,368 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.05 vs. limit=12.0 2023-06-25 05:06:07,141 INFO [train.py:996] (2/4) Epoch 11, batch 13850, loss[loss=0.3449, simple_loss=0.4118, pruned_loss=0.139, over 21480.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3184, pruned_loss=0.07773, over 4273680.39 frames. ], batch size: 507, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:06:09,786 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.51 vs. 
limit=15.0 2023-06-25 05:06:21,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1912776.0, ans=0.125 2023-06-25 05:06:42,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1912836.0, ans=0.2 2023-06-25 05:07:12,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1912956.0, ans=0.0 2023-06-25 05:07:13,526 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.731e+02 7.739e+02 1.067e+03 1.553e+03 4.213e+03, threshold=2.133e+03, percent-clipped=6.0 2023-06-25 05:07:23,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1912956.0, ans=0.125 2023-06-25 05:07:36,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.59 vs. limit=10.0 2023-06-25 05:07:40,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1913016.0, ans=0.0 2023-06-25 05:07:52,241 INFO [train.py:996] (2/4) Epoch 11, batch 13900, loss[loss=0.2584, simple_loss=0.3342, pruned_loss=0.09132, over 21358.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3217, pruned_loss=0.08042, over 4281091.01 frames. ], batch size: 143, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:08:28,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1913136.0, ans=0.07 2023-06-25 05:08:55,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1913196.0, ans=0.0 2023-06-25 05:08:57,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1913196.0, ans=0.125 2023-06-25 05:09:40,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1913316.0, ans=0.1 2023-06-25 05:09:44,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1913376.0, ans=0.0 2023-06-25 05:09:45,235 INFO [train.py:996] (2/4) Epoch 11, batch 13950, loss[loss=0.2319, simple_loss=0.3076, pruned_loss=0.07808, over 21774.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3225, pruned_loss=0.08204, over 4287914.63 frames. ], batch size: 282, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:10:50,567 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.156e+02 8.731e+02 1.156e+03 1.746e+03 2.860e+03, threshold=2.312e+03, percent-clipped=13.0 2023-06-25 05:11:29,091 INFO [train.py:996] (2/4) Epoch 11, batch 14000, loss[loss=0.202, simple_loss=0.2893, pruned_loss=0.05735, over 21418.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3177, pruned_loss=0.07987, over 4278089.55 frames. 
], batch size: 548, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:12:17,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1913796.0, ans=0.0 2023-06-25 05:12:42,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1913856.0, ans=0.0 2023-06-25 05:12:59,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1913916.0, ans=0.0 2023-06-25 05:13:08,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1913916.0, ans=0.1 2023-06-25 05:13:16,596 INFO [train.py:996] (2/4) Epoch 11, batch 14050, loss[loss=0.2089, simple_loss=0.2832, pruned_loss=0.06729, over 21584.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3121, pruned_loss=0.07627, over 4277623.33 frames. ], batch size: 414, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:13:52,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1914036.0, ans=0.2 2023-06-25 05:14:06,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1914096.0, ans=0.025 2023-06-25 05:14:24,605 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.465e+02 7.745e+02 1.137e+03 1.921e+03 3.840e+03, threshold=2.273e+03, percent-clipped=15.0 2023-06-25 05:14:42,288 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=1914216.0, ans=0.0 2023-06-25 05:14:43,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1914216.0, ans=0.125 2023-06-25 05:14:49,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_na.min_abs, batch_count=1914216.0, ans=0.02 2023-06-25 05:15:04,508 INFO [train.py:996] (2/4) Epoch 11, batch 14100, loss[loss=0.2407, simple_loss=0.3022, pruned_loss=0.08962, over 21741.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3062, pruned_loss=0.07665, over 4264712.11 frames. ], batch size: 351, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:15:58,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1914396.0, ans=0.125 2023-06-25 05:16:04,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1914396.0, ans=0.125 2023-06-25 05:16:32,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1914516.0, ans=0.125 2023-06-25 05:16:46,925 INFO [train.py:996] (2/4) Epoch 11, batch 14150, loss[loss=0.2173, simple_loss=0.3113, pruned_loss=0.06171, over 21812.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.309, pruned_loss=0.07703, over 4262002.19 frames. 
], batch size: 118, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:17:40,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1914696.0, ans=0.0 2023-06-25 05:17:51,146 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.783e+02 7.327e+02 9.497e+02 1.308e+03 3.394e+03, threshold=1.899e+03, percent-clipped=3.0 2023-06-25 05:18:02,077 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=7.65 vs. limit=15.0 2023-06-25 05:18:29,895 INFO [train.py:996] (2/4) Epoch 11, batch 14200, loss[loss=0.2463, simple_loss=0.3136, pruned_loss=0.08948, over 21803.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3081, pruned_loss=0.07545, over 4268343.00 frames. ], batch size: 118, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:19:40,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1915056.0, ans=0.1 2023-06-25 05:19:51,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1915116.0, ans=0.1 2023-06-25 05:20:12,970 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:20:13,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1915176.0, ans=0.0 2023-06-25 05:20:14,308 INFO [train.py:996] (2/4) Epoch 11, batch 14250, loss[loss=0.2025, simple_loss=0.274, pruned_loss=0.06547, over 21504.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3028, pruned_loss=0.07553, over 4269433.17 frames. ], batch size: 230, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:21:24,827 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.492e+02 7.052e+02 9.633e+02 1.519e+03 2.693e+03, threshold=1.927e+03, percent-clipped=14.0 2023-06-25 05:21:55,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1915416.0, ans=0.2 2023-06-25 05:22:03,002 INFO [train.py:996] (2/4) Epoch 11, batch 14300, loss[loss=0.3437, simple_loss=0.4328, pruned_loss=0.1273, over 21570.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3029, pruned_loss=0.07415, over 4256664.53 frames. ], batch size: 441, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:22:07,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1915476.0, ans=0.1 2023-06-25 05:22:43,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1915536.0, ans=0.0 2023-06-25 05:22:45,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1915536.0, ans=0.125 2023-06-25 05:23:49,069 INFO [train.py:996] (2/4) Epoch 11, batch 14350, loss[loss=0.2208, simple_loss=0.2927, pruned_loss=0.07447, over 21756.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3086, pruned_loss=0.07573, over 4252181.41 frames. 
], batch size: 112, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:24:26,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1915836.0, ans=0.125 2023-06-25 05:24:37,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1915896.0, ans=0.1 2023-06-25 05:24:56,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.843e+02 8.086e+02 1.263e+03 2.324e+03 6.942e+03, threshold=2.526e+03, percent-clipped=29.0 2023-06-25 05:25:18,304 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1916016.0, ans=0.125 2023-06-25 05:25:25,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1916016.0, ans=0.0 2023-06-25 05:25:34,861 INFO [train.py:996] (2/4) Epoch 11, batch 14400, loss[loss=0.2077, simple_loss=0.2717, pruned_loss=0.07192, over 21300.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3073, pruned_loss=0.0775, over 4265304.83 frames. ], batch size: 160, lr: 2.66e-03, grad_scale: 32.0 2023-06-25 05:26:28,774 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=15.0 2023-06-25 05:27:00,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=1916256.0, ans=0.125 2023-06-25 05:27:12,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1916316.0, ans=0.0 2023-06-25 05:27:29,123 INFO [train.py:996] (2/4) Epoch 11, batch 14450, loss[loss=0.2232, simple_loss=0.2892, pruned_loss=0.07859, over 21255.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3025, pruned_loss=0.07768, over 4266174.37 frames. ], batch size: 159, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:28:20,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=1916496.0, ans=0.05 2023-06-25 05:28:29,145 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:28:30,486 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.134e+02 8.129e+02 1.231e+03 1.648e+03 3.274e+03, threshold=2.462e+03, percent-clipped=7.0 2023-06-25 05:28:37,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1916556.0, ans=0.0 2023-06-25 05:28:37,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1916556.0, ans=0.125 2023-06-25 05:28:57,533 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=13.28 vs. limit=22.5 2023-06-25 05:29:06,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1916676.0, ans=0.0 2023-06-25 05:29:07,844 INFO [train.py:996] (2/4) Epoch 11, batch 14500, loss[loss=0.2086, simple_loss=0.2804, pruned_loss=0.06837, over 21854.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.2991, pruned_loss=0.07719, over 4275630.53 frames. 
], batch size: 107, lr: 2.66e-03, grad_scale: 16.0 2023-06-25 05:29:22,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1916676.0, ans=0.0 2023-06-25 05:30:03,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1916796.0, ans=0.125 2023-06-25 05:30:10,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1916796.0, ans=0.0 2023-06-25 05:30:34,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1916856.0, ans=0.125 2023-06-25 05:30:37,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1916916.0, ans=0.125 2023-06-25 05:30:44,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=1916916.0, ans=0.125 2023-06-25 05:30:59,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1916916.0, ans=0.0 2023-06-25 05:31:01,905 INFO [train.py:996] (2/4) Epoch 11, batch 14550, loss[loss=0.2942, simple_loss=0.3668, pruned_loss=0.1108, over 21779.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3036, pruned_loss=0.07849, over 4275625.23 frames. ], batch size: 124, lr: 2.66e-03, grad_scale: 8.0 2023-06-25 05:31:30,056 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 05:31:50,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1917096.0, ans=0.0 2023-06-25 05:32:13,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.818e+02 8.368e+02 1.257e+03 1.782e+03 3.337e+03, threshold=2.514e+03, percent-clipped=4.0 2023-06-25 05:32:38,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1917216.0, ans=0.125 2023-06-25 05:32:38,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1917216.0, ans=0.0 2023-06-25 05:32:56,082 INFO [train.py:996] (2/4) Epoch 11, batch 14600, loss[loss=0.2479, simple_loss=0.3319, pruned_loss=0.0819, over 21813.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3113, pruned_loss=0.08204, over 4276272.33 frames. ], batch size: 124, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:33:26,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1917336.0, ans=0.125 2023-06-25 05:33:29,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1917336.0, ans=0.125 2023-06-25 05:33:50,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1917396.0, ans=0.125 2023-06-25 05:34:23,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1917516.0, ans=0.0 2023-06-25 05:34:44,105 INFO [train.py:996] (2/4) Epoch 11, batch 14650, loss[loss=0.1915, simple_loss=0.2591, pruned_loss=0.06196, over 20758.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3121, pruned_loss=0.08023, over 4279679.28 frames. 
], batch size: 608, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:34:56,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1917576.0, ans=0.2 2023-06-25 05:35:48,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1917756.0, ans=0.2 2023-06-25 05:35:50,873 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.954e+02 8.792e+02 1.262e+03 1.854e+03 3.152e+03, threshold=2.525e+03, percent-clipped=6.0 2023-06-25 05:36:03,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1917816.0, ans=0.2 2023-06-25 05:36:33,369 INFO [train.py:996] (2/4) Epoch 11, batch 14700, loss[loss=0.2573, simple_loss=0.3468, pruned_loss=0.0839, over 21279.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3084, pruned_loss=0.07549, over 4276550.29 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:36:36,027 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.15 vs. limit=22.5 2023-06-25 05:37:00,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1917936.0, ans=0.125 2023-06-25 05:37:31,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=1917996.0, ans=0.125 2023-06-25 05:38:09,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1918116.0, ans=0.2 2023-06-25 05:38:17,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1918116.0, ans=0.125 2023-06-25 05:38:19,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1918116.0, ans=0.0 2023-06-25 05:38:22,502 INFO [train.py:996] (2/4) Epoch 11, batch 14750, loss[loss=0.2617, simple_loss=0.3364, pruned_loss=0.09348, over 21589.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3145, pruned_loss=0.07882, over 4278652.01 frames. ], batch size: 389, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:39:07,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1918236.0, ans=0.2 2023-06-25 05:39:42,614 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.929e+02 7.909e+02 1.142e+03 1.631e+03 3.263e+03, threshold=2.283e+03, percent-clipped=2.0 2023-06-25 05:40:17,891 INFO [train.py:996] (2/4) Epoch 11, batch 14800, loss[loss=0.2336, simple_loss=0.2953, pruned_loss=0.086, over 21812.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.328, pruned_loss=0.08546, over 4280547.98 frames. ], batch size: 107, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:40:56,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1918596.0, ans=0.125 2023-06-25 05:41:01,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1918596.0, ans=0.125 2023-06-25 05:42:13,908 INFO [train.py:996] (2/4) Epoch 11, batch 14850, loss[loss=0.2149, simple_loss=0.276, pruned_loss=0.07687, over 21787.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3206, pruned_loss=0.08428, over 4278609.33 frames. 
], batch size: 124, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:42:17,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1918776.0, ans=0.1 2023-06-25 05:42:23,811 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-25 05:42:31,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1918836.0, ans=0.0 2023-06-25 05:42:53,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1918836.0, ans=0.125 2023-06-25 05:42:58,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1918896.0, ans=0.125 2023-06-25 05:43:13,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.12 vs. limit=15.0 2023-06-25 05:43:25,169 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.029e+02 9.409e+02 1.250e+03 2.186e+03 4.588e+03, threshold=2.500e+03, percent-clipped=20.0 2023-06-25 05:43:53,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1919016.0, ans=15.0 2023-06-25 05:44:02,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1919076.0, ans=0.125 2023-06-25 05:44:03,304 INFO [train.py:996] (2/4) Epoch 11, batch 14900, loss[loss=0.2516, simple_loss=0.3188, pruned_loss=0.09217, over 21363.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3238, pruned_loss=0.08538, over 4278071.01 frames. ], batch size: 176, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:44:07,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1919076.0, ans=0.125 2023-06-25 05:44:07,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.41 vs. limit=12.0 2023-06-25 05:44:14,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1919076.0, ans=0.125 2023-06-25 05:44:25,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1919136.0, ans=0.0 2023-06-25 05:45:05,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1919196.0, ans=0.1 2023-06-25 05:45:29,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1919256.0, ans=0.2 2023-06-25 05:45:50,674 INFO [train.py:996] (2/4) Epoch 11, batch 14950, loss[loss=0.2227, simple_loss=0.3074, pruned_loss=0.06896, over 21801.00 frames. ], tot_loss[loss=0.2468, simple_loss=0.3243, pruned_loss=0.08465, over 4279162.74 frames. 
], batch size: 282, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:46:32,748 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1919436.0, ans=0.1 2023-06-25 05:47:03,172 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.346e+02 1.154e+03 1.605e+03 2.804e+03, threshold=2.309e+03, percent-clipped=2.0 2023-06-25 05:47:17,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1919556.0, ans=0.125 2023-06-25 05:47:21,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1919616.0, ans=0.0 2023-06-25 05:47:39,727 INFO [train.py:996] (2/4) Epoch 11, batch 15000, loss[loss=0.2465, simple_loss=0.3117, pruned_loss=0.09067, over 21298.00 frames. ], tot_loss[loss=0.2484, simple_loss=0.3262, pruned_loss=0.08534, over 4277662.95 frames. ], batch size: 159, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:47:39,727 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 05:48:02,332 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2537, simple_loss=0.3474, pruned_loss=0.08002, over 1796401.00 frames. 2023-06-25 05:48:02,332 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 05:48:07,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1919676.0, ans=0.125 2023-06-25 05:49:01,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1919796.0, ans=0.125 2023-06-25 05:49:04,396 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1919856.0, ans=0.125 2023-06-25 05:49:12,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1919856.0, ans=0.125 2023-06-25 05:49:50,628 INFO [train.py:996] (2/4) Epoch 11, batch 15050, loss[loss=0.2911, simple_loss=0.3837, pruned_loss=0.0993, over 21537.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3264, pruned_loss=0.08618, over 4273133.40 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:50:04,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1919976.0, ans=0.1 2023-06-25 05:50:27,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.29 vs. limit=6.0 2023-06-25 05:50:57,303 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.512e+02 8.828e+02 1.154e+03 1.761e+03 2.876e+03, threshold=2.308e+03, percent-clipped=7.0 2023-06-25 05:51:35,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.12 vs. limit=15.0 2023-06-25 05:51:38,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1920276.0, ans=0.125 2023-06-25 05:51:39,391 INFO [train.py:996] (2/4) Epoch 11, batch 15100, loss[loss=0.2122, simple_loss=0.288, pruned_loss=0.06817, over 20837.00 frames. ], tot_loss[loss=0.2477, simple_loss=0.3263, pruned_loss=0.08455, over 4269173.02 frames. 
], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:51:55,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=22.5 2023-06-25 05:53:00,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1920456.0, ans=0.125 2023-06-25 05:53:13,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1920516.0, ans=0.0 2023-06-25 05:53:13,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=22.5 2023-06-25 05:53:14,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1920516.0, ans=0.125 2023-06-25 05:53:29,034 INFO [train.py:996] (2/4) Epoch 11, batch 15150, loss[loss=0.1991, simple_loss=0.2629, pruned_loss=0.0676, over 21323.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3223, pruned_loss=0.08468, over 4272696.12 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 05:53:37,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1920576.0, ans=0.125 2023-06-25 05:54:09,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.82 vs. limit=22.5 2023-06-25 05:54:36,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1920756.0, ans=0.0 2023-06-25 05:54:44,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.398e+02 9.134e+02 1.396e+03 2.248e+03 4.445e+03, threshold=2.791e+03, percent-clipped=24.0 2023-06-25 05:55:03,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.46 vs. limit=15.0 2023-06-25 05:55:12,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1920816.0, ans=0.125 2023-06-25 05:55:18,867 INFO [train.py:996] (2/4) Epoch 11, batch 15200, loss[loss=0.2058, simple_loss=0.289, pruned_loss=0.06128, over 21727.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3131, pruned_loss=0.08056, over 4271127.21 frames. ], batch size: 282, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:55:28,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.00 vs. limit=15.0 2023-06-25 05:56:09,177 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.41 vs. limit=15.0 2023-06-25 05:56:13,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.92 vs. 
limit=15.0 2023-06-25 05:56:29,956 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1921056.0, ans=0.0 2023-06-25 05:56:31,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1921056.0, ans=0.0 2023-06-25 05:57:06,645 INFO [train.py:996] (2/4) Epoch 11, batch 15250, loss[loss=0.2172, simple_loss=0.2865, pruned_loss=0.074, over 21521.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3077, pruned_loss=0.07939, over 4257813.12 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:58:10,686 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.73 vs. limit=6.0 2023-06-25 05:58:13,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1921356.0, ans=0.1 2023-06-25 05:58:19,446 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.800e+02 7.712e+02 1.026e+03 1.486e+03 3.458e+03, threshold=2.053e+03, percent-clipped=2.0 2023-06-25 05:58:53,093 INFO [train.py:996] (2/4) Epoch 11, batch 15300, loss[loss=0.238, simple_loss=0.3188, pruned_loss=0.07858, over 21675.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3094, pruned_loss=0.08134, over 4266153.76 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 05:59:03,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1921476.0, ans=0.125 2023-06-25 05:59:21,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1921536.0, ans=0.125 2023-06-25 05:59:57,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.33 vs. limit=10.0 2023-06-25 06:00:28,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1921716.0, ans=0.125 2023-06-25 06:00:48,338 INFO [train.py:996] (2/4) Epoch 11, batch 15350, loss[loss=0.248, simple_loss=0.3319, pruned_loss=0.08203, over 21634.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3136, pruned_loss=0.08362, over 4265433.94 frames. ], batch size: 230, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:01:01,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1921776.0, ans=0.125 2023-06-25 06:01:03,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.45 vs. limit=15.0 2023-06-25 06:01:53,579 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.974e+02 1.016e+03 1.491e+03 3.012e+03, threshold=2.032e+03, percent-clipped=10.0 2023-06-25 06:02:27,018 INFO [train.py:996] (2/4) Epoch 11, batch 15400, loss[loss=0.2473, simple_loss=0.3171, pruned_loss=0.0888, over 21476.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3137, pruned_loss=0.08164, over 4261250.54 frames. ], batch size: 194, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:02:37,547 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.89 vs. 
limit=15.0 2023-06-25 06:02:50,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1922136.0, ans=0.0 2023-06-25 06:03:22,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1922196.0, ans=0.1 2023-06-25 06:03:59,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.16 vs. limit=15.0 2023-06-25 06:04:02,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1922316.0, ans=0.125 2023-06-25 06:04:03,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1922316.0, ans=0.2 2023-06-25 06:04:05,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1922316.0, ans=0.125 2023-06-25 06:04:11,486 INFO [train.py:996] (2/4) Epoch 11, batch 15450, loss[loss=0.2086, simple_loss=0.2884, pruned_loss=0.06446, over 21333.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3123, pruned_loss=0.08159, over 4270476.07 frames. ], batch size: 159, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:05:19,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1922556.0, ans=0.0 2023-06-25 06:05:25,666 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.669e+02 7.354e+02 9.513e+02 1.338e+03 2.588e+03, threshold=1.903e+03, percent-clipped=5.0 2023-06-25 06:05:26,694 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.20 vs. limit=6.0 2023-06-25 06:05:47,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1922616.0, ans=0.125 2023-06-25 06:05:54,579 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.17 vs. limit=15.0 2023-06-25 06:06:02,404 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.60 vs. limit=15.0 2023-06-25 06:06:04,768 INFO [train.py:996] (2/4) Epoch 11, batch 15500, loss[loss=0.2505, simple_loss=0.3223, pruned_loss=0.08939, over 21403.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3154, pruned_loss=0.08143, over 4244894.95 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:06:06,832 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1922676.0, ans=0.125 2023-06-25 06:06:20,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=1922736.0, ans=0.2 2023-06-25 06:06:52,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1922796.0, ans=0.125 2023-06-25 06:07:00,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.65 vs. limit=15.0 2023-06-25 06:07:54,178 INFO [train.py:996] (2/4) Epoch 11, batch 15550, loss[loss=0.1964, simple_loss=0.272, pruned_loss=0.06043, over 21379.00 frames. 
], tot_loss[loss=0.2371, simple_loss=0.3158, pruned_loss=0.07921, over 4245914.94 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:08:12,703 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.25 vs. limit=22.5 2023-06-25 06:08:20,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1923036.0, ans=0.125 2023-06-25 06:08:23,835 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:08:34,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1923036.0, ans=0.0 2023-06-25 06:08:48,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.44 vs. limit=15.0 2023-06-25 06:09:07,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.113e+02 7.965e+02 1.145e+03 1.833e+03 5.244e+03, threshold=2.290e+03, percent-clipped=21.0 2023-06-25 06:09:09,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1923156.0, ans=0.0 2023-06-25 06:09:09,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1923156.0, ans=0.125 2023-06-25 06:09:15,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.36 vs. limit=6.0 2023-06-25 06:09:18,847 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.81 vs. limit=10.0 2023-06-25 06:09:29,036 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1923216.0, ans=0.0 2023-06-25 06:09:40,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1923276.0, ans=0.125 2023-06-25 06:09:41,637 INFO [train.py:996] (2/4) Epoch 11, batch 15600, loss[loss=0.2158, simple_loss=0.3102, pruned_loss=0.06068, over 21255.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3112, pruned_loss=0.07809, over 4242667.36 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:09:42,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1923276.0, ans=0.1 2023-06-25 06:09:57,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.95 vs. limit=10.0 2023-06-25 06:10:10,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1923336.0, ans=0.0 2023-06-25 06:11:17,238 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=15.0 2023-06-25 06:11:26,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1923576.0, ans=0.125 2023-06-25 06:11:33,807 INFO [train.py:996] (2/4) Epoch 11, batch 15650, loss[loss=0.1799, simple_loss=0.2446, pruned_loss=0.05762, over 15438.00 frames. 
], tot_loss[loss=0.2328, simple_loss=0.3093, pruned_loss=0.07812, over 4234771.59 frames. ], batch size: 61, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:11:44,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1923576.0, ans=0.07 2023-06-25 06:12:16,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1923696.0, ans=0.125 2023-06-25 06:12:25,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1923696.0, ans=0.125 2023-06-25 06:12:33,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1923696.0, ans=0.125 2023-06-25 06:12:43,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.231e+02 1.048e+03 1.538e+03 3.677e+03, threshold=2.096e+03, percent-clipped=8.0 2023-06-25 06:12:55,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=1923756.0, ans=0.0 2023-06-25 06:12:59,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1923816.0, ans=0.0 2023-06-25 06:13:04,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.41 vs. limit=10.0 2023-06-25 06:13:22,306 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.51 vs. limit=22.5 2023-06-25 06:13:23,062 INFO [train.py:996] (2/4) Epoch 11, batch 15700, loss[loss=0.2186, simple_loss=0.2804, pruned_loss=0.07839, over 21175.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3045, pruned_loss=0.07703, over 4239881.30 frames. ], batch size: 159, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:14:25,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1924056.0, ans=0.0 2023-06-25 06:14:46,958 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.91 vs. limit=6.0 2023-06-25 06:15:08,058 INFO [train.py:996] (2/4) Epoch 11, batch 15750, loss[loss=0.2185, simple_loss=0.2759, pruned_loss=0.08054, over 21849.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2995, pruned_loss=0.07642, over 4249491.84 frames. ], batch size: 98, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:15:30,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1924236.0, ans=0.09899494936611666 2023-06-25 06:16:19,542 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.466e+02 7.452e+02 1.136e+03 1.633e+03 2.643e+03, threshold=2.272e+03, percent-clipped=11.0 2023-06-25 06:16:19,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1924356.0, ans=0.1 2023-06-25 06:16:55,694 INFO [train.py:996] (2/4) Epoch 11, batch 15800, loss[loss=0.1928, simple_loss=0.2703, pruned_loss=0.05762, over 16524.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.295, pruned_loss=0.07621, over 4231085.87 frames. 
], batch size: 60, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:16:59,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1924476.0, ans=0.1 2023-06-25 06:17:02,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1924476.0, ans=0.1 2023-06-25 06:17:31,162 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1924536.0, ans=0.125 2023-06-25 06:17:34,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1924536.0, ans=0.125 2023-06-25 06:17:56,988 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=22.5 2023-06-25 06:18:01,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1924656.0, ans=0.0 2023-06-25 06:18:45,292 INFO [train.py:996] (2/4) Epoch 11, batch 15850, loss[loss=0.2138, simple_loss=0.2699, pruned_loss=0.07886, over 21518.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.2983, pruned_loss=0.07902, over 4246063.07 frames. ], batch size: 212, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:19:17,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1924836.0, ans=0.0 2023-06-25 06:19:52,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1924956.0, ans=0.1 2023-06-25 06:19:57,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.898e+02 6.760e+02 9.766e+02 1.376e+03 2.542e+03, threshold=1.953e+03, percent-clipped=1.0 2023-06-25 06:19:59,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=1924956.0, ans=10.0 2023-06-25 06:20:26,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1925016.0, ans=0.125 2023-06-25 06:20:34,244 INFO [train.py:996] (2/4) Epoch 11, batch 15900, loss[loss=0.2436, simple_loss=0.3154, pruned_loss=0.08591, over 21399.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2962, pruned_loss=0.07914, over 4242972.77 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:20:36,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1925076.0, ans=0.125 2023-06-25 06:20:52,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1925136.0, ans=0.0 2023-06-25 06:20:58,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1925136.0, ans=0.125 2023-06-25 06:20:59,737 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1925136.0, ans=0.125 2023-06-25 06:21:15,067 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.39 vs. 
limit=10.0 2023-06-25 06:21:28,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1925196.0, ans=0.2 2023-06-25 06:21:43,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1925256.0, ans=0.125 2023-06-25 06:21:53,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1925316.0, ans=0.125 2023-06-25 06:21:53,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1925316.0, ans=0.07 2023-06-25 06:22:22,466 INFO [train.py:996] (2/4) Epoch 11, batch 15950, loss[loss=0.2146, simple_loss=0.3253, pruned_loss=0.05198, over 21247.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.2994, pruned_loss=0.07782, over 4245375.64 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 8.0 2023-06-25 06:22:45,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1925436.0, ans=0.125 2023-06-25 06:22:55,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1925496.0, ans=0.125 2023-06-25 06:23:27,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1925556.0, ans=0.1 2023-06-25 06:23:35,060 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.240e+02 8.208e+02 1.106e+03 1.560e+03 3.108e+03, threshold=2.211e+03, percent-clipped=12.0 2023-06-25 06:24:05,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1925616.0, ans=0.125 2023-06-25 06:24:12,219 INFO [train.py:996] (2/4) Epoch 11, batch 16000, loss[loss=0.2437, simple_loss=0.3399, pruned_loss=0.07379, over 21514.00 frames. ], tot_loss[loss=0.2267, simple_loss=0.301, pruned_loss=0.0762, over 4248240.23 frames. ], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:25:58,383 INFO [train.py:996] (2/4) Epoch 11, batch 16050, loss[loss=0.2229, simple_loss=0.3164, pruned_loss=0.06473, over 21748.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3036, pruned_loss=0.07406, over 4247585.08 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:26:13,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1926036.0, ans=0.1 2023-06-25 06:26:18,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1926036.0, ans=0.125 2023-06-25 06:26:42,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1926096.0, ans=0.125 2023-06-25 06:27:06,120 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.209e+02 1.010e+03 1.605e+03 2.461e+03 5.413e+03, threshold=3.210e+03, percent-clipped=30.0 2023-06-25 06:27:18,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=15.0 2023-06-25 06:27:36,710 INFO [train.py:996] (2/4) Epoch 11, batch 16100, loss[loss=0.2843, simple_loss=0.3471, pruned_loss=0.1107, over 21803.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3091, pruned_loss=0.07631, over 4258807.89 frames. 
], batch size: 112, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:27:51,222 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:28:02,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1926336.0, ans=0.0 2023-06-25 06:28:35,607 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 06:29:17,215 INFO [train.py:996] (2/4) Epoch 11, batch 16150, loss[loss=0.2119, simple_loss=0.2785, pruned_loss=0.0727, over 21583.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3097, pruned_loss=0.07831, over 4267713.94 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:29:45,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.43 vs. limit=15.0 2023-06-25 06:29:56,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1926636.0, ans=0.2 2023-06-25 06:30:14,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1926696.0, ans=0.125 2023-06-25 06:30:40,134 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.105e+02 8.824e+02 1.229e+03 1.712e+03 3.510e+03, threshold=2.459e+03, percent-clipped=5.0 2023-06-25 06:30:40,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1926756.0, ans=0.125 2023-06-25 06:31:16,326 INFO [train.py:996] (2/4) Epoch 11, batch 16200, loss[loss=0.2841, simple_loss=0.3541, pruned_loss=0.1071, over 21267.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3128, pruned_loss=0.07901, over 4270065.34 frames. ], batch size: 143, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:31:35,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1926936.0, ans=0.1 2023-06-25 06:31:50,573 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.65 vs. limit=15.0 2023-06-25 06:32:23,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=1927056.0, ans=0.2 2023-06-25 06:33:02,344 INFO [train.py:996] (2/4) Epoch 11, batch 16250, loss[loss=0.2074, simple_loss=0.2839, pruned_loss=0.0654, over 21454.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3125, pruned_loss=0.0792, over 4272489.67 frames. 
], batch size: 194, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:33:02,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1927176.0, ans=0.125 2023-06-25 06:33:14,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1927176.0, ans=0.125 2023-06-25 06:33:34,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1927236.0, ans=0.07 2023-06-25 06:33:47,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1927296.0, ans=0.125 2023-06-25 06:34:02,171 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1927356.0, ans=0.2 2023-06-25 06:34:05,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1927356.0, ans=0.125 2023-06-25 06:34:11,384 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.609e+02 8.190e+02 1.044e+03 1.433e+03 2.783e+03, threshold=2.088e+03, percent-clipped=4.0 2023-06-25 06:34:26,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1927416.0, ans=0.0 2023-06-25 06:34:49,065 INFO [train.py:996] (2/4) Epoch 11, batch 16300, loss[loss=0.1726, simple_loss=0.2617, pruned_loss=0.04176, over 21428.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3063, pruned_loss=0.07571, over 4268952.81 frames. ], batch size: 211, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:36:22,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.whiten.whitening_limit, batch_count=1927716.0, ans=15.0 2023-06-25 06:36:36,568 INFO [train.py:996] (2/4) Epoch 11, batch 16350, loss[loss=0.3319, simple_loss=0.3995, pruned_loss=0.1322, over 21814.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.307, pruned_loss=0.07692, over 4271897.58 frames. ], batch size: 118, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:37:23,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1927896.0, ans=0.125 2023-06-25 06:37:36,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1927896.0, ans=0.0 2023-06-25 06:37:52,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.557e+02 7.101e+02 1.051e+03 1.461e+03 2.820e+03, threshold=2.102e+03, percent-clipped=5.0 2023-06-25 06:38:23,651 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.14 vs. limit=15.0 2023-06-25 06:38:24,468 INFO [train.py:996] (2/4) Epoch 11, batch 16400, loss[loss=0.2684, simple_loss=0.3366, pruned_loss=0.1001, over 21805.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3098, pruned_loss=0.07798, over 4278720.01 frames. 
], batch size: 414, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:38:26,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1928076.0, ans=0.125 2023-06-25 06:38:33,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1928076.0, ans=0.2 2023-06-25 06:38:58,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1928136.0, ans=0.0 2023-06-25 06:40:03,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1928316.0, ans=0.125 2023-06-25 06:40:04,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1928316.0, ans=0.1 2023-06-25 06:40:04,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1928316.0, ans=0.2 2023-06-25 06:40:09,736 INFO [train.py:996] (2/4) Epoch 11, batch 16450, loss[loss=0.278, simple_loss=0.3544, pruned_loss=0.1008, over 21415.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3084, pruned_loss=0.07769, over 4284273.14 frames. ], batch size: 548, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:40:42,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1928436.0, ans=0.125 2023-06-25 06:40:49,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1928496.0, ans=0.2 2023-06-25 06:40:56,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1928496.0, ans=0.2 2023-06-25 06:41:22,525 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.821e+02 6.919e+02 9.825e+02 1.554e+03 3.786e+03, threshold=1.965e+03, percent-clipped=13.0 2023-06-25 06:41:46,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.79 vs. limit=6.0 2023-06-25 06:41:53,329 INFO [train.py:996] (2/4) Epoch 11, batch 16500, loss[loss=0.2477, simple_loss=0.3389, pruned_loss=0.07826, over 21506.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.307, pruned_loss=0.07825, over 4276552.10 frames. 
], batch size: 471, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:42:20,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1928736.0, ans=0.0 2023-06-25 06:42:36,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1928796.0, ans=0.0 2023-06-25 06:42:55,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1928856.0, ans=0.0 2023-06-25 06:42:58,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1928856.0, ans=0.0 2023-06-25 06:43:01,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1928856.0, ans=0.2 2023-06-25 06:43:17,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1928856.0, ans=0.0 2023-06-25 06:43:44,081 INFO [train.py:996] (2/4) Epoch 11, batch 16550, loss[loss=0.2111, simple_loss=0.2969, pruned_loss=0.06269, over 21752.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3084, pruned_loss=0.077, over 4275130.03 frames. ], batch size: 298, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:44:07,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1929036.0, ans=0.125 2023-06-25 06:44:20,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1929036.0, ans=0.2 2023-06-25 06:44:44,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1929096.0, ans=0.1 2023-06-25 06:44:54,566 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.15 vs. limit=15.0 2023-06-25 06:45:07,339 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.454e+02 9.180e+02 1.462e+03 2.154e+03 5.250e+03, threshold=2.924e+03, percent-clipped=28.0 2023-06-25 06:45:20,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1929216.0, ans=0.2 2023-06-25 06:45:29,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=1929276.0, ans=0.2 2023-06-25 06:45:31,236 INFO [train.py:996] (2/4) Epoch 11, batch 16600, loss[loss=0.2868, simple_loss=0.378, pruned_loss=0.09776, over 21743.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3149, pruned_loss=0.0792, over 4272628.82 frames. ], batch size: 247, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:46:50,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=1929456.0, ans=0.125 2023-06-25 06:47:00,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1929456.0, ans=0.125 2023-06-25 06:47:21,336 INFO [train.py:996] (2/4) Epoch 11, batch 16650, loss[loss=0.2022, simple_loss=0.3282, pruned_loss=0.03815, over 20858.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3255, pruned_loss=0.08177, over 4268370.31 frames. 
], batch size: 608, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:47:44,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.62 vs. limit=15.0 2023-06-25 06:47:44,186 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.32 vs. limit=15.0 2023-06-25 06:48:02,083 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1929636.0, ans=0.125 2023-06-25 06:48:12,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1929696.0, ans=0.2 2023-06-25 06:48:40,395 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.47 vs. limit=15.0 2023-06-25 06:48:48,114 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.799e+02 8.385e+02 1.061e+03 1.516e+03 3.591e+03, threshold=2.122e+03, percent-clipped=0.0 2023-06-25 06:48:48,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1929756.0, ans=0.125 2023-06-25 06:49:18,268 INFO [train.py:996] (2/4) Epoch 11, batch 16700, loss[loss=0.2189, simple_loss=0.2759, pruned_loss=0.081, over 21378.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3252, pruned_loss=0.08262, over 4271955.72 frames. ], batch size: 131, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:50:04,553 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.19 vs. limit=12.0 2023-06-25 06:50:23,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1929996.0, ans=0.1 2023-06-25 06:50:50,820 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.36 vs. limit=15.0 2023-06-25 06:51:18,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.02 vs. limit=6.0 2023-06-25 06:51:19,110 INFO [train.py:996] (2/4) Epoch 11, batch 16750, loss[loss=0.2647, simple_loss=0.3428, pruned_loss=0.0933, over 21848.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3282, pruned_loss=0.08498, over 4273895.27 frames. ], batch size: 282, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:51:33,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1930176.0, ans=0.125 2023-06-25 06:51:58,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1930236.0, ans=0.125 2023-06-25 06:52:12,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=1930296.0, ans=0.0 2023-06-25 06:52:16,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1930296.0, ans=0.0 2023-06-25 06:52:25,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.01 vs. limit=15.0 2023-06-25 06:52:25,655 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.11 vs. 
limit=22.5 2023-06-25 06:52:39,840 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.774e+02 8.238e+02 1.096e+03 1.590e+03 4.377e+03, threshold=2.192e+03, percent-clipped=15.0 2023-06-25 06:53:01,131 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-25 06:53:14,801 INFO [train.py:996] (2/4) Epoch 11, batch 16800, loss[loss=0.2317, simple_loss=0.3131, pruned_loss=0.07515, over 21876.00 frames. ], tot_loss[loss=0.2503, simple_loss=0.3319, pruned_loss=0.08437, over 4273209.89 frames. ], batch size: 351, lr: 2.65e-03, grad_scale: 32.0 2023-06-25 06:53:38,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.02 vs. limit=15.0 2023-06-25 06:53:51,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1930596.0, ans=0.0 2023-06-25 06:54:10,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1930656.0, ans=0.2 2023-06-25 06:54:59,756 INFO [train.py:996] (2/4) Epoch 11, batch 16850, loss[loss=0.2269, simple_loss=0.2986, pruned_loss=0.0776, over 21932.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3288, pruned_loss=0.08464, over 4277236.95 frames. ], batch size: 333, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:55:07,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-25 06:55:25,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1930836.0, ans=0.2 2023-06-25 06:55:26,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1930836.0, ans=0.2 2023-06-25 06:55:59,008 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1930956.0, ans=0.125 2023-06-25 06:56:12,565 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.019e+02 7.924e+02 1.109e+03 1.823e+03 3.367e+03, threshold=2.218e+03, percent-clipped=14.0 2023-06-25 06:56:15,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.29 vs. limit=15.0 2023-06-25 06:56:30,283 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1931016.0, ans=0.125 2023-06-25 06:56:33,660 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1931016.0, ans=0.0 2023-06-25 06:56:40,172 INFO [train.py:996] (2/4) Epoch 11, batch 16900, loss[loss=0.2058, simple_loss=0.276, pruned_loss=0.06785, over 21584.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3211, pruned_loss=0.08291, over 4279091.68 frames. ], batch size: 263, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:56:47,373 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.21 vs. limit=15.0 2023-06-25 06:57:05,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.22 vs. 
limit=15.0 2023-06-25 06:58:23,598 INFO [train.py:996] (2/4) Epoch 11, batch 16950, loss[loss=0.255, simple_loss=0.3081, pruned_loss=0.1009, over 21619.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3128, pruned_loss=0.08047, over 4281224.88 frames. ], batch size: 195, lr: 2.65e-03, grad_scale: 16.0 2023-06-25 06:58:29,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=1931376.0, ans=0.04949747468305833 2023-06-25 06:58:42,478 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.21 vs. limit=15.0 2023-06-25 06:59:41,829 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.539e+02 6.475e+02 7.581e+02 1.089e+03 2.288e+03, threshold=1.516e+03, percent-clipped=2.0 2023-06-25 07:00:09,563 INFO [train.py:996] (2/4) Epoch 11, batch 17000, loss[loss=0.2298, simple_loss=0.3041, pruned_loss=0.07768, over 21841.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3092, pruned_loss=0.08066, over 4289396.74 frames. ], batch size: 124, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:00:16,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1931676.0, ans=0.0 2023-06-25 07:00:17,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.26 vs. limit=15.0 2023-06-25 07:01:00,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1931796.0, ans=0.125 2023-06-25 07:01:21,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1931856.0, ans=0.125 2023-06-25 07:01:22,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1931856.0, ans=0.0 2023-06-25 07:01:25,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1931856.0, ans=0.1 2023-06-25 07:01:56,052 INFO [train.py:996] (2/4) Epoch 11, batch 17050, loss[loss=0.2368, simple_loss=0.326, pruned_loss=0.07382, over 21837.00 frames. ], tot_loss[loss=0.2417, simple_loss=0.3167, pruned_loss=0.08331, over 4288470.65 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:02:55,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1932096.0, ans=0.0 2023-06-25 07:03:22,789 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.436e+02 8.949e+02 1.142e+03 1.744e+03 3.951e+03, threshold=2.284e+03, percent-clipped=33.0 2023-06-25 07:03:42,196 INFO [train.py:996] (2/4) Epoch 11, batch 17100, loss[loss=0.2207, simple_loss=0.2919, pruned_loss=0.07476, over 21878.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3161, pruned_loss=0.0841, over 4289085.12 frames. 
], batch size: 332, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:03:54,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1932276.0, ans=0.125 2023-06-25 07:04:15,784 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:04:51,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1932456.0, ans=0.2 2023-06-25 07:04:52,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1932456.0, ans=0.125 2023-06-25 07:05:24,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.35 vs. limit=12.0 2023-06-25 07:05:29,187 INFO [train.py:996] (2/4) Epoch 11, batch 17150, loss[loss=0.2157, simple_loss=0.3067, pruned_loss=0.06239, over 21709.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3122, pruned_loss=0.08327, over 4297195.04 frames. ], batch size: 414, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:05:50,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1932636.0, ans=0.125 2023-06-25 07:06:05,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.51 vs. limit=15.0 2023-06-25 07:06:43,415 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.68 vs. limit=5.0 2023-06-25 07:06:55,805 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.253e+02 7.399e+02 1.011e+03 1.479e+03 2.669e+03, threshold=2.021e+03, percent-clipped=4.0 2023-06-25 07:07:05,673 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 07:07:16,412 INFO [train.py:996] (2/4) Epoch 11, batch 17200, loss[loss=0.3102, simple_loss=0.3676, pruned_loss=0.1264, over 21345.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3118, pruned_loss=0.08271, over 4292724.07 frames. ], batch size: 507, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:07:34,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1932876.0, ans=0.1 2023-06-25 07:07:40,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1932936.0, ans=0.125 2023-06-25 07:07:42,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1932936.0, ans=0.2 2023-06-25 07:08:21,925 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.08 vs. limit=15.0 2023-06-25 07:08:39,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1933056.0, ans=0.1 2023-06-25 07:08:48,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1933116.0, ans=0.0 2023-06-25 07:09:10,163 INFO [train.py:996] (2/4) Epoch 11, batch 17250, loss[loss=0.2508, simple_loss=0.3282, pruned_loss=0.08676, over 21583.00 frames. 
], tot_loss[loss=0.2424, simple_loss=0.3159, pruned_loss=0.08448, over 4284615.50 frames. ], batch size: 389, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:09:37,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1933236.0, ans=0.125 2023-06-25 07:09:51,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1933236.0, ans=0.0 2023-06-25 07:10:05,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=1933296.0, ans=0.95 2023-06-25 07:10:28,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-25 07:10:31,132 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.220e+02 7.674e+02 1.037e+03 1.511e+03 3.569e+03, threshold=2.074e+03, percent-clipped=11.0 2023-06-25 07:10:53,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1933416.0, ans=0.1 2023-06-25 07:10:56,480 INFO [train.py:996] (2/4) Epoch 11, batch 17300, loss[loss=0.2936, simple_loss=0.353, pruned_loss=0.1171, over 21294.00 frames. ], tot_loss[loss=0.2508, simple_loss=0.3245, pruned_loss=0.08856, over 4281083.83 frames. ], batch size: 176, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:11:14,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.72 vs. limit=22.5 2023-06-25 07:11:44,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1933536.0, ans=0.1 2023-06-25 07:12:28,502 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.631e-02 2023-06-25 07:12:50,929 INFO [train.py:996] (2/4) Epoch 11, batch 17350, loss[loss=0.1973, simple_loss=0.2943, pruned_loss=0.05021, over 19885.00 frames. ], tot_loss[loss=0.2504, simple_loss=0.325, pruned_loss=0.08787, over 4279338.80 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:13:08,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-25 07:13:31,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.45 vs. limit=15.0 2023-06-25 07:14:02,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1933956.0, ans=0.125 2023-06-25 07:14:08,335 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.102e+02 8.837e+02 1.250e+03 1.745e+03 4.253e+03, threshold=2.500e+03, percent-clipped=18.0 2023-06-25 07:14:10,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1934016.0, ans=0.125 2023-06-25 07:14:17,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1934016.0, ans=0.125 2023-06-25 07:14:46,089 INFO [train.py:996] (2/4) Epoch 11, batch 17400, loss[loss=0.2109, simple_loss=0.3014, pruned_loss=0.06023, over 21767.00 frames. 
], tot_loss[loss=0.2446, simple_loss=0.3213, pruned_loss=0.08392, over 4277937.66 frames. ], batch size: 332, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:14:56,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1934076.0, ans=0.2 2023-06-25 07:15:10,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1934136.0, ans=0.0 2023-06-25 07:15:29,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1934196.0, ans=0.04949747468305833 2023-06-25 07:15:54,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1934256.0, ans=0.04949747468305833 2023-06-25 07:16:33,112 INFO [train.py:996] (2/4) Epoch 11, batch 17450, loss[loss=0.179, simple_loss=0.257, pruned_loss=0.05053, over 21433.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3155, pruned_loss=0.08026, over 4278054.35 frames. ], batch size: 194, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:17:16,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.25 vs. limit=15.0 2023-06-25 07:17:22,356 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-25 07:18:04,011 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.432e+02 7.727e+02 1.188e+03 2.165e+03 4.981e+03, threshold=2.376e+03, percent-clipped=19.0 2023-06-25 07:18:12,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1934616.0, ans=10.0 2023-06-25 07:18:14,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1934616.0, ans=0.125 2023-06-25 07:18:15,176 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.60 vs. limit=10.0 2023-06-25 07:18:22,109 INFO [train.py:996] (2/4) Epoch 11, batch 17500, loss[loss=0.2235, simple_loss=0.2937, pruned_loss=0.07669, over 21633.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3111, pruned_loss=0.07734, over 4275223.08 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:19:00,493 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1934796.0, ans=0.1 2023-06-25 07:19:03,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1934796.0, ans=0.2 2023-06-25 07:20:05,461 INFO [train.py:996] (2/4) Epoch 11, batch 17550, loss[loss=0.2139, simple_loss=0.3067, pruned_loss=0.06055, over 21626.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3109, pruned_loss=0.07612, over 4273771.69 frames. 
], batch size: 263, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:20:26,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1935036.0, ans=0.04949747468305833 2023-06-25 07:20:37,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1935036.0, ans=0.2 2023-06-25 07:20:42,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1935096.0, ans=0.0 2023-06-25 07:20:49,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1935096.0, ans=0.125 2023-06-25 07:20:55,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1935096.0, ans=0.125 2023-06-25 07:21:19,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1935156.0, ans=0.125 2023-06-25 07:21:29,551 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.993e+02 7.228e+02 9.267e+02 1.344e+03 3.002e+03, threshold=1.853e+03, percent-clipped=5.0 2023-06-25 07:21:33,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1935216.0, ans=0.2 2023-06-25 07:21:42,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1935216.0, ans=0.125 2023-06-25 07:21:49,265 INFO [train.py:996] (2/4) Epoch 11, batch 17600, loss[loss=0.2456, simple_loss=0.3254, pruned_loss=0.08296, over 21342.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3143, pruned_loss=0.07723, over 4274801.72 frames. ], batch size: 548, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:22:54,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.17 vs. limit=6.0 2023-06-25 07:23:21,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=1935456.0, ans=0.0 2023-06-25 07:23:27,735 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1935516.0, ans=0.0 2023-06-25 07:23:43,425 INFO [train.py:996] (2/4) Epoch 11, batch 17650, loss[loss=0.2162, simple_loss=0.2786, pruned_loss=0.07692, over 21429.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3144, pruned_loss=0.07789, over 4265155.28 frames. ], batch size: 131, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:25:12,524 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.151e+02 8.324e+02 1.406e+03 1.795e+03 4.059e+03, threshold=2.812e+03, percent-clipped=23.0 2023-06-25 07:25:30,939 INFO [train.py:996] (2/4) Epoch 11, batch 17700, loss[loss=0.2664, simple_loss=0.3405, pruned_loss=0.09619, over 21601.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3084, pruned_loss=0.07603, over 4257124.69 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:25:54,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1935936.0, ans=0.125 2023-06-25 07:27:21,737 INFO [train.py:996] (2/4) Epoch 11, batch 17750, loss[loss=0.283, simple_loss=0.3561, pruned_loss=0.1049, over 21412.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3169, pruned_loss=0.08005, over 4265439.85 frames. 
], batch size: 549, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:27:22,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1936176.0, ans=0.0 2023-06-25 07:27:25,910 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1936176.0, ans=0.2 2023-06-25 07:27:27,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1936176.0, ans=0.2 2023-06-25 07:28:10,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1936296.0, ans=0.125 2023-06-25 07:28:16,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.82 vs. limit=15.0 2023-06-25 07:28:50,289 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 6.855e+02 8.351e+02 1.068e+03 2.757e+03, threshold=1.670e+03, percent-clipped=0.0 2023-06-25 07:28:56,505 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:28:57,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1936416.0, ans=0.05 2023-06-25 07:29:05,729 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.61 vs. limit=15.0 2023-06-25 07:29:09,857 INFO [train.py:996] (2/4) Epoch 11, batch 17800, loss[loss=0.2157, simple_loss=0.3084, pruned_loss=0.0615, over 21707.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3164, pruned_loss=0.07952, over 4264752.72 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:29:16,724 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1936476.0, ans=0.125 2023-06-25 07:29:42,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1936536.0, ans=0.0 2023-06-25 07:30:42,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=1936716.0, ans=0.125 2023-06-25 07:30:57,465 INFO [train.py:996] (2/4) Epoch 11, batch 17850, loss[loss=0.2475, simple_loss=0.3161, pruned_loss=0.08941, over 21615.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3192, pruned_loss=0.08107, over 4261435.62 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:32:07,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=1936956.0, ans=0.125 2023-06-25 07:32:16,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1936956.0, ans=0.125 2023-06-25 07:32:22,653 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.524e+02 9.470e+02 1.328e+03 1.940e+03 3.459e+03, threshold=2.655e+03, percent-clipped=37.0 2023-06-25 07:32:39,625 INFO [train.py:996] (2/4) Epoch 11, batch 17900, loss[loss=0.2422, simple_loss=0.3404, pruned_loss=0.07201, over 21745.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3226, pruned_loss=0.08191, over 4266281.99 frames. 
], batch size: 298, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:33:37,413 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1937196.0, ans=0.0 2023-06-25 07:34:08,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1937316.0, ans=0.1 2023-06-25 07:34:15,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1937316.0, ans=0.5 2023-06-25 07:34:41,341 INFO [train.py:996] (2/4) Epoch 11, batch 17950, loss[loss=0.2369, simple_loss=0.3427, pruned_loss=0.06562, over 21209.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3216, pruned_loss=0.07796, over 4266079.69 frames. ], batch size: 548, lr: 2.64e-03, grad_scale: 8.0 2023-06-25 07:35:20,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1937496.0, ans=0.125 2023-06-25 07:35:39,390 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1937556.0, ans=0.125 2023-06-25 07:35:59,046 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.579e+02 7.777e+02 1.188e+03 1.807e+03 3.395e+03, threshold=2.377e+03, percent-clipped=4.0 2023-06-25 07:36:27,820 INFO [train.py:996] (2/4) Epoch 11, batch 18000, loss[loss=0.2484, simple_loss=0.3082, pruned_loss=0.09429, over 21991.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3146, pruned_loss=0.0771, over 4262922.40 frames. ], batch size: 103, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:36:27,820 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 07:36:45,001 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2562, simple_loss=0.3557, pruned_loss=0.07833, over 1796401.00 frames. 2023-06-25 07:36:45,002 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 07:38:33,167 INFO [train.py:996] (2/4) Epoch 11, batch 18050, loss[loss=0.2139, simple_loss=0.2818, pruned_loss=0.07294, over 21610.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3089, pruned_loss=0.07635, over 4262910.29 frames. ], batch size: 415, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:38:49,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.40 vs. limit=10.0 2023-06-25 07:39:13,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1938096.0, ans=0.035 2023-06-25 07:39:59,286 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.782e+02 7.777e+02 1.078e+03 1.586e+03 2.998e+03, threshold=2.156e+03, percent-clipped=7.0 2023-06-25 07:40:21,746 INFO [train.py:996] (2/4) Epoch 11, batch 18100, loss[loss=0.2666, simple_loss=0.3277, pruned_loss=0.1027, over 21818.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3132, pruned_loss=0.07846, over 4264424.66 frames. 
], batch size: 102, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:40:24,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1938276.0, ans=0.125 2023-06-25 07:40:48,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=1938336.0, ans=0.07 2023-06-25 07:41:14,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1938396.0, ans=0.125 2023-06-25 07:42:08,815 INFO [train.py:996] (2/4) Epoch 11, batch 18150, loss[loss=0.2, simple_loss=0.2743, pruned_loss=0.06283, over 21529.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3147, pruned_loss=0.07783, over 4270391.22 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:42:58,513 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-25 07:42:59,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1938696.0, ans=0.125 2023-06-25 07:43:24,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1938756.0, ans=0.1 2023-06-25 07:43:31,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.895e+02 7.455e+02 1.236e+03 1.816e+03 3.616e+03, threshold=2.471e+03, percent-clipped=14.0 2023-06-25 07:43:44,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1938816.0, ans=0.125 2023-06-25 07:43:54,189 INFO [train.py:996] (2/4) Epoch 11, batch 18200, loss[loss=0.2226, simple_loss=0.2908, pruned_loss=0.07719, over 21589.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3087, pruned_loss=0.07808, over 4262989.15 frames. ], batch size: 415, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:44:06,520 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.60 vs. limit=15.0 2023-06-25 07:44:19,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1938936.0, ans=0.2 2023-06-25 07:45:12,397 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1939056.0, ans=0.125 2023-06-25 07:45:27,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=1939116.0, ans=15.0 2023-06-25 07:45:33,098 INFO [train.py:996] (2/4) Epoch 11, batch 18250, loss[loss=0.1649, simple_loss=0.2422, pruned_loss=0.04385, over 21546.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3006, pruned_loss=0.07548, over 4255934.50 frames. ], batch size: 230, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:46:56,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.451e+02 6.639e+02 9.483e+02 1.514e+03 2.544e+03, threshold=1.897e+03, percent-clipped=1.0 2023-06-25 07:47:19,921 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1939476.0, ans=0.04949747468305833 2023-06-25 07:47:21,082 INFO [train.py:996] (2/4) Epoch 11, batch 18300, loss[loss=0.2201, simple_loss=0.3279, pruned_loss=0.05617, over 21647.00 frames. 
], tot_loss[loss=0.2281, simple_loss=0.3027, pruned_loss=0.07671, over 4266414.60 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:47:40,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1939476.0, ans=0.125 2023-06-25 07:47:40,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1939476.0, ans=0.125 2023-06-25 07:47:47,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=1939536.0, ans=0.5 2023-06-25 07:47:59,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1939596.0, ans=0.125 2023-06-25 07:48:21,638 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.81 vs. limit=15.0 2023-06-25 07:49:01,548 INFO [train.py:996] (2/4) Epoch 11, batch 18350, loss[loss=0.2053, simple_loss=0.2827, pruned_loss=0.06398, over 21544.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3077, pruned_loss=0.07612, over 4266887.59 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:49:04,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-25 07:49:30,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1939836.0, ans=0.125 2023-06-25 07:49:37,303 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1939836.0, ans=0.125 2023-06-25 07:49:42,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1939896.0, ans=0.0 2023-06-25 07:50:12,488 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1939956.0, ans=0.1 2023-06-25 07:50:32,427 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.196e+02 8.074e+02 1.390e+03 1.835e+03 4.417e+03, threshold=2.780e+03, percent-clipped=23.0 2023-06-25 07:50:42,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1940016.0, ans=0.1 2023-06-25 07:50:50,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1940016.0, ans=0.2 2023-06-25 07:50:55,372 INFO [train.py:996] (2/4) Epoch 11, batch 18400, loss[loss=0.2374, simple_loss=0.3241, pruned_loss=0.07534, over 21743.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3043, pruned_loss=0.07484, over 4271851.33 frames. ], batch size: 371, lr: 2.64e-03, grad_scale: 32.0 2023-06-25 07:50:59,568 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.91 vs. limit=15.0 2023-06-25 07:51:28,333 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1940196.0, ans=0.125 2023-06-25 07:52:43,119 INFO [train.py:996] (2/4) Epoch 11, batch 18450, loss[loss=0.1739, simple_loss=0.2405, pruned_loss=0.05366, over 21748.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.3007, pruned_loss=0.07097, over 4271715.63 frames. 
], batch size: 112, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:52:56,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.07 vs. limit=22.5 2023-06-25 07:53:15,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1940496.0, ans=0.1 2023-06-25 07:54:04,991 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.624e+02 6.992e+02 1.032e+03 1.619e+03 3.807e+03, threshold=2.064e+03, percent-clipped=5.0 2023-06-25 07:54:22,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=1940616.0, ans=0.0 2023-06-25 07:54:25,067 INFO [train.py:996] (2/4) Epoch 11, batch 18500, loss[loss=0.2554, simple_loss=0.3359, pruned_loss=0.08747, over 21503.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2971, pruned_loss=0.07109, over 4278437.01 frames. ], batch size: 508, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:54:33,325 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 07:55:46,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1940856.0, ans=0.125 2023-06-25 07:56:15,407 INFO [train.py:996] (2/4) Epoch 11, batch 18550, loss[loss=0.2243, simple_loss=0.2917, pruned_loss=0.07851, over 21490.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2939, pruned_loss=0.06984, over 4282917.04 frames. ], batch size: 132, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:56:17,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1940976.0, ans=0.0 2023-06-25 07:56:24,110 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1940976.0, ans=0.125 2023-06-25 07:56:34,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1941036.0, ans=0.2 2023-06-25 07:56:36,561 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1941036.0, ans=0.125 2023-06-25 07:56:53,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1941096.0, ans=0.0 2023-06-25 07:57:19,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1941096.0, ans=0.2 2023-06-25 07:57:49,420 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.217e+02 7.256e+02 1.032e+03 1.520e+03 3.767e+03, threshold=2.064e+03, percent-clipped=11.0 2023-06-25 07:57:52,285 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.57 vs. limit=15.0 2023-06-25 07:58:01,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1941216.0, ans=0.0 2023-06-25 07:58:04,474 INFO [train.py:996] (2/4) Epoch 11, batch 18600, loss[loss=0.3134, simple_loss=0.3803, pruned_loss=0.1233, over 21531.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2932, pruned_loss=0.07113, over 4280209.44 frames. 
], batch size: 509, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 07:58:28,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1941336.0, ans=0.1 2023-06-25 07:59:15,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1941456.0, ans=0.0 2023-06-25 07:59:24,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1941456.0, ans=0.0 2023-06-25 07:59:29,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1941456.0, ans=0.0 2023-06-25 07:59:51,199 INFO [train.py:996] (2/4) Epoch 11, batch 18650, loss[loss=0.2193, simple_loss=0.2879, pruned_loss=0.07532, over 20024.00 frames. ], tot_loss[loss=0.2166, simple_loss=0.2911, pruned_loss=0.07104, over 4265070.61 frames. ], batch size: 703, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:00:02,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1941576.0, ans=0.2 2023-06-25 08:00:25,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.24 vs. limit=10.0 2023-06-25 08:00:37,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.74 vs. limit=6.0 2023-06-25 08:00:41,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1941696.0, ans=0.0 2023-06-25 08:01:21,717 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.720e+02 7.139e+02 9.409e+02 1.577e+03 2.753e+03, threshold=1.882e+03, percent-clipped=11.0 2023-06-25 08:01:30,795 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.15 vs. limit=15.0 2023-06-25 08:01:35,903 INFO [train.py:996] (2/4) Epoch 11, batch 18700, loss[loss=0.2106, simple_loss=0.2649, pruned_loss=0.07819, over 20412.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2885, pruned_loss=0.07167, over 4267557.20 frames. ], batch size: 703, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:01:36,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1941876.0, ans=0.125 2023-06-25 08:02:20,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=1941996.0, ans=0.125 2023-06-25 08:02:53,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1942056.0, ans=0.2 2023-06-25 08:02:55,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1942056.0, ans=0.125 2023-06-25 08:03:12,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1942116.0, ans=0.125 2023-06-25 08:03:18,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1942116.0, ans=0.0 2023-06-25 08:03:24,893 INFO [train.py:996] (2/4) Epoch 11, batch 18750, loss[loss=0.2194, simple_loss=0.2843, pruned_loss=0.07721, over 21496.00 frames. 
], tot_loss[loss=0.2184, simple_loss=0.2895, pruned_loss=0.07362, over 4269857.41 frames. ], batch size: 212, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:04:08,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1942296.0, ans=0.0 2023-06-25 08:04:50,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.888e+02 8.399e+02 1.249e+03 1.994e+03 4.167e+03, threshold=2.497e+03, percent-clipped=25.0 2023-06-25 08:05:11,254 INFO [train.py:996] (2/4) Epoch 11, batch 18800, loss[loss=0.2368, simple_loss=0.3325, pruned_loss=0.07049, over 21661.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2947, pruned_loss=0.07465, over 4260019.92 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 32.0 2023-06-25 08:05:16,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1942476.0, ans=0.125 2023-06-25 08:05:17,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1942476.0, ans=0.0 2023-06-25 08:05:29,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1942536.0, ans=0.125 2023-06-25 08:05:43,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1942596.0, ans=0.0 2023-06-25 08:06:46,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1942716.0, ans=0.1 2023-06-25 08:06:56,510 INFO [train.py:996] (2/4) Epoch 11, batch 18850, loss[loss=0.2269, simple_loss=0.2987, pruned_loss=0.07752, over 21609.00 frames. ], tot_loss[loss=0.216, simple_loss=0.291, pruned_loss=0.07047, over 4252436.41 frames. ], batch size: 391, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:07:26,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1942836.0, ans=0.07 2023-06-25 08:08:13,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.33 vs. limit=15.0 2023-06-25 08:08:19,304 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.66 vs. limit=22.5 2023-06-25 08:08:21,201 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.163e+02 6.140e+02 8.289e+02 1.259e+03 4.459e+03, threshold=1.658e+03, percent-clipped=10.0 2023-06-25 08:08:40,561 INFO [train.py:996] (2/4) Epoch 11, batch 18900, loss[loss=0.2135, simple_loss=0.2661, pruned_loss=0.08044, over 20220.00 frames. ], tot_loss[loss=0.2133, simple_loss=0.287, pruned_loss=0.06977, over 4256108.33 frames. ], batch size: 702, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:08:46,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1943076.0, ans=0.125 2023-06-25 08:09:19,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.68 vs. 
limit=15.0 2023-06-25 08:09:53,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1943256.0, ans=0.125 2023-06-25 08:10:27,764 INFO [train.py:996] (2/4) Epoch 11, batch 18950, loss[loss=0.1872, simple_loss=0.2539, pruned_loss=0.06026, over 21454.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2892, pruned_loss=0.07212, over 4266781.16 frames. ], batch size: 212, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:11:22,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1943496.0, ans=0.125 2023-06-25 08:12:02,076 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.904e+02 8.258e+02 1.054e+03 1.529e+03 3.478e+03, threshold=2.107e+03, percent-clipped=19.0 2023-06-25 08:12:15,295 INFO [train.py:996] (2/4) Epoch 11, batch 19000, loss[loss=0.2821, simple_loss=0.3503, pruned_loss=0.1069, over 21685.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2989, pruned_loss=0.07349, over 4273724.14 frames. ], batch size: 231, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:12:25,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=1943676.0, ans=22.5 2023-06-25 08:12:42,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1943736.0, ans=0.125 2023-06-25 08:13:23,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1943856.0, ans=0.125 2023-06-25 08:14:01,774 INFO [train.py:996] (2/4) Epoch 11, batch 19050, loss[loss=0.2511, simple_loss=0.3085, pruned_loss=0.09681, over 21478.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3042, pruned_loss=0.07773, over 4277485.32 frames. ], batch size: 194, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:14:10,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1943976.0, ans=0.1 2023-06-25 08:15:33,621 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.841e+02 7.658e+02 1.037e+03 1.522e+03 3.485e+03, threshold=2.073e+03, percent-clipped=12.0 2023-06-25 08:15:48,080 INFO [train.py:996] (2/4) Epoch 11, batch 19100, loss[loss=0.2198, simple_loss=0.2805, pruned_loss=0.07949, over 21473.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3028, pruned_loss=0.07889, over 4287165.80 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:15:51,115 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.35 vs. 
limit=15.0 2023-06-25 08:16:22,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1944336.0, ans=0.0 2023-06-25 08:17:20,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1944516.0, ans=0.2 2023-06-25 08:17:22,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff2.min_abs, batch_count=1944516.0, ans=0.1 2023-06-25 08:17:27,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1944516.0, ans=0.0 2023-06-25 08:17:35,893 INFO [train.py:996] (2/4) Epoch 11, batch 19150, loss[loss=0.2949, simple_loss=0.393, pruned_loss=0.09842, over 21667.00 frames. ], tot_loss[loss=0.235, simple_loss=0.309, pruned_loss=0.08056, over 4278455.38 frames. ], batch size: 414, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:17:38,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1944576.0, ans=0.125 2023-06-25 08:18:26,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1944636.0, ans=0.0 2023-06-25 08:18:33,716 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.60 vs. limit=15.0 2023-06-25 08:18:36,450 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1944696.0, ans=0.125 2023-06-25 08:19:03,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.44 vs. limit=22.5 2023-06-25 08:19:14,084 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.349e+02 9.714e+02 1.394e+03 2.160e+03 4.455e+03, threshold=2.788e+03, percent-clipped=28.0 2023-06-25 08:19:23,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1944816.0, ans=0.0 2023-06-25 08:19:26,269 INFO [train.py:996] (2/4) Epoch 11, batch 19200, loss[loss=0.3311, simple_loss=0.418, pruned_loss=0.1221, over 21636.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3205, pruned_loss=0.0824, over 4275950.92 frames. ], batch size: 441, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:20:34,933 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.15 vs. limit=22.5 2023-06-25 08:20:35,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1944996.0, ans=0.125 2023-06-25 08:20:45,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1945056.0, ans=0.1 2023-06-25 08:20:53,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1945116.0, ans=0.0 2023-06-25 08:21:11,479 INFO [train.py:996] (2/4) Epoch 11, batch 19250, loss[loss=0.1988, simple_loss=0.2874, pruned_loss=0.05514, over 21834.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3168, pruned_loss=0.07589, over 4282505.59 frames. 
], batch size: 282, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:21:57,210 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1945236.0, ans=0.125 2023-06-25 08:22:00,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1945296.0, ans=0.1 2023-06-25 08:22:10,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1945296.0, ans=0.0 2023-06-25 08:22:46,274 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.037e+02 6.757e+02 9.006e+02 1.219e+03 2.409e+03, threshold=1.801e+03, percent-clipped=0.0 2023-06-25 08:22:57,449 INFO [train.py:996] (2/4) Epoch 11, batch 19300, loss[loss=0.2414, simple_loss=0.3159, pruned_loss=0.08348, over 21817.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3139, pruned_loss=0.07553, over 4290218.93 frames. ], batch size: 351, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:22:57,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1945476.0, ans=0.1 2023-06-25 08:23:50,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1945596.0, ans=0.125 2023-06-25 08:24:09,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.54 vs. limit=15.0 2023-06-25 08:24:26,296 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.54 vs. limit=8.0 2023-06-25 08:24:37,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.83 vs. limit=15.0 2023-06-25 08:24:52,179 INFO [train.py:996] (2/4) Epoch 11, batch 19350, loss[loss=0.1999, simple_loss=0.2865, pruned_loss=0.05671, over 21646.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3087, pruned_loss=0.0716, over 4280969.01 frames. ], batch size: 263, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:25:50,030 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:25:56,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1945956.0, ans=0.125 2023-06-25 08:26:18,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.662e+02 8.721e+02 1.407e+03 2.132e+03 4.703e+03, threshold=2.815e+03, percent-clipped=33.0 2023-06-25 08:26:36,745 INFO [train.py:996] (2/4) Epoch 11, batch 19400, loss[loss=0.2309, simple_loss=0.3064, pruned_loss=0.07775, over 21417.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3067, pruned_loss=0.07062, over 4285348.05 frames. 
], batch size: 131, lr: 2.64e-03, grad_scale: 16.0 2023-06-25 08:26:41,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1946076.0, ans=0.1 2023-06-25 08:27:13,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1946136.0, ans=0.125 2023-06-25 08:27:20,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=1946196.0, ans=0.0 2023-06-25 08:28:01,045 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.30 vs. limit=15.0 2023-06-25 08:28:21,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1946376.0, ans=0.125 2023-06-25 08:28:22,706 INFO [train.py:996] (2/4) Epoch 11, batch 19450, loss[loss=0.2302, simple_loss=0.2884, pruned_loss=0.08601, over 21593.00 frames. ], tot_loss[loss=0.2251, simple_loss=0.3041, pruned_loss=0.07305, over 4286635.10 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:28:28,247 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1946376.0, ans=0.125 2023-06-25 08:29:53,111 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.635e+02 8.363e+02 1.164e+03 1.702e+03 3.020e+03, threshold=2.327e+03, percent-clipped=5.0 2023-06-25 08:29:55,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1946616.0, ans=0.2 2023-06-25 08:30:04,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1946616.0, ans=0.125 2023-06-25 08:30:08,971 INFO [train.py:996] (2/4) Epoch 11, batch 19500, loss[loss=0.2294, simple_loss=0.3079, pruned_loss=0.07547, over 20721.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2999, pruned_loss=0.07401, over 4284452.31 frames. ], batch size: 607, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:31:09,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1946796.0, ans=0.0 2023-06-25 08:31:55,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1946976.0, ans=0.2 2023-06-25 08:31:57,009 INFO [train.py:996] (2/4) Epoch 11, batch 19550, loss[loss=0.1731, simple_loss=0.2643, pruned_loss=0.04098, over 21825.00 frames. ], tot_loss[loss=0.2196, simple_loss=0.2949, pruned_loss=0.07211, over 4276187.03 frames. ], batch size: 316, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 08:32:48,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.82 vs. limit=15.0 2023-06-25 08:33:31,269 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.735e+02 7.971e+02 1.072e+03 1.636e+03 3.226e+03, threshold=2.144e+03, percent-clipped=9.0 2023-06-25 08:33:41,359 INFO [train.py:996] (2/4) Epoch 11, batch 19600, loss[loss=0.2731, simple_loss=0.3395, pruned_loss=0.1033, over 21481.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2966, pruned_loss=0.07282, over 4280823.92 frames. 
], batch size: 131, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:33:44,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1947276.0, ans=0.0 2023-06-25 08:34:38,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1947396.0, ans=0.125 2023-06-25 08:35:13,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=1947516.0, ans=0.0 2023-06-25 08:35:21,583 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.12 vs. limit=12.0 2023-06-25 08:35:36,924 INFO [train.py:996] (2/4) Epoch 11, batch 19650, loss[loss=0.2716, simple_loss=0.345, pruned_loss=0.09914, over 21845.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3022, pruned_loss=0.07664, over 4277055.03 frames. ], batch size: 118, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:36:26,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=1947696.0, ans=0.125 2023-06-25 08:37:15,093 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.249e+02 7.644e+02 9.843e+02 1.375e+03 3.676e+03, threshold=1.969e+03, percent-clipped=9.0 2023-06-25 08:37:30,359 INFO [train.py:996] (2/4) Epoch 11, batch 19700, loss[loss=0.276, simple_loss=0.3614, pruned_loss=0.09532, over 21513.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3051, pruned_loss=0.07719, over 4278687.30 frames. ], batch size: 508, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:37:38,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1947876.0, ans=0.1 2023-06-25 08:37:59,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1947936.0, ans=0.2 2023-06-25 08:39:12,077 INFO [train.py:996] (2/4) Epoch 11, batch 19750, loss[loss=0.249, simple_loss=0.3327, pruned_loss=0.0827, over 21284.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3156, pruned_loss=0.07907, over 4272292.46 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:39:20,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1948176.0, ans=0.0 2023-06-25 08:39:23,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=1948176.0, ans=0.125 2023-06-25 08:39:32,486 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-25 08:39:34,772 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1948236.0, ans=0.125 2023-06-25 08:39:49,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.00 vs. limit=15.0 2023-06-25 08:40:05,563 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=14.18 vs. 
limit=15.0 2023-06-25 08:40:14,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1948356.0, ans=0.0 2023-06-25 08:40:49,901 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.288e+02 1.013e+03 1.397e+03 2.237e+03 5.539e+03, threshold=2.794e+03, percent-clipped=30.0 2023-06-25 08:40:59,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1948476.0, ans=0.0 2023-06-25 08:41:00,481 INFO [train.py:996] (2/4) Epoch 11, batch 19800, loss[loss=0.2083, simple_loss=0.2776, pruned_loss=0.06954, over 21806.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.314, pruned_loss=0.07963, over 4270595.31 frames. ], batch size: 124, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:41:02,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1948476.0, ans=0.125 2023-06-25 08:41:41,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=1948596.0, ans=0.015 2023-06-25 08:42:09,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1948656.0, ans=0.1 2023-06-25 08:42:47,225 INFO [train.py:996] (2/4) Epoch 11, batch 19850, loss[loss=0.1974, simple_loss=0.2588, pruned_loss=0.06804, over 21411.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3066, pruned_loss=0.07497, over 4271055.31 frames. ], batch size: 131, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:43:08,121 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:43:34,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1948896.0, ans=0.125 2023-06-25 08:43:51,407 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=1948896.0, ans=10.0 2023-06-25 08:44:23,842 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.088e+02 7.609e+02 1.066e+03 1.634e+03 3.345e+03, threshold=2.132e+03, percent-clipped=4.0 2023-06-25 08:44:33,354 INFO [train.py:996] (2/4) Epoch 11, batch 19900, loss[loss=0.2322, simple_loss=0.2967, pruned_loss=0.08388, over 21122.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3078, pruned_loss=0.07289, over 4261892.78 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:45:01,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1949136.0, ans=0.0 2023-06-25 08:45:33,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.04 vs. limit=10.0 2023-06-25 08:46:01,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=1949316.0, ans=0.035 2023-06-25 08:46:16,710 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:46:19,616 INFO [train.py:996] (2/4) Epoch 11, batch 19950, loss[loss=0.2209, simple_loss=0.2845, pruned_loss=0.0787, over 21763.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.3018, pruned_loss=0.07228, over 4258701.70 frames. 
], batch size: 102, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:46:20,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=1949376.0, ans=0.125 2023-06-25 08:46:41,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=1949436.0, ans=0.0 2023-06-25 08:47:30,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1949556.0, ans=0.125 2023-06-25 08:47:31,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1949556.0, ans=0.125 2023-06-25 08:47:33,850 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=2.609e-03 2023-06-25 08:47:35,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1949556.0, ans=0.125 2023-06-25 08:47:38,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1949556.0, ans=0.125 2023-06-25 08:47:49,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1949616.0, ans=0.0 2023-06-25 08:47:53,971 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.809e+02 7.247e+02 1.068e+03 1.569e+03 2.873e+03, threshold=2.135e+03, percent-clipped=11.0 2023-06-25 08:48:03,737 INFO [train.py:996] (2/4) Epoch 11, batch 20000, loss[loss=0.2541, simple_loss=0.3237, pruned_loss=0.09224, over 21751.00 frames. ], tot_loss[loss=0.2233, simple_loss=0.3012, pruned_loss=0.0727, over 4258537.99 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 08:49:08,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1949796.0, ans=0.125 2023-06-25 08:49:17,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1949856.0, ans=0.125 2023-06-25 08:49:26,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1949856.0, ans=0.95 2023-06-25 08:49:42,143 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. limit=10.0 2023-06-25 08:49:45,811 INFO [train.py:996] (2/4) Epoch 11, batch 20050, loss[loss=0.2652, simple_loss=0.3335, pruned_loss=0.09848, over 21728.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3033, pruned_loss=0.07474, over 4272300.13 frames. ], batch size: 389, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 08:51:20,030 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1950216.0, ans=0.0 2023-06-25 08:51:23,079 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.669e+02 8.021e+02 1.064e+03 1.748e+03 3.117e+03, threshold=2.127e+03, percent-clipped=13.0 2023-06-25 08:51:29,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer_ff3.min_abs, batch_count=1950216.0, ans=0.2 2023-06-25 08:51:33,710 INFO [train.py:996] (2/4) Epoch 11, batch 20100, loss[loss=0.2169, simple_loss=0.2886, pruned_loss=0.07261, over 21118.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3041, pruned_loss=0.07589, over 4281220.19 frames. 
], batch size: 143, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:52:43,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.73 vs. limit=10.0 2023-06-25 08:52:47,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1950456.0, ans=0.125 2023-06-25 08:53:11,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1950516.0, ans=0.125 2023-06-25 08:53:21,674 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:53:28,294 INFO [train.py:996] (2/4) Epoch 11, batch 20150, loss[loss=0.2485, simple_loss=0.3307, pruned_loss=0.08316, over 21691.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.314, pruned_loss=0.0797, over 4283575.15 frames. ], batch size: 351, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:53:50,935 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 08:54:26,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1950696.0, ans=0.125 2023-06-25 08:54:57,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1950816.0, ans=0.2 2023-06-25 08:55:00,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1950816.0, ans=0.0 2023-06-25 08:55:17,090 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.413e+02 8.388e+02 1.067e+03 1.531e+03 4.094e+03, threshold=2.133e+03, percent-clipped=12.0 2023-06-25 08:55:20,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1950816.0, ans=0.1 2023-06-25 08:55:25,365 INFO [train.py:996] (2/4) Epoch 11, batch 20200, loss[loss=0.2443, simple_loss=0.3777, pruned_loss=0.05541, over 19868.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3216, pruned_loss=0.0834, over 4275631.36 frames. ], batch size: 702, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:55:25,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1950876.0, ans=0.2 2023-06-25 08:55:33,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1950876.0, ans=0.125 2023-06-25 08:55:34,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.25 vs. 
limit=10.0 2023-06-25 08:55:50,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1950936.0, ans=0.0 2023-06-25 08:56:04,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1950996.0, ans=0.0 2023-06-25 08:56:30,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1951056.0, ans=0.1 2023-06-25 08:57:03,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1951116.0, ans=0.2 2023-06-25 08:57:12,765 INFO [train.py:996] (2/4) Epoch 11, batch 20250, loss[loss=0.206, simple_loss=0.2868, pruned_loss=0.06253, over 21427.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3211, pruned_loss=0.08188, over 4275764.46 frames. ], batch size: 211, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:57:49,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1951236.0, ans=0.125 2023-06-25 08:58:00,916 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1951296.0, ans=0.0 2023-06-25 08:58:49,506 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1951416.0, ans=10.0 2023-06-25 08:58:52,146 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 7.038e+02 1.016e+03 1.334e+03 4.106e+03, threshold=2.032e+03, percent-clipped=11.0 2023-06-25 08:59:05,781 INFO [train.py:996] (2/4) Epoch 11, batch 20300, loss[loss=0.2139, simple_loss=0.3214, pruned_loss=0.05326, over 20868.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3186, pruned_loss=0.07824, over 4265030.76 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 08:59:09,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1951476.0, ans=0.125 2023-06-25 08:59:16,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1951476.0, ans=0.125 2023-06-25 09:00:03,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.43 vs. limit=15.0 2023-06-25 09:00:43,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.56 vs. limit=6.0 2023-06-25 09:00:45,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-25 09:00:46,142 INFO [train.py:996] (2/4) Epoch 11, batch 20350, loss[loss=0.2283, simple_loss=0.3063, pruned_loss=0.0751, over 21866.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3164, pruned_loss=0.07738, over 4246731.04 frames. 
], batch size: 102, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:00:56,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1951776.0, ans=0.125 2023-06-25 09:01:09,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1951836.0, ans=0.0 2023-06-25 09:01:16,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1951836.0, ans=0.125 2023-06-25 09:01:50,317 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1951956.0, ans=0.0 2023-06-25 09:02:06,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1951956.0, ans=0.1 2023-06-25 09:02:24,587 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.181e+02 7.498e+02 1.071e+03 1.543e+03 3.638e+03, threshold=2.141e+03, percent-clipped=16.0 2023-06-25 09:02:31,883 INFO [train.py:996] (2/4) Epoch 11, batch 20400, loss[loss=0.2705, simple_loss=0.3383, pruned_loss=0.1013, over 21271.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3176, pruned_loss=0.07898, over 4236882.28 frames. ], batch size: 159, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:02:43,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1952076.0, ans=0.125 2023-06-25 09:03:37,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1952256.0, ans=0.125 2023-06-25 09:04:16,688 INFO [train.py:996] (2/4) Epoch 11, batch 20450, loss[loss=0.2309, simple_loss=0.3048, pruned_loss=0.07854, over 21493.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3202, pruned_loss=0.08255, over 4240404.75 frames. ], batch size: 131, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:04:17,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=1952376.0, ans=0.0 2023-06-25 09:05:03,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1952496.0, ans=0.0 2023-06-25 09:05:26,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1952556.0, ans=0.125 2023-06-25 09:05:55,799 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.393e+02 8.343e+02 1.181e+03 1.747e+03 3.039e+03, threshold=2.362e+03, percent-clipped=12.0 2023-06-25 09:06:02,407 INFO [train.py:996] (2/4) Epoch 11, batch 20500, loss[loss=0.2328, simple_loss=0.3032, pruned_loss=0.08126, over 21725.00 frames. ], tot_loss[loss=0.2412, simple_loss=0.3162, pruned_loss=0.08306, over 4256181.01 frames. 
], batch size: 414, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:06:59,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1952796.0, ans=0.2 2023-06-25 09:07:09,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1952856.0, ans=0.125 2023-06-25 09:07:43,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1952916.0, ans=0.125 2023-06-25 09:07:48,661 INFO [train.py:996] (2/4) Epoch 11, batch 20550, loss[loss=0.2603, simple_loss=0.3741, pruned_loss=0.07328, over 19831.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3094, pruned_loss=0.08159, over 4258858.52 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:09:28,345 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.073e+02 8.204e+02 1.449e+03 2.191e+03 5.725e+03, threshold=2.898e+03, percent-clipped=18.0 2023-06-25 09:09:40,412 INFO [train.py:996] (2/4) Epoch 11, batch 20600, loss[loss=0.2569, simple_loss=0.3252, pruned_loss=0.09435, over 21840.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3137, pruned_loss=0.0809, over 4265125.06 frames. ], batch size: 332, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:09:58,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.75 vs. limit=12.0 2023-06-25 09:10:36,216 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.43 vs. limit=15.0 2023-06-25 09:11:26,268 INFO [train.py:996] (2/4) Epoch 11, batch 20650, loss[loss=0.2122, simple_loss=0.2648, pruned_loss=0.07983, over 20218.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3092, pruned_loss=0.0807, over 4266088.44 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:13:04,255 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.039e+02 6.601e+02 8.640e+02 1.224e+03 2.485e+03, threshold=1.728e+03, percent-clipped=0.0 2023-06-25 09:13:15,937 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.58 vs. limit=10.0 2023-06-25 09:13:16,527 INFO [train.py:996] (2/4) Epoch 11, batch 20700, loss[loss=0.1963, simple_loss=0.2677, pruned_loss=0.0625, over 21265.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3009, pruned_loss=0.07684, over 4252611.54 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:13:34,117 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1953936.0, ans=0.125 2023-06-25 09:14:11,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=1953996.0, ans=0.09899494936611666 2023-06-25 09:14:19,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1954056.0, ans=0.125 2023-06-25 09:15:02,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1954116.0, ans=0.1 2023-06-25 09:15:07,997 INFO [train.py:996] (2/4) Epoch 11, batch 20750, loss[loss=0.2087, simple_loss=0.2897, pruned_loss=0.06379, over 21598.00 frames. 
], tot_loss[loss=0.2285, simple_loss=0.304, pruned_loss=0.07656, over 4250405.40 frames. ], batch size: 230, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:15:26,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1954236.0, ans=0.0 2023-06-25 09:16:01,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1954296.0, ans=0.0 2023-06-25 09:16:03,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1954296.0, ans=0.04949747468305833 2023-06-25 09:16:06,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1954296.0, ans=0.125 2023-06-25 09:16:48,247 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.092e+02 1.287e+03 1.980e+03 4.706e+03, threshold=2.574e+03, percent-clipped=34.0 2023-06-25 09:16:55,068 INFO [train.py:996] (2/4) Epoch 11, batch 20800, loss[loss=0.19, simple_loss=0.2576, pruned_loss=0.06118, over 21405.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3074, pruned_loss=0.07758, over 4251932.98 frames. ], batch size: 194, lr: 2.63e-03, grad_scale: 32.0 2023-06-25 09:17:28,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1954536.0, ans=0.125 2023-06-25 09:17:31,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1954536.0, ans=0.1 2023-06-25 09:17:50,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1954596.0, ans=0.1 2023-06-25 09:17:55,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1954596.0, ans=0.125 2023-06-25 09:17:57,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1954596.0, ans=0.125 2023-06-25 09:18:40,344 INFO [train.py:996] (2/4) Epoch 11, batch 20850, loss[loss=0.1675, simple_loss=0.2398, pruned_loss=0.04763, over 21743.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2998, pruned_loss=0.0754, over 4255983.41 frames. ], batch size: 247, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:18:48,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1954776.0, ans=0.125 2023-06-25 09:19:17,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1954836.0, ans=0.2 2023-06-25 09:20:02,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2.whitening_limit, batch_count=1954956.0, ans=15.0 2023-06-25 09:20:19,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1955016.0, ans=0.125 2023-06-25 09:20:20,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.513e+02 8.708e+02 1.139e+03 1.646e+03 3.626e+03, threshold=2.277e+03, percent-clipped=8.0 2023-06-25 09:20:25,821 INFO [train.py:996] (2/4) Epoch 11, batch 20900, loss[loss=0.214, simple_loss=0.2903, pruned_loss=0.06885, over 21611.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3013, pruned_loss=0.07635, over 4265060.83 frames. 
], batch size: 230, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:22:08,606 INFO [train.py:996] (2/4) Epoch 11, batch 20950, loss[loss=0.1801, simple_loss=0.255, pruned_loss=0.05263, over 21074.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2974, pruned_loss=0.07326, over 4253241.14 frames. ], batch size: 608, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:22:29,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1955436.0, ans=0.0 2023-06-25 09:22:38,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1955436.0, ans=0.1 2023-06-25 09:23:40,530 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.098e+02 8.503e+02 1.270e+03 1.885e+03 4.065e+03, threshold=2.540e+03, percent-clipped=13.0 2023-06-25 09:23:45,525 INFO [train.py:996] (2/4) Epoch 11, batch 21000, loss[loss=0.2312, simple_loss=0.3413, pruned_loss=0.06058, over 19877.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.2966, pruned_loss=0.07381, over 4263364.33 frames. ], batch size: 702, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:23:45,526 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 09:23:59,731 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.0.layers.1.self_attn_weights, attn_weights_entropy = tensor([4.6492, 4.1099, 4.2182, 4.4432], device='cuda:2') 2023-06-25 09:24:03,609 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2627, simple_loss=0.3591, pruned_loss=0.08313, over 1796401.00 frames. 2023-06-25 09:24:03,609 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 09:24:04,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1955676.0, ans=0.125 2023-06-25 09:24:04,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=12.0 2023-06-25 09:24:07,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1955676.0, ans=0.0 2023-06-25 09:24:10,947 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=12.0 2023-06-25 09:25:30,788 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1955916.0, ans=0.0 2023-06-25 09:25:40,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1955916.0, ans=0.0 2023-06-25 09:25:46,567 INFO [train.py:996] (2/4) Epoch 11, batch 21050, loss[loss=0.2398, simple_loss=0.3015, pruned_loss=0.08908, over 21885.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2942, pruned_loss=0.074, over 4263599.74 frames. ], batch size: 98, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:25:47,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1955976.0, ans=0.1 2023-06-25 09:25:53,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1955976.0, ans=0.0 2023-06-25 09:26:02,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.95 vs. 
limit=12.0 2023-06-25 09:27:27,183 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.534e+02 6.581e+02 8.702e+02 1.278e+03 3.016e+03, threshold=1.740e+03, percent-clipped=3.0 2023-06-25 09:27:30,692 INFO [train.py:996] (2/4) Epoch 11, batch 21100, loss[loss=0.1735, simple_loss=0.2357, pruned_loss=0.05563, over 21204.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2902, pruned_loss=0.07402, over 4249377.90 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:27:34,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1956276.0, ans=0.125 2023-06-25 09:28:02,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1956336.0, ans=0.125 2023-06-25 09:28:03,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1956336.0, ans=0.0 2023-06-25 09:29:15,549 INFO [train.py:996] (2/4) Epoch 11, batch 21150, loss[loss=0.1932, simple_loss=0.2578, pruned_loss=0.06423, over 21266.00 frames. ], tot_loss[loss=0.2178, simple_loss=0.2875, pruned_loss=0.07405, over 4251056.80 frames. ], batch size: 144, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:29:25,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1956576.0, ans=0.2 2023-06-25 09:29:44,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1956636.0, ans=0.1 2023-06-25 09:29:50,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1956636.0, ans=0.125 2023-06-25 09:30:55,549 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.582e+02 7.371e+02 1.068e+03 1.667e+03 5.764e+03, threshold=2.137e+03, percent-clipped=24.0 2023-06-25 09:30:59,116 INFO [train.py:996] (2/4) Epoch 11, batch 21200, loss[loss=0.1764, simple_loss=0.2543, pruned_loss=0.04927, over 21201.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2834, pruned_loss=0.07299, over 4257277.57 frames. ], batch size: 548, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:31:12,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1956876.0, ans=0.1 2023-06-25 09:32:10,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1957056.0, ans=0.125 2023-06-25 09:32:38,420 INFO [train.py:996] (2/4) Epoch 11, batch 21250, loss[loss=0.2152, simple_loss=0.2867, pruned_loss=0.07188, over 21493.00 frames. ], tot_loss[loss=0.2141, simple_loss=0.2819, pruned_loss=0.07315, over 4253306.94 frames. ], batch size: 212, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:32:39,453 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.49 vs. 
limit=10.0 2023-06-25 09:33:11,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1957236.0, ans=0.125 2023-06-25 09:33:21,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=1957296.0, ans=0.0 2023-06-25 09:33:32,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1957296.0, ans=0.125 2023-06-25 09:34:16,840 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.832e+02 8.531e+02 1.344e+03 2.187e+03 4.666e+03, threshold=2.689e+03, percent-clipped=25.0 2023-06-25 09:34:18,281 INFO [train.py:996] (2/4) Epoch 11, batch 21300, loss[loss=0.2649, simple_loss=0.3382, pruned_loss=0.09577, over 21771.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2904, pruned_loss=0.07595, over 4261087.40 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:34:33,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1957536.0, ans=0.0 2023-06-25 09:34:45,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1957536.0, ans=0.125 2023-06-25 09:35:06,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1957596.0, ans=0.125 2023-06-25 09:35:13,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1957596.0, ans=0.0 2023-06-25 09:35:14,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1957596.0, ans=0.125 2023-06-25 09:35:31,533 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1957656.0, ans=0.0 2023-06-25 09:35:49,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1957716.0, ans=0.125 2023-06-25 09:36:01,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1957716.0, ans=0.2 2023-06-25 09:36:04,261 INFO [train.py:996] (2/4) Epoch 11, batch 21350, loss[loss=0.2026, simple_loss=0.2734, pruned_loss=0.06596, over 21769.00 frames. ], tot_loss[loss=0.2222, simple_loss=0.2934, pruned_loss=0.07552, over 4265668.29 frames. ], batch size: 112, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:37:00,591 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-25 09:37:03,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1957896.0, ans=0.0 2023-06-25 09:37:40,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1958016.0, ans=0.125 2023-06-25 09:37:55,867 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.389e+02 7.142e+02 1.027e+03 1.660e+03 3.891e+03, threshold=2.053e+03, percent-clipped=5.0 2023-06-25 09:37:57,539 INFO [train.py:996] (2/4) Epoch 11, batch 21400, loss[loss=0.2393, simple_loss=0.3199, pruned_loss=0.07933, over 21296.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.2962, pruned_loss=0.07415, over 4267811.24 frames. 
], batch size: 159, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:38:07,626 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:39:06,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1958256.0, ans=10.0 2023-06-25 09:39:06,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1958256.0, ans=0.2 2023-06-25 09:39:31,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1958316.0, ans=0.0 2023-06-25 09:39:41,635 INFO [train.py:996] (2/4) Epoch 11, batch 21450, loss[loss=0.243, simple_loss=0.3127, pruned_loss=0.08666, over 21736.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.3001, pruned_loss=0.07628, over 4272964.55 frames. ], batch size: 389, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:40:59,552 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.21 vs. limit=15.0 2023-06-25 09:41:25,332 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.203e+02 7.277e+02 9.911e+02 1.372e+03 2.622e+03, threshold=1.982e+03, percent-clipped=4.0 2023-06-25 09:41:26,994 INFO [train.py:996] (2/4) Epoch 11, batch 21500, loss[loss=0.2064, simple_loss=0.2674, pruned_loss=0.07269, over 21191.00 frames. ], tot_loss[loss=0.2254, simple_loss=0.2977, pruned_loss=0.0765, over 4280189.00 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:43:11,045 INFO [train.py:996] (2/4) Epoch 11, batch 21550, loss[loss=0.2129, simple_loss=0.3329, pruned_loss=0.04643, over 19798.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2925, pruned_loss=0.07424, over 4270681.06 frames. ], batch size: 703, lr: 2.63e-03, grad_scale: 8.0 2023-06-25 09:44:03,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=1959096.0, ans=0.95 2023-06-25 09:44:07,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1959096.0, ans=0.125 2023-06-25 09:44:30,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=1959216.0, ans=15.0 2023-06-25 09:44:51,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.933e+02 7.990e+02 1.429e+03 2.000e+03 5.379e+03, threshold=2.857e+03, percent-clipped=25.0 2023-06-25 09:44:53,165 INFO [train.py:996] (2/4) Epoch 11, batch 21600, loss[loss=0.2102, simple_loss=0.271, pruned_loss=0.07471, over 21591.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2884, pruned_loss=0.07347, over 4262978.06 frames. 
], batch size: 415, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:44:57,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=1959276.0, ans=0.2 2023-06-25 09:45:04,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1959276.0, ans=0.125 2023-06-25 09:45:21,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1959336.0, ans=0.125 2023-06-25 09:45:26,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1959336.0, ans=0.125 2023-06-25 09:45:31,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1959336.0, ans=0.125 2023-06-25 09:45:50,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1959396.0, ans=0.125 2023-06-25 09:46:40,918 INFO [train.py:996] (2/4) Epoch 11, batch 21650, loss[loss=0.2147, simple_loss=0.3046, pruned_loss=0.06244, over 21291.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2937, pruned_loss=0.07208, over 4258060.08 frames. ], batch size: 176, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:46:41,951 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.35 vs. limit=6.0 2023-06-25 09:47:10,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1959636.0, ans=0.2 2023-06-25 09:47:13,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1959636.0, ans=0.125 2023-06-25 09:47:33,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=1959696.0, ans=0.2 2023-06-25 09:47:57,205 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-25 09:48:25,900 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.099e+02 8.539e+02 1.351e+03 1.899e+03 3.491e+03, threshold=2.702e+03, percent-clipped=7.0 2023-06-25 09:48:26,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1959876.0, ans=0.1 2023-06-25 09:48:27,770 INFO [train.py:996] (2/4) Epoch 11, batch 21700, loss[loss=0.2126, simple_loss=0.2802, pruned_loss=0.07245, over 21748.00 frames. ], tot_loss[loss=0.218, simple_loss=0.2945, pruned_loss=0.07076, over 4264489.73 frames. ], batch size: 371, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:48:49,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1959936.0, ans=0.0 2023-06-25 09:49:09,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1959936.0, ans=0.125 2023-06-25 09:49:33,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=1960056.0, ans=0.125 2023-06-25 09:50:12,970 INFO [train.py:996] (2/4) Epoch 11, batch 21750, loss[loss=0.1881, simple_loss=0.2631, pruned_loss=0.05656, over 16334.00 frames. 
], tot_loss[loss=0.215, simple_loss=0.2899, pruned_loss=0.07003, over 4252142.93 frames. ], batch size: 65, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:50:24,547 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1960176.0, ans=0.2 2023-06-25 09:50:33,906 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1960236.0, ans=0.125 2023-06-25 09:51:08,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.53 vs. limit=10.0 2023-06-25 09:51:11,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1960296.0, ans=0.125 2023-06-25 09:51:47,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.11 vs. limit=15.0 2023-06-25 09:51:58,428 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.456e+02 8.216e+02 1.100e+03 1.452e+03 3.027e+03, threshold=2.200e+03, percent-clipped=1.0 2023-06-25 09:51:58,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1960476.0, ans=0.0 2023-06-25 09:51:59,871 INFO [train.py:996] (2/4) Epoch 11, batch 21800, loss[loss=0.1907, simple_loss=0.2621, pruned_loss=0.05964, over 21791.00 frames. ], tot_loss[loss=0.2159, simple_loss=0.2888, pruned_loss=0.07152, over 4248598.65 frames. ], batch size: 118, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:52:20,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.97 vs. limit=15.0 2023-06-25 09:53:04,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1960656.0, ans=0.0 2023-06-25 09:53:04,909 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-25 09:53:18,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.93 vs. limit=6.0 2023-06-25 09:53:28,854 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=1960716.0, ans=0.2 2023-06-25 09:53:34,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1960716.0, ans=0.125 2023-06-25 09:53:35,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.09 vs. limit=12.0 2023-06-25 09:53:45,084 INFO [train.py:996] (2/4) Epoch 11, batch 21850, loss[loss=0.2418, simple_loss=0.3619, pruned_loss=0.06087, over 20802.00 frames. ], tot_loss[loss=0.2186, simple_loss=0.294, pruned_loss=0.07161, over 4248996.62 frames. ], batch size: 607, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:54:01,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.17 vs. 
limit=15.0 2023-06-25 09:55:25,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1961016.0, ans=0.125 2023-06-25 09:55:27,837 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.221e+02 7.360e+02 1.053e+03 1.458e+03 3.571e+03, threshold=2.107e+03, percent-clipped=7.0 2023-06-25 09:55:35,127 INFO [train.py:996] (2/4) Epoch 11, batch 21900, loss[loss=0.2282, simple_loss=0.3007, pruned_loss=0.07784, over 21815.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2958, pruned_loss=0.07274, over 4256905.20 frames. ], batch size: 441, lr: 2.63e-03, grad_scale: 16.0 2023-06-25 09:56:49,353 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:57:07,886 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 09:57:20,406 INFO [train.py:996] (2/4) Epoch 11, batch 21950, loss[loss=0.1723, simple_loss=0.2607, pruned_loss=0.04199, over 21197.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2899, pruned_loss=0.07186, over 4261938.45 frames. ], batch size: 548, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 09:57:41,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1961436.0, ans=0.125 2023-06-25 09:58:04,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1961496.0, ans=0.0 2023-06-25 09:58:53,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1961616.0, ans=0.1 2023-06-25 09:58:56,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1961616.0, ans=0.125 2023-06-25 09:58:57,733 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.305e+02 6.629e+02 8.784e+02 1.230e+03 3.737e+03, threshold=1.757e+03, percent-clipped=5.0 2023-06-25 09:58:59,377 INFO [train.py:996] (2/4) Epoch 11, batch 22000, loss[loss=0.2582, simple_loss=0.3079, pruned_loss=0.1042, over 21354.00 frames. ], tot_loss[loss=0.2122, simple_loss=0.2849, pruned_loss=0.06972, over 4264785.85 frames. ], batch size: 473, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 09:59:26,833 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1961736.0, ans=0.125 2023-06-25 09:59:37,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1961736.0, ans=0.1 2023-06-25 10:00:01,469 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.55 vs. limit=22.5 2023-06-25 10:00:04,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1961856.0, ans=0.0 2023-06-25 10:00:16,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.26 vs. limit=15.0 2023-06-25 10:00:31,975 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:00:50,097 INFO [train.py:996] (2/4) Epoch 11, batch 22050, loss[loss=0.2409, simple_loss=0.3214, pruned_loss=0.08023, over 21456.00 frames. 
], tot_loss[loss=0.215, simple_loss=0.289, pruned_loss=0.07045, over 4252210.15 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:00:52,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.67 vs. limit=15.0 2023-06-25 10:01:06,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.76 vs. limit=10.0 2023-06-25 10:01:29,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1962096.0, ans=0.125 2023-06-25 10:01:33,402 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.19 vs. limit=6.0 2023-06-25 10:01:34,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1962096.0, ans=0.1 2023-06-25 10:01:53,984 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-25 10:02:09,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=1962156.0, ans=0.5 2023-06-25 10:02:16,526 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1962216.0, ans=0.2 2023-06-25 10:02:37,404 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.423e+02 9.011e+02 1.269e+03 1.922e+03 5.194e+03, threshold=2.539e+03, percent-clipped=30.0 2023-06-25 10:02:37,426 INFO [train.py:996] (2/4) Epoch 11, batch 22100, loss[loss=0.2551, simple_loss=0.3197, pruned_loss=0.09526, over 21367.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3017, pruned_loss=0.07679, over 4263468.15 frames. ], batch size: 144, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:03:21,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1962396.0, ans=0.2 2023-06-25 10:04:23,276 INFO [train.py:996] (2/4) Epoch 11, batch 22150, loss[loss=0.2221, simple_loss=0.2939, pruned_loss=0.0751, over 21523.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3051, pruned_loss=0.07904, over 4270895.26 frames. ], batch size: 548, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:04:25,814 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.17 vs. limit=22.5 2023-06-25 10:04:27,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1962576.0, ans=0.125 2023-06-25 10:04:35,187 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.74 vs. 
limit=15.0 2023-06-25 10:05:06,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=1962696.0, ans=0.0 2023-06-25 10:05:36,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1962756.0, ans=0.0 2023-06-25 10:05:38,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1962756.0, ans=0.1 2023-06-25 10:06:10,659 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.244e+02 8.641e+02 1.312e+03 2.175e+03 4.145e+03, threshold=2.624e+03, percent-clipped=16.0 2023-06-25 10:06:10,691 INFO [train.py:996] (2/4) Epoch 11, batch 22200, loss[loss=0.2107, simple_loss=0.2742, pruned_loss=0.07363, over 21694.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3068, pruned_loss=0.07977, over 4280272.41 frames. ], batch size: 263, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:06:32,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.04 vs. limit=15.0 2023-06-25 10:06:36,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1962936.0, ans=0.2 2023-06-25 10:07:15,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1963056.0, ans=0.125 2023-06-25 10:07:47,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1963116.0, ans=0.0 2023-06-25 10:07:56,903 INFO [train.py:996] (2/4) Epoch 11, batch 22250, loss[loss=0.2213, simple_loss=0.3346, pruned_loss=0.05397, over 19864.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3117, pruned_loss=0.08102, over 4285282.11 frames. ], batch size: 703, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:08:32,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1963236.0, ans=0.125 2023-06-25 10:08:37,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1963296.0, ans=0.0 2023-06-25 10:09:36,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1963416.0, ans=0.2 2023-06-25 10:09:41,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1963416.0, ans=0.125 2023-06-25 10:09:44,566 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.502e+02 7.191e+02 1.032e+03 1.470e+03 3.757e+03, threshold=2.063e+03, percent-clipped=7.0 2023-06-25 10:09:44,588 INFO [train.py:996] (2/4) Epoch 11, batch 22300, loss[loss=0.2437, simple_loss=0.3156, pruned_loss=0.08588, over 21717.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3134, pruned_loss=0.08301, over 4289107.13 frames. ], batch size: 414, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:10:01,023 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=22.5 2023-06-25 10:11:34,582 INFO [train.py:996] (2/4) Epoch 11, batch 22350, loss[loss=0.2575, simple_loss=0.3104, pruned_loss=0.1023, over 21803.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3115, pruned_loss=0.08369, over 4289684.51 frames. 
], batch size: 508, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:11:43,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1963776.0, ans=0.125 2023-06-25 10:11:47,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1963776.0, ans=0.125 2023-06-25 10:12:26,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1963896.0, ans=0.125 2023-06-25 10:12:35,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1963956.0, ans=0.0 2023-06-25 10:13:14,856 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.10 vs. limit=6.0 2023-06-25 10:13:21,900 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 7.078e+02 9.389e+02 1.336e+03 2.790e+03, threshold=1.878e+03, percent-clipped=4.0 2023-06-25 10:13:21,921 INFO [train.py:996] (2/4) Epoch 11, batch 22400, loss[loss=0.2051, simple_loss=0.2753, pruned_loss=0.0675, over 21476.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3082, pruned_loss=0.08063, over 4286610.61 frames. ], batch size: 194, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 10:13:32,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1964076.0, ans=0.1 2023-06-25 10:14:10,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=1964196.0, ans=0.125 2023-06-25 10:14:31,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1964256.0, ans=10.0 2023-06-25 10:15:05,395 INFO [train.py:996] (2/4) Epoch 11, batch 22450, loss[loss=0.2014, simple_loss=0.2617, pruned_loss=0.07058, over 21586.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.302, pruned_loss=0.07967, over 4288647.34 frames. ], batch size: 231, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:15:27,292 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=1964436.0, ans=0.2 2023-06-25 10:15:38,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1964436.0, ans=0.0 2023-06-25 10:16:22,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1964556.0, ans=0.0 2023-06-25 10:16:24,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1964556.0, ans=0.125 2023-06-25 10:16:41,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1964616.0, ans=0.125 2023-06-25 10:16:53,968 INFO [train.py:996] (2/4) Epoch 11, batch 22500, loss[loss=0.2617, simple_loss=0.3639, pruned_loss=0.07974, over 21631.00 frames. ], tot_loss[loss=0.2274, simple_loss=0.2974, pruned_loss=0.07868, over 4272660.50 frames. 
], batch size: 414, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:16:55,659 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.803e+02 7.459e+02 1.048e+03 1.318e+03 4.554e+03, threshold=2.097e+03, percent-clipped=12.0 2023-06-25 10:16:56,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1964676.0, ans=0.2 2023-06-25 10:16:56,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.48 vs. limit=12.0 2023-06-25 10:17:45,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1964796.0, ans=0.2 2023-06-25 10:18:05,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1964856.0, ans=0.125 2023-06-25 10:18:09,166 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=14.64 vs. limit=15.0 2023-06-25 10:18:18,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1964916.0, ans=0.125 2023-06-25 10:18:41,429 INFO [train.py:996] (2/4) Epoch 11, batch 22550, loss[loss=0.1998, simple_loss=0.277, pruned_loss=0.06128, over 21843.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3022, pruned_loss=0.07944, over 4277069.81 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:19:51,500 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=15.0 2023-06-25 10:20:21,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1965216.0, ans=0.0 2023-06-25 10:20:21,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1965216.0, ans=0.125 2023-06-25 10:20:36,191 INFO [train.py:996] (2/4) Epoch 11, batch 22600, loss[loss=0.1955, simple_loss=0.2625, pruned_loss=0.06421, over 21779.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3068, pruned_loss=0.08016, over 4282650.02 frames. ], batch size: 112, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:20:39,510 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.096e+02 1.052e+03 1.426e+03 2.192e+03 4.902e+03, threshold=2.852e+03, percent-clipped=27.0 2023-06-25 10:20:54,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=1965276.0, ans=0.2 2023-06-25 10:20:59,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1965336.0, ans=0.07 2023-06-25 10:21:04,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1965336.0, ans=0.125 2023-06-25 10:21:48,276 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1965456.0, ans=0.0 2023-06-25 10:21:57,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1965516.0, ans=0.2 2023-06-25 10:22:20,050 INFO [train.py:996] (2/4) Epoch 11, batch 22650, loss[loss=0.2415, simple_loss=0.3019, pruned_loss=0.09051, over 21869.00 frames. 
], tot_loss[loss=0.2307, simple_loss=0.3026, pruned_loss=0.0794, over 4279171.28 frames. ], batch size: 373, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:22:20,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1965576.0, ans=0.125 2023-06-25 10:24:01,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1965876.0, ans=0.0 2023-06-25 10:24:02,898 INFO [train.py:996] (2/4) Epoch 11, batch 22700, loss[loss=0.1985, simple_loss=0.2628, pruned_loss=0.0671, over 21742.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2957, pruned_loss=0.0783, over 4280038.75 frames. ], batch size: 124, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:24:06,047 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.606e+02 1.053e+03 1.643e+03 3.332e+03, threshold=2.107e+03, percent-clipped=4.0 2023-06-25 10:24:08,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1965876.0, ans=0.125 2023-06-25 10:24:49,116 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.13 vs. limit=15.0 2023-06-25 10:24:59,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=1965996.0, ans=0.125 2023-06-25 10:25:50,180 INFO [train.py:996] (2/4) Epoch 11, batch 22750, loss[loss=0.2864, simple_loss=0.3472, pruned_loss=0.1127, over 21426.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3002, pruned_loss=0.08111, over 4274896.64 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:26:01,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1966176.0, ans=0.125 2023-06-25 10:26:45,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.21 vs. limit=10.0 2023-06-25 10:26:45,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.69 vs. limit=15.0 2023-06-25 10:27:39,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1966476.0, ans=10.0 2023-06-25 10:27:40,876 INFO [train.py:996] (2/4) Epoch 11, batch 22800, loss[loss=0.2865, simple_loss=0.3339, pruned_loss=0.1196, over 21728.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3051, pruned_loss=0.08375, over 4280333.22 frames. ], batch size: 508, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:27:41,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1966476.0, ans=0.125 2023-06-25 10:27:44,251 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.117e+02 8.724e+02 1.380e+03 2.378e+03 6.132e+03, threshold=2.761e+03, percent-clipped=34.0 2023-06-25 10:28:10,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1966536.0, ans=0.125 2023-06-25 10:28:26,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.95 vs. 
limit=15.0 2023-06-25 10:29:25,743 INFO [train.py:996] (2/4) Epoch 11, batch 22850, loss[loss=0.2344, simple_loss=0.2917, pruned_loss=0.08851, over 21624.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3011, pruned_loss=0.08276, over 4277485.35 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:29:44,455 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.88 vs. limit=12.0 2023-06-25 10:30:10,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1966896.0, ans=0.125 2023-06-25 10:30:29,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=1966956.0, ans=0.04949747468305833 2023-06-25 10:30:31,600 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=22.5 2023-06-25 10:31:09,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1967016.0, ans=10.0 2023-06-25 10:31:12,050 INFO [train.py:996] (2/4) Epoch 11, batch 22900, loss[loss=0.2873, simple_loss=0.3625, pruned_loss=0.1061, over 21249.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3028, pruned_loss=0.08158, over 4278087.56 frames. ], batch size: 548, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:31:15,872 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.509e+02 6.842e+02 1.024e+03 1.500e+03 4.089e+03, threshold=2.047e+03, percent-clipped=2.0 2023-06-25 10:31:24,021 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-25 10:31:32,034 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 10:32:59,796 INFO [train.py:996] (2/4) Epoch 11, batch 22950, loss[loss=0.2368, simple_loss=0.3431, pruned_loss=0.06524, over 21404.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3138, pruned_loss=0.07955, over 4279453.31 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:33:13,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1967376.0, ans=0.125 2023-06-25 10:34:44,064 INFO [train.py:996] (2/4) Epoch 11, batch 23000, loss[loss=0.2385, simple_loss=0.3095, pruned_loss=0.08374, over 21857.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3136, pruned_loss=0.07795, over 4283506.81 frames. 
], batch size: 371, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:34:44,585 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1967676.0, ans=0.125 2023-06-25 10:34:47,327 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.236e+02 8.204e+02 1.340e+03 2.035e+03 4.542e+03, threshold=2.680e+03, percent-clipped=23.0 2023-06-25 10:36:02,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1967856.0, ans=0.0 2023-06-25 10:36:15,757 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1967916.0, ans=0.125 2023-06-25 10:36:18,159 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.51 vs. limit=15.0 2023-06-25 10:36:31,924 INFO [train.py:996] (2/4) Epoch 11, batch 23050, loss[loss=0.2709, simple_loss=0.337, pruned_loss=0.1024, over 21449.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3147, pruned_loss=0.07939, over 4277052.44 frames. ], batch size: 131, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:36:46,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=1967976.0, ans=0.95 2023-06-25 10:37:48,624 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-25 10:37:50,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1968156.0, ans=0.04949747468305833 2023-06-25 10:37:57,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1968156.0, ans=0.0 2023-06-25 10:38:06,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-25 10:38:18,101 INFO [train.py:996] (2/4) Epoch 11, batch 23100, loss[loss=0.217, simple_loss=0.2827, pruned_loss=0.0757, over 21812.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3109, pruned_loss=0.08056, over 4273497.54 frames. ], batch size: 118, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:38:18,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1968276.0, ans=0.2 2023-06-25 10:38:29,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.702e+02 7.516e+02 1.022e+03 1.433e+03 4.307e+03, threshold=2.044e+03, percent-clipped=3.0 2023-06-25 10:38:53,138 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.58 vs. 
limit=15.0 2023-06-25 10:39:06,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1968396.0, ans=0.125 2023-06-25 10:39:13,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=1968396.0, ans=0.0 2023-06-25 10:39:36,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1968456.0, ans=0.125 2023-06-25 10:39:41,778 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-25 10:40:00,046 INFO [train.py:996] (2/4) Epoch 11, batch 23150, loss[loss=0.2309, simple_loss=0.3077, pruned_loss=0.077, over 20709.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3055, pruned_loss=0.07924, over 4273714.43 frames. ], batch size: 607, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:41:38,186 INFO [train.py:996] (2/4) Epoch 11, batch 23200, loss[loss=0.2152, simple_loss=0.2848, pruned_loss=0.07283, over 21375.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3055, pruned_loss=0.08022, over 4274954.12 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:41:43,101 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.976e+02 7.986e+02 1.089e+03 1.684e+03 3.717e+03, threshold=2.178e+03, percent-clipped=18.0 2023-06-25 10:42:47,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=1969056.0, ans=10.0 2023-06-25 10:42:58,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=1969056.0, ans=0.04949747468305833 2023-06-25 10:43:14,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1969116.0, ans=0.0 2023-06-25 10:43:30,216 INFO [train.py:996] (2/4) Epoch 11, batch 23250, loss[loss=0.2627, simple_loss=0.335, pruned_loss=0.09521, over 19943.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3049, pruned_loss=0.08098, over 4278118.49 frames. ], batch size: 702, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:43:36,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=15.0 2023-06-25 10:43:37,335 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1969176.0, ans=0.125 2023-06-25 10:45:03,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1969416.0, ans=0.125 2023-06-25 10:45:17,582 INFO [train.py:996] (2/4) Epoch 11, batch 23300, loss[loss=0.2022, simple_loss=0.2668, pruned_loss=0.06877, over 21187.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3122, pruned_loss=0.08322, over 4281654.56 frames. 
], batch size: 608, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:45:18,044 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1969476.0, ans=0.0 2023-06-25 10:45:22,822 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.331e+02 7.944e+02 1.056e+03 1.535e+03 4.546e+03, threshold=2.112e+03, percent-clipped=10.0 2023-06-25 10:46:41,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1969656.0, ans=0.1 2023-06-25 10:46:43,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1969656.0, ans=0.0 2023-06-25 10:47:03,382 INFO [train.py:996] (2/4) Epoch 11, batch 23350, loss[loss=0.179, simple_loss=0.2466, pruned_loss=0.0557, over 21899.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3163, pruned_loss=0.08289, over 4285548.02 frames. ], batch size: 107, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:47:31,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1969836.0, ans=0.0 2023-06-25 10:48:14,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1969956.0, ans=0.125 2023-06-25 10:48:54,014 INFO [train.py:996] (2/4) Epoch 11, batch 23400, loss[loss=0.2139, simple_loss=0.2877, pruned_loss=0.07009, over 21506.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3096, pruned_loss=0.07885, over 4286772.45 frames. ], batch size: 131, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:49:04,442 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1970076.0, ans=0.0 2023-06-25 10:49:07,024 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.051e+02 8.471e+02 1.302e+03 1.874e+03 3.604e+03, threshold=2.604e+03, percent-clipped=20.0 2023-06-25 10:49:11,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1970076.0, ans=0.125 2023-06-25 10:49:36,590 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1970136.0, ans=0.0 2023-06-25 10:49:42,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=1970196.0, ans=10.0 2023-06-25 10:50:48,355 INFO [train.py:996] (2/4) Epoch 11, batch 23450, loss[loss=0.2465, simple_loss=0.3174, pruned_loss=0.08784, over 21831.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3105, pruned_loss=0.08013, over 4285879.66 frames. ], batch size: 441, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:52:16,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1970616.0, ans=0.0 2023-06-25 10:52:34,805 INFO [train.py:996] (2/4) Epoch 11, batch 23500, loss[loss=0.2233, simple_loss=0.323, pruned_loss=0.06183, over 19990.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3121, pruned_loss=0.08169, over 4287642.14 frames. 
], batch size: 702, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:52:41,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.140e+02 8.381e+02 1.197e+03 1.768e+03 4.081e+03, threshold=2.394e+03, percent-clipped=6.0 2023-06-25 10:53:16,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1970796.0, ans=0.125 2023-06-25 10:53:24,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1970796.0, ans=0.125 2023-06-25 10:53:29,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1970796.0, ans=0.0 2023-06-25 10:53:34,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1970856.0, ans=0.0 2023-06-25 10:54:16,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1970916.0, ans=0.125 2023-06-25 10:54:19,575 INFO [train.py:996] (2/4) Epoch 11, batch 23550, loss[loss=0.2582, simple_loss=0.2951, pruned_loss=0.1106, over 21411.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3069, pruned_loss=0.08222, over 4286016.24 frames. ], batch size: 508, lr: 2.62e-03, grad_scale: 8.0 2023-06-25 10:54:27,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1970976.0, ans=0.1 2023-06-25 10:55:23,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten.whitening_limit, batch_count=1971156.0, ans=15.0 2023-06-25 10:56:04,918 INFO [train.py:996] (2/4) Epoch 11, batch 23600, loss[loss=0.241, simple_loss=0.326, pruned_loss=0.07801, over 21579.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3065, pruned_loss=0.08173, over 4287918.77 frames. ], batch size: 414, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:56:17,383 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.780e+02 1.013e+03 1.475e+03 2.570e+03, threshold=2.026e+03, percent-clipped=2.0 2023-06-25 10:56:17,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1971276.0, ans=0.0 2023-06-25 10:56:37,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff2.min_abs, batch_count=1971336.0, ans=0.1 2023-06-25 10:56:41,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1971336.0, ans=0.0 2023-06-25 10:57:56,464 INFO [train.py:996] (2/4) Epoch 11, batch 23650, loss[loss=0.2507, simple_loss=0.3315, pruned_loss=0.085, over 21488.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3075, pruned_loss=0.08085, over 4288968.43 frames. 
], batch size: 471, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:59:26,870 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1971816.0, ans=0.1 2023-06-25 10:59:31,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1971816.0, ans=0.125 2023-06-25 10:59:34,829 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1971816.0, ans=0.2 2023-06-25 10:59:40,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=1971816.0, ans=0.05 2023-06-25 10:59:44,377 INFO [train.py:996] (2/4) Epoch 11, batch 23700, loss[loss=0.2143, simple_loss=0.2928, pruned_loss=0.06791, over 21925.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3081, pruned_loss=0.07943, over 4288019.09 frames. ], batch size: 317, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 10:59:53,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=1971876.0, ans=0.125 2023-06-25 10:59:56,586 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.633e+02 7.722e+02 1.155e+03 1.933e+03 4.444e+03, threshold=2.311e+03, percent-clipped=20.0 2023-06-25 11:00:09,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1971936.0, ans=0.125 2023-06-25 11:00:30,845 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=1971996.0, ans=0.125 2023-06-25 11:01:06,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1972056.0, ans=0.125 2023-06-25 11:01:40,514 INFO [train.py:996] (2/4) Epoch 11, batch 23750, loss[loss=0.191, simple_loss=0.2897, pruned_loss=0.04618, over 21951.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3132, pruned_loss=0.08003, over 4278133.26 frames. ], batch size: 317, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:02:55,698 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1972356.0, ans=0.125 2023-06-25 11:03:09,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1972416.0, ans=0.0 2023-06-25 11:03:21,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1972416.0, ans=0.125 2023-06-25 11:03:25,758 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.50 vs. limit=15.0 2023-06-25 11:03:28,670 INFO [train.py:996] (2/4) Epoch 11, batch 23800, loss[loss=0.1973, simple_loss=0.2722, pruned_loss=0.06118, over 21761.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3115, pruned_loss=0.07849, over 4254925.82 frames. 
], batch size: 124, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:03:29,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1972476.0, ans=0.125 2023-06-25 11:03:30,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1972476.0, ans=0.0 2023-06-25 11:03:35,170 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.508e+02 9.725e+02 1.368e+03 2.347e+03 4.369e+03, threshold=2.737e+03, percent-clipped=25.0 2023-06-25 11:03:55,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1972536.0, ans=0.0 2023-06-25 11:04:21,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1972596.0, ans=0.2 2023-06-25 11:04:23,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1972596.0, ans=0.125 2023-06-25 11:04:27,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1972596.0, ans=0.1 2023-06-25 11:04:36,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.93 vs. limit=6.0 2023-06-25 11:04:45,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.96 vs. limit=10.0 2023-06-25 11:04:54,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=1972656.0, ans=0.5 2023-06-25 11:05:16,755 INFO [train.py:996] (2/4) Epoch 11, batch 23850, loss[loss=0.2626, simple_loss=0.3283, pruned_loss=0.09851, over 21221.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3184, pruned_loss=0.0811, over 4255454.34 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:05:41,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=1972776.0, ans=0.0 2023-06-25 11:06:13,066 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1972896.0, ans=0.125 2023-06-25 11:07:14,231 INFO [train.py:996] (2/4) Epoch 11, batch 23900, loss[loss=0.2255, simple_loss=0.2882, pruned_loss=0.08143, over 21153.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.324, pruned_loss=0.08335, over 4255677.34 frames. ], batch size: 159, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:07:20,911 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.632e+02 1.020e+03 1.662e+03 2.575e+03 5.101e+03, threshold=3.324e+03, percent-clipped=22.0 2023-06-25 11:07:34,993 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.02 vs. limit=15.0 2023-06-25 11:08:30,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1973316.0, ans=0.125 2023-06-25 11:08:54,232 INFO [train.py:996] (2/4) Epoch 11, batch 23950, loss[loss=0.2185, simple_loss=0.2834, pruned_loss=0.07678, over 21188.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3181, pruned_loss=0.0822, over 4259287.24 frames. 
], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:09:33,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1973436.0, ans=0.125 2023-06-25 11:09:40,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=1973496.0, ans=0.2 2023-06-25 11:10:08,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1973556.0, ans=0.125 2023-06-25 11:10:10,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=1973556.0, ans=0.02 2023-06-25 11:10:18,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1973616.0, ans=0.2 2023-06-25 11:10:31,677 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=13.98 vs. limit=22.5 2023-06-25 11:10:47,911 INFO [train.py:996] (2/4) Epoch 11, batch 24000, loss[loss=0.2511, simple_loss=0.3196, pruned_loss=0.09133, over 21460.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3206, pruned_loss=0.08554, over 4267660.58 frames. ], batch size: 211, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:10:47,911 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 11:11:07,127 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.263, simple_loss=0.3578, pruned_loss=0.08405, over 1796401.00 frames. 2023-06-25 11:11:07,127 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 11:11:09,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1973676.0, ans=0.0 2023-06-25 11:11:13,028 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:11:14,107 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.140e+02 7.509e+02 1.143e+03 1.580e+03 3.381e+03, threshold=2.286e+03, percent-clipped=1.0 2023-06-25 11:11:26,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=1973736.0, ans=0.2 2023-06-25 11:11:42,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1973736.0, ans=0.1 2023-06-25 11:11:44,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1973736.0, ans=0.125 2023-06-25 11:11:49,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=1973796.0, ans=0.125 2023-06-25 11:12:02,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=1973856.0, ans=0.2 2023-06-25 11:12:45,550 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1973916.0, ans=0.2 2023-06-25 11:12:55,327 INFO [train.py:996] (2/4) Epoch 11, batch 24050, loss[loss=0.2022, simple_loss=0.2912, pruned_loss=0.05661, over 21607.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3206, pruned_loss=0.08467, over 4265796.42 frames. 
], batch size: 230, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:13:43,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.01 vs. limit=15.0 2023-06-25 11:14:16,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1974156.0, ans=0.125 2023-06-25 11:14:44,587 INFO [train.py:996] (2/4) Epoch 11, batch 24100, loss[loss=0.2278, simple_loss=0.3202, pruned_loss=0.06769, over 21549.00 frames. ], tot_loss[loss=0.2423, simple_loss=0.3191, pruned_loss=0.0828, over 4264206.32 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:14:50,900 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.104e+02 8.872e+02 1.198e+03 1.771e+03 4.014e+03, threshold=2.396e+03, percent-clipped=16.0 2023-06-25 11:15:12,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1974336.0, ans=0.125 2023-06-25 11:15:27,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1974396.0, ans=0.0 2023-06-25 11:16:11,768 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1974516.0, ans=0.125 2023-06-25 11:16:11,804 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1974516.0, ans=0.125 2023-06-25 11:16:29,672 INFO [train.py:996] (2/4) Epoch 11, batch 24150, loss[loss=0.1882, simple_loss=0.23, pruned_loss=0.07327, over 20026.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3198, pruned_loss=0.08473, over 4265237.30 frames. ], batch size: 703, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:17:52,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.53 vs. limit=15.0 2023-06-25 11:18:09,107 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1974816.0, ans=0.1 2023-06-25 11:18:14,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.58 vs. limit=5.0 2023-06-25 11:18:20,278 INFO [train.py:996] (2/4) Epoch 11, batch 24200, loss[loss=0.2068, simple_loss=0.3003, pruned_loss=0.05668, over 21704.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3213, pruned_loss=0.08588, over 4270341.79 frames. ], batch size: 247, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:18:32,575 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.85 vs. 
limit=22.5 2023-06-25 11:18:34,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.944e+02 9.608e+02 1.226e+03 1.956e+03 3.417e+03, threshold=2.452e+03, percent-clipped=15.0 2023-06-25 11:18:48,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1974936.0, ans=0.125 2023-06-25 11:19:19,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1974996.0, ans=0.1 2023-06-25 11:19:40,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1975056.0, ans=0.2 2023-06-25 11:20:16,265 INFO [train.py:996] (2/4) Epoch 11, batch 24250, loss[loss=0.2008, simple_loss=0.3018, pruned_loss=0.0499, over 21793.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3185, pruned_loss=0.08049, over 4268491.73 frames. ], batch size: 282, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:20:45,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1975236.0, ans=0.125 2023-06-25 11:21:32,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1975356.0, ans=0.0 2023-06-25 11:22:03,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1975476.0, ans=0.0 2023-06-25 11:22:04,802 INFO [train.py:996] (2/4) Epoch 11, batch 24300, loss[loss=0.1696, simple_loss=0.261, pruned_loss=0.03909, over 21627.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3128, pruned_loss=0.07488, over 4267102.42 frames. ], batch size: 230, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:22:12,744 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.467e+02 7.478e+02 1.137e+03 1.748e+03 3.902e+03, threshold=2.274e+03, percent-clipped=10.0 2023-06-25 11:22:29,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1975536.0, ans=0.1 2023-06-25 11:22:55,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1975596.0, ans=0.07 2023-06-25 11:23:03,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1975596.0, ans=0.125 2023-06-25 11:23:05,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1975596.0, ans=0.0 2023-06-25 11:23:32,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1975716.0, ans=0.125 2023-06-25 11:23:49,703 INFO [train.py:996] (2/4) Epoch 11, batch 24350, loss[loss=0.2576, simple_loss=0.3205, pruned_loss=0.09736, over 21241.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.31, pruned_loss=0.07433, over 4273325.57 frames. ], batch size: 143, lr: 2.62e-03, grad_scale: 16.0 2023-06-25 11:24:25,249 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.79 vs. 
limit=15.0 2023-06-25 11:24:26,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=1975836.0, ans=0.07 2023-06-25 11:24:42,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=1975896.0, ans=0.2 2023-06-25 11:25:04,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1975956.0, ans=0.0 2023-06-25 11:25:14,827 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1975956.0, ans=0.125 2023-06-25 11:25:16,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1975956.0, ans=0.1 2023-06-25 11:25:26,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1976016.0, ans=0.1 2023-06-25 11:25:43,128 INFO [train.py:996] (2/4) Epoch 11, batch 24400, loss[loss=0.3374, simple_loss=0.3879, pruned_loss=0.1434, over 21460.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3148, pruned_loss=0.079, over 4280557.25 frames. ], batch size: 509, lr: 2.62e-03, grad_scale: 32.0 2023-06-25 11:26:00,904 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.788e+02 8.688e+02 1.209e+03 1.955e+03 3.228e+03, threshold=2.419e+03, percent-clipped=16.0 2023-06-25 11:26:06,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1976136.0, ans=0.1 2023-06-25 11:26:52,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1976256.0, ans=0.125 2023-06-25 11:27:03,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1976256.0, ans=0.125 2023-06-25 11:27:36,881 INFO [train.py:996] (2/4) Epoch 11, batch 24450, loss[loss=0.281, simple_loss=0.37, pruned_loss=0.09597, over 21682.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3163, pruned_loss=0.0799, over 4282595.00 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:27:44,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=1976376.0, ans=0.2 2023-06-25 11:27:57,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1976436.0, ans=0.0 2023-06-25 11:28:07,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1976436.0, ans=0.1 2023-06-25 11:28:10,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=1976436.0, ans=0.2 2023-06-25 11:28:12,385 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-25 11:28:20,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1976496.0, ans=0.0 2023-06-25 11:28:28,026 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.07 vs. 
limit=6.0 2023-06-25 11:29:25,317 INFO [train.py:996] (2/4) Epoch 11, batch 24500, loss[loss=0.2277, simple_loss=0.3053, pruned_loss=0.07503, over 21162.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3152, pruned_loss=0.07982, over 4286311.10 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:29:34,925 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.743e+02 7.294e+02 9.026e+02 1.332e+03 4.707e+03, threshold=1.805e+03, percent-clipped=7.0 2023-06-25 11:29:50,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.16 vs. limit=22.5 2023-06-25 11:29:58,075 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1976736.0, ans=0.2 2023-06-25 11:31:11,555 INFO [train.py:996] (2/4) Epoch 11, batch 24550, loss[loss=0.3256, simple_loss=0.3922, pruned_loss=0.1295, over 21835.00 frames. ], tot_loss[loss=0.241, simple_loss=0.318, pruned_loss=0.08202, over 4283573.84 frames. ], batch size: 124, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:31:20,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1976976.0, ans=0.1 2023-06-25 11:32:11,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1977156.0, ans=0.125 2023-06-25 11:32:51,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=1977216.0, ans=0.125 2023-06-25 11:33:02,809 INFO [train.py:996] (2/4) Epoch 11, batch 24600, loss[loss=0.1833, simple_loss=0.2535, pruned_loss=0.05658, over 21269.00 frames. ], tot_loss[loss=0.2403, simple_loss=0.3155, pruned_loss=0.08254, over 4280411.46 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:33:13,016 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.578e+02 9.421e+02 1.303e+03 2.147e+03 3.735e+03, threshold=2.606e+03, percent-clipped=31.0 2023-06-25 11:33:28,556 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-25 11:34:51,944 INFO [train.py:996] (2/4) Epoch 11, batch 24650, loss[loss=0.2006, simple_loss=0.2611, pruned_loss=0.0701, over 21095.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3076, pruned_loss=0.08062, over 4281134.24 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:34:55,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1977576.0, ans=0.1 2023-06-25 11:35:00,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1977576.0, ans=0.1 2023-06-25 11:35:47,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.04 vs. limit=22.5 2023-06-25 11:36:22,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=1977816.0, ans=0.125 2023-06-25 11:36:36,908 INFO [train.py:996] (2/4) Epoch 11, batch 24700, loss[loss=0.2706, simple_loss=0.3236, pruned_loss=0.1088, over 21437.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3044, pruned_loss=0.07954, over 4267780.50 frames. 
], batch size: 509, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:36:37,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1977876.0, ans=0.1 2023-06-25 11:36:45,486 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=1977876.0, ans=0.125 2023-06-25 11:36:46,541 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.759e+02 8.060e+02 1.267e+03 1.761e+03 3.816e+03, threshold=2.533e+03, percent-clipped=4.0 2023-06-25 11:36:52,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1977936.0, ans=0.0 2023-06-25 11:36:55,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1977936.0, ans=0.0 2023-06-25 11:37:03,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=1977936.0, ans=0.0 2023-06-25 11:37:05,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1977936.0, ans=0.125 2023-06-25 11:37:53,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1978056.0, ans=0.1 2023-06-25 11:38:10,541 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=12.0 2023-06-25 11:38:14,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1978116.0, ans=0.0 2023-06-25 11:38:17,804 INFO [train.py:996] (2/4) Epoch 11, batch 24750, loss[loss=0.2047, simple_loss=0.2651, pruned_loss=0.07216, over 21371.00 frames. ], tot_loss[loss=0.2266, simple_loss=0.2985, pruned_loss=0.07733, over 4262573.06 frames. ], batch size: 160, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:38:40,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.max_abs, batch_count=1978236.0, ans=10.0 2023-06-25 11:38:40,933 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1978236.0, ans=0.0 2023-06-25 11:38:44,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.49 vs. limit=15.0 2023-06-25 11:38:56,307 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:39:07,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=1978296.0, ans=0.0 2023-06-25 11:39:42,032 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.90 vs. limit=15.0 2023-06-25 11:39:50,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1978416.0, ans=0.1 2023-06-25 11:39:54,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1978416.0, ans=0.125 2023-06-25 11:39:57,334 INFO [train.py:996] (2/4) Epoch 11, batch 24800, loss[loss=0.2136, simple_loss=0.2843, pruned_loss=0.07145, over 21836.00 frames. 
], tot_loss[loss=0.2236, simple_loss=0.2936, pruned_loss=0.07681, over 4253908.42 frames. ], batch size: 298, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 11:40:14,471 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.678e+02 6.530e+02 8.946e+02 1.365e+03 3.586e+03, threshold=1.789e+03, percent-clipped=4.0 2023-06-25 11:40:29,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1978536.0, ans=0.125 2023-06-25 11:41:17,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1978656.0, ans=0.04949747468305833 2023-06-25 11:41:48,330 INFO [train.py:996] (2/4) Epoch 11, batch 24850, loss[loss=0.2812, simple_loss=0.3537, pruned_loss=0.1043, over 21557.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.2947, pruned_loss=0.07862, over 4261861.87 frames. ], batch size: 508, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:41:59,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1978776.0, ans=0.125 2023-06-25 11:42:25,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1978836.0, ans=0.125 2023-06-25 11:42:27,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1978896.0, ans=0.125 2023-06-25 11:43:00,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1978956.0, ans=0.125 2023-06-25 11:43:27,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=1979016.0, ans=0.2 2023-06-25 11:43:30,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=1979016.0, ans=0.0 2023-06-25 11:43:34,974 INFO [train.py:996] (2/4) Epoch 11, batch 24900, loss[loss=0.2023, simple_loss=0.2545, pruned_loss=0.07503, over 21343.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.2965, pruned_loss=0.07911, over 4267526.42 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:43:47,681 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.897e+02 9.704e+02 1.419e+03 1.998e+03 4.449e+03, threshold=2.839e+03, percent-clipped=31.0 2023-06-25 11:43:49,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1979136.0, ans=0.125 2023-06-25 11:44:09,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=1979136.0, ans=0.2 2023-06-25 11:44:17,616 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 11:44:52,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1979316.0, ans=0.125 2023-06-25 11:45:15,024 INFO [train.py:996] (2/4) Epoch 11, batch 24950, loss[loss=0.3035, simple_loss=0.366, pruned_loss=0.1205, over 21821.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3061, pruned_loss=0.08368, over 4269963.56 frames. 
], batch size: 441, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:45:27,760 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-25 11:47:01,076 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.27 vs. limit=15.0 2023-06-25 11:47:03,412 INFO [train.py:996] (2/4) Epoch 11, batch 25000, loss[loss=0.1998, simple_loss=0.2712, pruned_loss=0.06418, over 21379.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3111, pruned_loss=0.08448, over 4269001.81 frames. ], batch size: 211, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:47:23,244 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.368e+02 7.420e+02 9.724e+02 1.691e+03 3.300e+03, threshold=1.945e+03, percent-clipped=1.0 2023-06-25 11:48:31,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=1979916.0, ans=0.2 2023-06-25 11:48:48,908 INFO [train.py:996] (2/4) Epoch 11, batch 25050, loss[loss=0.2093, simple_loss=0.2664, pruned_loss=0.07613, over 21481.00 frames. ], tot_loss[loss=0.2362, simple_loss=0.3059, pruned_loss=0.08324, over 4269094.73 frames. ], batch size: 212, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:49:03,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.37 vs. limit=15.0 2023-06-25 11:49:13,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=1980036.0, ans=0.0 2023-06-25 11:49:22,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.07 vs. limit=12.0 2023-06-25 11:49:32,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1980036.0, ans=0.1 2023-06-25 11:49:43,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.94 vs. limit=15.0 2023-06-25 11:49:47,417 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1980096.0, ans=0.125 2023-06-25 11:49:50,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=1980096.0, ans=0.125 2023-06-25 11:50:30,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.57 vs. limit=15.0 2023-06-25 11:50:31,696 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=1980216.0, ans=0.125 2023-06-25 11:50:37,627 INFO [train.py:996] (2/4) Epoch 11, batch 25100, loss[loss=0.2367, simple_loss=0.3257, pruned_loss=0.07385, over 21753.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3018, pruned_loss=0.08177, over 4255291.36 frames. 
], batch size: 316, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:50:39,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=1980276.0, ans=0.125 2023-06-25 11:50:50,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=1980276.0, ans=10.0 2023-06-25 11:50:58,387 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.064e+02 8.337e+02 1.105e+03 1.657e+03 3.761e+03, threshold=2.211e+03, percent-clipped=17.0 2023-06-25 11:51:03,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=1980336.0, ans=0.2 2023-06-25 11:51:24,709 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5 2023-06-25 11:51:30,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.79 vs. limit=15.0 2023-06-25 11:52:19,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1980516.0, ans=0.0 2023-06-25 11:52:22,289 INFO [train.py:996] (2/4) Epoch 11, batch 25150, loss[loss=0.2272, simple_loss=0.3083, pruned_loss=0.07306, over 21222.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3024, pruned_loss=0.07906, over 4262865.74 frames. ], batch size: 159, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:52:27,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1980576.0, ans=0.125 2023-06-25 11:52:58,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1980636.0, ans=0.125 2023-06-25 11:53:15,353 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.61 vs. limit=10.0 2023-06-25 11:53:31,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1980756.0, ans=0.125 2023-06-25 11:54:01,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1980816.0, ans=0.2 2023-06-25 11:54:08,603 INFO [train.py:996] (2/4) Epoch 11, batch 25200, loss[loss=0.2155, simple_loss=0.315, pruned_loss=0.05802, over 21757.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3012, pruned_loss=0.07719, over 4268577.81 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:54:18,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1980876.0, ans=0.125 2023-06-25 11:54:21,787 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.466e+02 1.183e+03 1.682e+03 4.504e+03, threshold=2.365e+03, percent-clipped=14.0 2023-06-25 11:54:24,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.66 vs. 
limit=6.0 2023-06-25 11:55:15,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1981056.0, ans=0.125 2023-06-25 11:55:31,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1981056.0, ans=0.1 2023-06-25 11:55:56,152 INFO [train.py:996] (2/4) Epoch 11, batch 25250, loss[loss=0.202, simple_loss=0.272, pruned_loss=0.06595, over 21594.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2986, pruned_loss=0.07585, over 4271554.11 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:55:59,755 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1981176.0, ans=0.125 2023-06-25 11:56:29,165 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.56 vs. limit=10.0 2023-06-25 11:56:33,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1981236.0, ans=0.1 2023-06-25 11:57:40,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=1981416.0, ans=0.0 2023-06-25 11:57:44,423 INFO [train.py:996] (2/4) Epoch 11, batch 25300, loss[loss=0.228, simple_loss=0.3006, pruned_loss=0.07764, over 21506.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2975, pruned_loss=0.07468, over 4262060.35 frames. ], batch size: 194, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 11:57:51,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=1981476.0, ans=0.2 2023-06-25 11:57:57,589 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.052e+02 7.897e+02 1.317e+03 1.738e+03 3.362e+03, threshold=2.634e+03, percent-clipped=11.0 2023-06-25 11:57:59,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1981536.0, ans=0.125 2023-06-25 11:59:32,059 INFO [train.py:996] (2/4) Epoch 11, batch 25350, loss[loss=0.2381, simple_loss=0.3084, pruned_loss=0.0839, over 20005.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2989, pruned_loss=0.07393, over 4255956.20 frames. ], batch size: 703, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 11:59:37,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1981776.0, ans=0.125 2023-06-25 11:59:50,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=1981776.0, ans=0.0 2023-06-25 12:00:28,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1981896.0, ans=0.125 2023-06-25 12:00:51,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=1981956.0, ans=0.0 2023-06-25 12:00:57,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1981956.0, ans=0.125 2023-06-25 12:01:06,502 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.80 vs. 
limit=6.0 2023-06-25 12:01:17,471 INFO [train.py:996] (2/4) Epoch 11, batch 25400, loss[loss=0.2645, simple_loss=0.3226, pruned_loss=0.1031, over 21362.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2969, pruned_loss=0.07401, over 4260825.34 frames. ], batch size: 471, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:01:21,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1982076.0, ans=0.0 2023-06-25 12:01:24,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=1982076.0, ans=0.125 2023-06-25 12:01:27,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1982076.0, ans=0.07 2023-06-25 12:01:37,980 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.367e+02 9.282e+02 1.307e+03 1.888e+03 3.568e+03, threshold=2.613e+03, percent-clipped=8.0 2023-06-25 12:02:36,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1982256.0, ans=0.0 2023-06-25 12:02:47,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=1982316.0, ans=0.0 2023-06-25 12:03:02,662 INFO [train.py:996] (2/4) Epoch 11, batch 25450, loss[loss=0.2487, simple_loss=0.3348, pruned_loss=0.08135, over 21677.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2977, pruned_loss=0.07518, over 4266381.60 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:03:08,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1982376.0, ans=0.2 2023-06-25 12:04:12,049 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.91 vs. limit=15.0 2023-06-25 12:04:49,928 INFO [train.py:996] (2/4) Epoch 11, batch 25500, loss[loss=0.2739, simple_loss=0.3534, pruned_loss=0.09714, over 21204.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.298, pruned_loss=0.07227, over 4256663.53 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:05:10,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.710e+02 7.694e+02 1.169e+03 1.712e+03 3.614e+03, threshold=2.338e+03, percent-clipped=5.0 2023-06-25 12:05:21,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=1982736.0, ans=0.0 2023-06-25 12:05:43,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.96 vs. limit=10.0 2023-06-25 12:06:33,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1982976.0, ans=0.125 2023-06-25 12:06:34,912 INFO [train.py:996] (2/4) Epoch 11, batch 25550, loss[loss=0.1963, simple_loss=0.2638, pruned_loss=0.0644, over 16389.00 frames. ], tot_loss[loss=0.2237, simple_loss=0.3036, pruned_loss=0.07191, over 4251565.57 frames. ], batch size: 62, lr: 2.61e-03, grad_scale: 8.0 2023-06-25 12:06:59,895 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:07:01,902 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=10.50 vs. 
limit=15.0 2023-06-25 12:07:09,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=1983036.0, ans=0.125 2023-06-25 12:07:56,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=1983156.0, ans=0.0 2023-06-25 12:08:09,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=22.5 2023-06-25 12:08:38,083 INFO [train.py:996] (2/4) Epoch 11, batch 25600, loss[loss=0.213, simple_loss=0.3015, pruned_loss=0.06229, over 19847.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3071, pruned_loss=0.07168, over 4244103.80 frames. ], batch size: 702, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:08:46,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1983276.0, ans=0.125 2023-06-25 12:08:52,919 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.144e+02 7.738e+02 1.030e+03 1.718e+03 3.511e+03, threshold=2.059e+03, percent-clipped=11.0 2023-06-25 12:09:03,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.61 vs. limit=15.0 2023-06-25 12:09:53,790 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1983516.0, ans=0.125 2023-06-25 12:10:22,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1983576.0, ans=0.1 2023-06-25 12:10:24,077 INFO [train.py:996] (2/4) Epoch 11, batch 25650, loss[loss=0.2859, simple_loss=0.4215, pruned_loss=0.07518, over 19678.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3087, pruned_loss=0.07428, over 4246390.08 frames. ], batch size: 702, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:10:31,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1983576.0, ans=0.125 2023-06-25 12:11:39,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1983756.0, ans=0.125 2023-06-25 12:11:40,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1983756.0, ans=0.125 2023-06-25 12:11:43,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=1983816.0, ans=0.0 2023-06-25 12:12:11,592 INFO [train.py:996] (2/4) Epoch 11, batch 25700, loss[loss=0.2045, simple_loss=0.273, pruned_loss=0.06803, over 21887.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.305, pruned_loss=0.07575, over 4248953.77 frames. ], batch size: 316, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:12:38,811 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.851e+02 8.245e+02 1.134e+03 1.562e+03 3.915e+03, threshold=2.269e+03, percent-clipped=11.0 2023-06-25 12:12:48,397 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.46 vs. 
limit=15.0 2023-06-25 12:12:50,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1983936.0, ans=0.125 2023-06-25 12:13:23,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=1984056.0, ans=0.2 2023-06-25 12:13:49,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1984116.0, ans=0.1 2023-06-25 12:14:00,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.55 vs. limit=15.0 2023-06-25 12:14:01,417 INFO [train.py:996] (2/4) Epoch 11, batch 25750, loss[loss=0.2501, simple_loss=0.3189, pruned_loss=0.09066, over 20736.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3098, pruned_loss=0.07838, over 4257155.99 frames. ], batch size: 608, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:14:42,523 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1984296.0, ans=0.125 2023-06-25 12:15:56,815 INFO [train.py:996] (2/4) Epoch 11, batch 25800, loss[loss=0.2721, simple_loss=0.377, pruned_loss=0.08358, over 20734.00 frames. ], tot_loss[loss=0.2445, simple_loss=0.3237, pruned_loss=0.08264, over 4256820.85 frames. ], batch size: 607, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:16:12,228 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.210e+02 8.965e+02 1.490e+03 2.590e+03 4.866e+03, threshold=2.981e+03, percent-clipped=29.0 2023-06-25 12:16:17,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1984536.0, ans=0.0 2023-06-25 12:16:51,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1984596.0, ans=0.0 2023-06-25 12:17:30,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=1984716.0, ans=0.0 2023-06-25 12:17:45,334 INFO [train.py:996] (2/4) Epoch 11, batch 25850, loss[loss=0.2268, simple_loss=0.304, pruned_loss=0.07482, over 21781.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3253, pruned_loss=0.08267, over 4255475.51 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:18:00,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1984836.0, ans=0.125 2023-06-25 12:18:56,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1984956.0, ans=0.125 2023-06-25 12:19:33,832 INFO [train.py:996] (2/4) Epoch 11, batch 25900, loss[loss=0.2657, simple_loss=0.3611, pruned_loss=0.08515, over 21844.00 frames. ], tot_loss[loss=0.247, simple_loss=0.3268, pruned_loss=0.08357, over 4264030.68 frames. 
], batch size: 316, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:19:41,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1985076.0, ans=0.125 2023-06-25 12:19:54,637 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.078e+02 8.473e+02 1.223e+03 1.634e+03 2.981e+03, threshold=2.447e+03, percent-clipped=0.0 2023-06-25 12:19:55,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1985136.0, ans=0.125 2023-06-25 12:20:58,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=1985256.0, ans=0.125 2023-06-25 12:21:00,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1985256.0, ans=0.1 2023-06-25 12:21:03,824 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.67 vs. limit=12.0 2023-06-25 12:21:21,876 INFO [train.py:996] (2/4) Epoch 11, batch 25950, loss[loss=0.2566, simple_loss=0.3354, pruned_loss=0.08894, over 21791.00 frames. ], tot_loss[loss=0.2514, simple_loss=0.3317, pruned_loss=0.08553, over 4268890.21 frames. ], batch size: 124, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:22:08,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=1985436.0, ans=0.0 2023-06-25 12:23:18,147 INFO [train.py:996] (2/4) Epoch 11, batch 26000, loss[loss=0.2422, simple_loss=0.3226, pruned_loss=0.08092, over 21319.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3305, pruned_loss=0.08416, over 4268863.89 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:23:28,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1985676.0, ans=0.125 2023-06-25 12:23:36,643 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1985676.0, ans=0.125 2023-06-25 12:23:40,955 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 7.867e+02 1.001e+03 1.506e+03 3.925e+03, threshold=2.003e+03, percent-clipped=6.0 2023-06-25 12:23:53,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=1985736.0, ans=0.2 2023-06-25 12:23:53,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1985736.0, ans=0.125 2023-06-25 12:24:04,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1985736.0, ans=0.1 2023-06-25 12:24:07,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1985796.0, ans=0.125 2023-06-25 12:24:32,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.44 vs. 
limit=15.0 2023-06-25 12:24:46,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=1985916.0, ans=0.5 2023-06-25 12:24:47,820 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1985916.0, ans=0.125 2023-06-25 12:25:03,150 INFO [train.py:996] (2/4) Epoch 11, batch 26050, loss[loss=0.2288, simple_loss=0.2921, pruned_loss=0.08278, over 21276.00 frames. ], tot_loss[loss=0.2486, simple_loss=0.3291, pruned_loss=0.08404, over 4272114.55 frames. ], batch size: 143, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:25:22,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1985976.0, ans=0.0 2023-06-25 12:26:31,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1986216.0, ans=0.0 2023-06-25 12:26:47,666 INFO [train.py:996] (2/4) Epoch 11, batch 26100, loss[loss=0.2201, simple_loss=0.2841, pruned_loss=0.07802, over 21837.00 frames. ], tot_loss[loss=0.2461, simple_loss=0.3242, pruned_loss=0.08401, over 4270696.42 frames. ], batch size: 247, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:27:09,752 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.534e+02 7.535e+02 1.106e+03 1.701e+03 2.759e+03, threshold=2.213e+03, percent-clipped=15.0 2023-06-25 12:27:31,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1986336.0, ans=0.04949747468305833 2023-06-25 12:28:04,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=1986456.0, ans=0.1 2023-06-25 12:28:07,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1986456.0, ans=0.125 2023-06-25 12:28:26,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=1986516.0, ans=0.125 2023-06-25 12:28:38,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1986576.0, ans=0.125 2023-06-25 12:28:39,861 INFO [train.py:996] (2/4) Epoch 11, batch 26150, loss[loss=0.2535, simple_loss=0.3237, pruned_loss=0.09163, over 21794.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3215, pruned_loss=0.08484, over 4278201.62 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:28:43,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=1986576.0, ans=0.025 2023-06-25 12:28:52,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1986576.0, ans=0.125 2023-06-25 12:29:11,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1986636.0, ans=0.125 2023-06-25 12:29:35,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1986696.0, ans=0.1 2023-06-25 12:30:26,202 INFO [train.py:996] (2/4) Epoch 11, batch 26200, loss[loss=0.2311, simple_loss=0.342, pruned_loss=0.06006, over 21599.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3228, pruned_loss=0.08318, over 4282680.65 frames. 
], batch size: 389, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:30:47,602 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=1986876.0, ans=0.125 2023-06-25 12:30:53,723 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.107e+02 7.785e+02 1.042e+03 1.454e+03 3.867e+03, threshold=2.084e+03, percent-clipped=10.0 2023-06-25 12:30:57,339 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1986936.0, ans=0.125 2023-06-25 12:31:25,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=1987056.0, ans=0.125 2023-06-25 12:31:26,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1987056.0, ans=0.125 2023-06-25 12:31:37,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1987056.0, ans=0.0 2023-06-25 12:31:49,222 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-25 12:31:59,660 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:32:10,634 INFO [train.py:996] (2/4) Epoch 11, batch 26250, loss[loss=0.235, simple_loss=0.3062, pruned_loss=0.08187, over 21251.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3258, pruned_loss=0.08174, over 4289777.22 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:32:47,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=1987236.0, ans=0.125 2023-06-25 12:32:51,750 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.79 vs. limit=15.0 2023-06-25 12:33:08,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. limit=12.0 2023-06-25 12:33:09,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.min_positive, batch_count=1987296.0, ans=0.025 2023-06-25 12:33:16,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=1987356.0, ans=0.125 2023-06-25 12:33:18,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=1987356.0, ans=0.125 2023-06-25 12:34:04,631 INFO [train.py:996] (2/4) Epoch 11, batch 26300, loss[loss=0.2534, simple_loss=0.3261, pruned_loss=0.09032, over 21471.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3226, pruned_loss=0.08332, over 4296542.90 frames. ], batch size: 131, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:34:13,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1987476.0, ans=0.1 2023-06-25 12:34:26,034 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.433e+02 7.825e+02 1.057e+03 1.626e+03 4.026e+03, threshold=2.114e+03, percent-clipped=11.0 2023-06-25 12:34:36,595 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.51 vs. 
limit=12.0 2023-06-25 12:34:56,822 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.77 vs. limit=15.0 2023-06-25 12:35:49,278 INFO [train.py:996] (2/4) Epoch 11, batch 26350, loss[loss=0.2461, simple_loss=0.3221, pruned_loss=0.08504, over 21741.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3218, pruned_loss=0.0843, over 4296052.69 frames. ], batch size: 332, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:35:51,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1987776.0, ans=0.1 2023-06-25 12:36:17,685 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1987836.0, ans=0.125 2023-06-25 12:37:08,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1988016.0, ans=0.125 2023-06-25 12:37:12,166 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:37:31,594 INFO [train.py:996] (2/4) Epoch 11, batch 26400, loss[loss=0.2149, simple_loss=0.2747, pruned_loss=0.07752, over 21810.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3166, pruned_loss=0.08488, over 4294943.66 frames. ], batch size: 352, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:37:38,803 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:37:50,226 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.280e+02 8.065e+02 9.903e+02 1.362e+03 2.931e+03, threshold=1.981e+03, percent-clipped=5.0 2023-06-25 12:38:06,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1988196.0, ans=0.125 2023-06-25 12:38:52,356 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1988256.0, ans=0.0 2023-06-25 12:39:06,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1988316.0, ans=0.125 2023-06-25 12:39:22,502 INFO [train.py:996] (2/4) Epoch 11, batch 26450, loss[loss=0.2564, simple_loss=0.3593, pruned_loss=0.07678, over 21642.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3165, pruned_loss=0.08438, over 4284968.35 frames. ], batch size: 389, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:41:04,373 INFO [train.py:996] (2/4) Epoch 11, batch 26500, loss[loss=0.2766, simple_loss=0.3619, pruned_loss=0.09566, over 21714.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3166, pruned_loss=0.08271, over 4272913.59 frames. ], batch size: 414, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:41:34,314 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.135e+02 9.452e+02 1.417e+03 2.268e+03 5.584e+03, threshold=2.834e+03, percent-clipped=34.0 2023-06-25 12:41:45,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=1988736.0, ans=0.0 2023-06-25 12:41:49,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.49 vs. 
limit=15.0 2023-06-25 12:42:19,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=1988856.0, ans=0.0 2023-06-25 12:42:40,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1988916.0, ans=0.125 2023-06-25 12:42:55,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=1988976.0, ans=0.0 2023-06-25 12:43:02,812 INFO [train.py:996] (2/4) Epoch 11, batch 26550, loss[loss=0.2801, simple_loss=0.3672, pruned_loss=0.09657, over 21520.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3148, pruned_loss=0.08005, over 4269350.42 frames. ], batch size: 508, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:43:32,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1989036.0, ans=0.1 2023-06-25 12:43:48,694 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:43:51,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1989096.0, ans=0.125 2023-06-25 12:44:01,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=1989096.0, ans=0.0 2023-06-25 12:44:05,235 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=2.98 vs. limit=15.0 2023-06-25 12:44:17,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=1989156.0, ans=0.0 2023-06-25 12:44:29,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.31 vs. limit=22.5 2023-06-25 12:44:54,992 INFO [train.py:996] (2/4) Epoch 11, batch 26600, loss[loss=0.1837, simple_loss=0.2587, pruned_loss=0.05437, over 20747.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3143, pruned_loss=0.07665, over 4263775.29 frames. ], batch size: 608, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:44:55,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1989276.0, ans=0.1 2023-06-25 12:44:55,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1989276.0, ans=0.2 2023-06-25 12:45:18,700 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.580e+02 8.378e+02 1.280e+03 1.887e+03 4.610e+03, threshold=2.560e+03, percent-clipped=7.0 2023-06-25 12:45:20,504 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 12:45:35,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.12 vs. limit=6.0 2023-06-25 12:45:50,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. 
limit=22.5 2023-06-25 12:46:19,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1989516.0, ans=0.1 2023-06-25 12:46:36,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1989516.0, ans=0.125 2023-06-25 12:46:41,105 INFO [train.py:996] (2/4) Epoch 11, batch 26650, loss[loss=0.1759, simple_loss=0.2498, pruned_loss=0.05099, over 21542.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3064, pruned_loss=0.075, over 4262229.32 frames. ], batch size: 195, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:47:02,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1989636.0, ans=0.0 2023-06-25 12:47:42,471 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.06 vs. limit=6.0 2023-06-25 12:47:45,146 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1989756.0, ans=0.1 2023-06-25 12:47:59,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=1989816.0, ans=0.2 2023-06-25 12:48:26,207 INFO [train.py:996] (2/4) Epoch 11, batch 26700, loss[loss=0.1978, simple_loss=0.2644, pruned_loss=0.06562, over 21199.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.2987, pruned_loss=0.07181, over 4261852.69 frames. ], batch size: 176, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:48:49,818 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.073e+02 6.649e+02 8.698e+02 1.295e+03 2.536e+03, threshold=1.740e+03, percent-clipped=0.0 2023-06-25 12:49:13,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1989996.0, ans=0.125 2023-06-25 12:50:13,219 INFO [train.py:996] (2/4) Epoch 11, batch 26750, loss[loss=0.2514, simple_loss=0.3325, pruned_loss=0.08509, over 21724.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2996, pruned_loss=0.0707, over 4275711.55 frames. ], batch size: 441, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:51:54,751 INFO [train.py:996] (2/4) Epoch 11, batch 26800, loss[loss=0.2257, simple_loss=0.2989, pruned_loss=0.07621, over 20023.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3078, pruned_loss=0.07527, over 4275800.52 frames. ], batch size: 703, lr: 2.61e-03, grad_scale: 32.0 2023-06-25 12:52:10,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1990536.0, ans=0.125 2023-06-25 12:52:12,578 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1990536.0, ans=0.125 2023-06-25 12:52:15,287 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.285e+02 8.736e+02 1.158e+03 1.774e+03 3.470e+03, threshold=2.315e+03, percent-clipped=25.0 2023-06-25 12:52:29,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1990536.0, ans=0.125 2023-06-25 12:52:36,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.29 vs. 
limit=15.0 2023-06-25 12:53:02,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1990596.0, ans=0.125 2023-06-25 12:53:26,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1990716.0, ans=0.0 2023-06-25 12:53:36,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=1990716.0, ans=0.0 2023-06-25 12:53:42,506 INFO [train.py:996] (2/4) Epoch 11, batch 26850, loss[loss=0.1849, simple_loss=0.2481, pruned_loss=0.06079, over 21693.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3088, pruned_loss=0.07818, over 4282111.52 frames. ], batch size: 282, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:54:01,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1990836.0, ans=0.125 2023-06-25 12:54:04,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=1990836.0, ans=0.0 2023-06-25 12:54:13,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=1990836.0, ans=0.0 2023-06-25 12:54:48,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.64 vs. limit=15.0 2023-06-25 12:54:58,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1990956.0, ans=0.2 2023-06-25 12:55:08,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=1990956.0, ans=0.04949747468305833 2023-06-25 12:55:17,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1991016.0, ans=0.125 2023-06-25 12:55:27,795 INFO [train.py:996] (2/4) Epoch 11, batch 26900, loss[loss=0.229, simple_loss=0.2908, pruned_loss=0.08357, over 21913.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3011, pruned_loss=0.07789, over 4279507.22 frames. ], batch size: 125, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:55:47,030 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.238e+02 7.282e+02 8.869e+02 1.344e+03 2.683e+03, threshold=1.774e+03, percent-clipped=1.0 2023-06-25 12:55:48,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=1991136.0, ans=0.07 2023-06-25 12:55:53,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=1991136.0, ans=0.2 2023-06-25 12:56:01,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.20 vs. limit=15.0 2023-06-25 12:56:38,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=1991256.0, ans=0.125 2023-06-25 12:57:01,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1991316.0, ans=0.0 2023-06-25 12:57:13,445 INFO [train.py:996] (2/4) Epoch 11, batch 26950, loss[loss=0.236, simple_loss=0.3166, pruned_loss=0.07764, over 21742.00 frames. 
], tot_loss[loss=0.2286, simple_loss=0.301, pruned_loss=0.07814, over 4277346.48 frames. ], batch size: 351, lr: 2.61e-03, grad_scale: 16.0 2023-06-25 12:57:22,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1991376.0, ans=0.125 2023-06-25 12:57:23,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=1991376.0, ans=0.125 2023-06-25 12:57:23,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1991376.0, ans=0.125 2023-06-25 12:57:46,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1991436.0, ans=0.125 2023-06-25 12:57:50,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1991496.0, ans=0.125 2023-06-25 12:58:37,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1991556.0, ans=0.125 2023-06-25 12:58:46,706 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.25 vs. limit=15.0 2023-06-25 12:58:51,135 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=1991616.0, ans=0.0 2023-06-25 12:58:59,442 INFO [train.py:996] (2/4) Epoch 11, batch 27000, loss[loss=0.2885, simple_loss=0.3657, pruned_loss=0.1057, over 21453.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3015, pruned_loss=0.07569, over 4277322.47 frames. ], batch size: 508, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 12:58:59,442 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 12:59:16,969 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.235, simple_loss=0.334, pruned_loss=0.06803, over 1796401.00 frames. 2023-06-25 12:59:16,970 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 12:59:18,344 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.04 vs. limit=15.0 2023-06-25 12:59:39,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1991736.0, ans=0.1 2023-06-25 12:59:55,904 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.981e+02 9.015e+02 1.282e+03 1.827e+03 4.662e+03, threshold=2.565e+03, percent-clipped=27.0 2023-06-25 13:00:38,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=1991856.0, ans=0.0 2023-06-25 13:01:06,433 INFO [train.py:996] (2/4) Epoch 11, batch 27050, loss[loss=0.2633, simple_loss=0.3324, pruned_loss=0.09715, over 21634.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.3046, pruned_loss=0.07285, over 4278931.06 frames. 
], batch size: 507, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:01:12,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1991976.0, ans=0.1 2023-06-25 13:01:38,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=1992036.0, ans=0.2 2023-06-25 13:01:41,616 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=1992036.0, ans=0.0 2023-06-25 13:02:48,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=1992216.0, ans=0.125 2023-06-25 13:02:53,332 INFO [train.py:996] (2/4) Epoch 11, batch 27100, loss[loss=0.2192, simple_loss=0.3217, pruned_loss=0.0584, over 21638.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3055, pruned_loss=0.07344, over 4281189.73 frames. ], batch size: 230, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:02:58,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=1992276.0, ans=0.125 2023-06-25 13:03:27,669 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.097e+02 9.568e+02 1.359e+03 2.016e+03 3.804e+03, threshold=2.717e+03, percent-clipped=7.0 2023-06-25 13:04:41,018 INFO [train.py:996] (2/4) Epoch 11, batch 27150, loss[loss=0.2824, simple_loss=0.3908, pruned_loss=0.08697, over 21262.00 frames. ], tot_loss[loss=0.2367, simple_loss=0.3187, pruned_loss=0.07738, over 4283489.65 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:04:43,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=1992576.0, ans=0.0 2023-06-25 13:05:50,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.50 vs. limit=15.0 2023-06-25 13:05:58,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1992756.0, ans=0.0 2023-06-25 13:06:08,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=1992816.0, ans=15.0 2023-06-25 13:06:18,846 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1992816.0, ans=0.125 2023-06-25 13:06:27,146 INFO [train.py:996] (2/4) Epoch 11, batch 27200, loss[loss=0.3195, simple_loss=0.3875, pruned_loss=0.1258, over 21590.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3264, pruned_loss=0.08045, over 4283715.81 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:06:48,955 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-25 13:07:01,199 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.854e+02 1.107e+03 1.912e+03 4.473e+03, threshold=2.214e+03, percent-clipped=8.0 2023-06-25 13:08:18,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=1993116.0, ans=0.125 2023-06-25 13:08:27,014 INFO [train.py:996] (2/4) Epoch 11, batch 27250, loss[loss=0.2535, simple_loss=0.3206, pruned_loss=0.09318, over 20038.00 frames. 
], tot_loss[loss=0.2475, simple_loss=0.3281, pruned_loss=0.08349, over 4284158.32 frames. ], batch size: 703, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:08:53,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=1993236.0, ans=0.125 2023-06-25 13:10:18,677 INFO [train.py:996] (2/4) Epoch 11, batch 27300, loss[loss=0.226, simple_loss=0.3152, pruned_loss=0.06842, over 21955.00 frames. ], tot_loss[loss=0.2495, simple_loss=0.3294, pruned_loss=0.08484, over 4278312.19 frames. ], batch size: 317, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:10:19,667 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.81 vs. limit=10.0 2023-06-25 13:10:46,365 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.500e+02 8.064e+02 1.048e+03 1.560e+03 3.072e+03, threshold=2.097e+03, percent-clipped=8.0 2023-06-25 13:12:04,732 INFO [train.py:996] (2/4) Epoch 11, batch 27350, loss[loss=0.2301, simple_loss=0.3112, pruned_loss=0.07447, over 21678.00 frames. ], tot_loss[loss=0.2529, simple_loss=0.3325, pruned_loss=0.08666, over 4271579.28 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:12:09,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=1993776.0, ans=0.125 2023-06-25 13:12:22,133 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:12:28,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1993836.0, ans=0.0 2023-06-25 13:12:35,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=1993836.0, ans=0.0 2023-06-25 13:13:41,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1994016.0, ans=0.125 2023-06-25 13:13:50,508 INFO [train.py:996] (2/4) Epoch 11, batch 27400, loss[loss=0.2129, simple_loss=0.2766, pruned_loss=0.07457, over 21329.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3275, pruned_loss=0.08556, over 4270826.86 frames. 
], batch size: 176, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:13:57,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1994076.0, ans=0.1 2023-06-25 13:14:13,211 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:14:17,599 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.960e+02 7.617e+02 1.033e+03 1.386e+03 3.217e+03, threshold=2.066e+03, percent-clipped=9.0 2023-06-25 13:14:18,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=1994136.0, ans=0.5 2023-06-25 13:14:49,067 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1994196.0, ans=0.0 2023-06-25 13:15:05,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1994256.0, ans=0.1 2023-06-25 13:15:07,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1994256.0, ans=0.0 2023-06-25 13:15:14,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=1994256.0, ans=0.0 2023-06-25 13:15:37,971 INFO [train.py:996] (2/4) Epoch 11, batch 27450, loss[loss=0.2275, simple_loss=0.312, pruned_loss=0.0715, over 21736.00 frames. ], tot_loss[loss=0.2427, simple_loss=0.3192, pruned_loss=0.08307, over 4263122.78 frames. ], batch size: 282, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:16:10,439 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-25 13:16:19,152 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.20 vs. limit=15.0 2023-06-25 13:16:20,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=1994496.0, ans=0.2 2023-06-25 13:17:01,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1994556.0, ans=0.125 2023-06-25 13:17:21,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=1994616.0, ans=0.2 2023-06-25 13:17:23,736 INFO [train.py:996] (2/4) Epoch 11, batch 27500, loss[loss=0.2398, simple_loss=0.2985, pruned_loss=0.09053, over 21578.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3173, pruned_loss=0.08283, over 4269602.45 frames. ], batch size: 548, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:17:49,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1994736.0, ans=0.125 2023-06-25 13:17:51,914 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.938e+02 7.259e+02 1.005e+03 1.389e+03 2.816e+03, threshold=2.010e+03, percent-clipped=4.0 2023-06-25 13:18:57,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-25 13:19:07,756 INFO [train.py:996] (2/4) Epoch 11, batch 27550, loss[loss=0.3202, simple_loss=0.4334, pruned_loss=0.1035, over 19874.00 frames. 
], tot_loss[loss=0.2362, simple_loss=0.3125, pruned_loss=0.07997, over 4268104.42 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:20:01,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.15 vs. limit=15.0 2023-06-25 13:20:14,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=1995156.0, ans=0.125 2023-06-25 13:20:54,516 INFO [train.py:996] (2/4) Epoch 11, batch 27600, loss[loss=0.2183, simple_loss=0.2822, pruned_loss=0.07721, over 21803.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3046, pruned_loss=0.07862, over 4274610.73 frames. ], batch size: 102, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:21:17,404 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.292e+02 9.329e+02 1.469e+03 1.993e+03 3.791e+03, threshold=2.938e+03, percent-clipped=25.0 2023-06-25 13:21:41,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1995396.0, ans=0.125 2023-06-25 13:21:46,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=1995396.0, ans=0.125 2023-06-25 13:21:49,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1995396.0, ans=0.1 2023-06-25 13:22:01,184 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.44 vs. limit=10.0 2023-06-25 13:22:27,943 INFO [train.py:996] (2/4) Epoch 11, batch 27650, loss[loss=0.2229, simple_loss=0.2775, pruned_loss=0.08413, over 21220.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3001, pruned_loss=0.07874, over 4258948.23 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:22:47,495 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=15.0 2023-06-25 13:24:19,713 INFO [train.py:996] (2/4) Epoch 11, batch 27700, loss[loss=0.2783, simple_loss=0.3575, pruned_loss=0.09958, over 21788.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.301, pruned_loss=0.07697, over 4266592.14 frames. ], batch size: 316, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:24:43,459 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.775e+02 8.277e+02 1.271e+03 1.738e+03 3.564e+03, threshold=2.542e+03, percent-clipped=2.0 2023-06-25 13:25:33,423 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.30 vs. limit=6.0 2023-06-25 13:26:05,562 INFO [train.py:996] (2/4) Epoch 11, batch 27750, loss[loss=0.2213, simple_loss=0.3088, pruned_loss=0.06687, over 21704.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3059, pruned_loss=0.0771, over 4269340.45 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:27:11,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.27 vs. 
limit=12.0 2023-06-25 13:27:17,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=1996356.0, ans=0.0 2023-06-25 13:27:32,531 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-25 13:27:43,137 INFO [train.py:996] (2/4) Epoch 11, batch 27800, loss[loss=0.257, simple_loss=0.3191, pruned_loss=0.09743, over 21648.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3049, pruned_loss=0.07775, over 4276328.80 frames. ], batch size: 471, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:27:58,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=1996476.0, ans=0.0 2023-06-25 13:28:00,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=1996476.0, ans=0.0 2023-06-25 13:28:10,889 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.247e+02 7.249e+02 9.541e+02 1.506e+03 2.955e+03, threshold=1.908e+03, percent-clipped=10.0 2023-06-25 13:29:13,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=1996716.0, ans=0.2 2023-06-25 13:29:27,196 INFO [train.py:996] (2/4) Epoch 11, batch 27850, loss[loss=0.2344, simple_loss=0.307, pruned_loss=0.08091, over 21829.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3045, pruned_loss=0.07934, over 4280476.45 frames. ], batch size: 118, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:31:08,117 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:31:17,903 INFO [train.py:996] (2/4) Epoch 11, batch 27900, loss[loss=0.2429, simple_loss=0.3391, pruned_loss=0.07335, over 21841.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3142, pruned_loss=0.08088, over 4272210.56 frames. ], batch size: 316, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:31:45,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=1997136.0, ans=0.125 2023-06-25 13:31:53,157 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 7.594e+02 1.073e+03 1.549e+03 3.110e+03, threshold=2.145e+03, percent-clipped=9.0 2023-06-25 13:32:27,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1997256.0, ans=0.1 2023-06-25 13:33:12,651 INFO [train.py:996] (2/4) Epoch 11, batch 27950, loss[loss=0.1952, simple_loss=0.2794, pruned_loss=0.05547, over 21419.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3125, pruned_loss=0.07672, over 4277119.79 frames. 
], batch size: 194, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:33:28,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=1997376.0, ans=0.125 2023-06-25 13:33:28,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=1997376.0, ans=0.2 2023-06-25 13:33:49,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=1997436.0, ans=0.125 2023-06-25 13:33:51,274 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.83 vs. limit=15.0 2023-06-25 13:34:39,679 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1997616.0, ans=0.1 2023-06-25 13:34:57,424 INFO [train.py:996] (2/4) Epoch 11, batch 28000, loss[loss=0.2527, simple_loss=0.3226, pruned_loss=0.09142, over 21875.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3108, pruned_loss=0.07553, over 4280871.00 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 13:35:25,418 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.615e+02 8.690e+02 1.335e+03 1.864e+03 4.176e+03, threshold=2.670e+03, percent-clipped=16.0 2023-06-25 13:35:36,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1997796.0, ans=0.125 2023-06-25 13:35:38,738 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-25 13:36:06,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=1997856.0, ans=0.125 2023-06-25 13:36:07,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=1997856.0, ans=0.05 2023-06-25 13:36:49,650 INFO [train.py:996] (2/4) Epoch 11, batch 28050, loss[loss=0.2398, simple_loss=0.3176, pruned_loss=0.08101, over 21839.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3082, pruned_loss=0.07679, over 4283280.72 frames. ], batch size: 371, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 13:37:42,322 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=1998096.0, ans=0.125 2023-06-25 13:37:50,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=1998156.0, ans=0.125 2023-06-25 13:38:00,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=1998156.0, ans=0.1 2023-06-25 13:38:33,483 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=8.51 vs. limit=12.0 2023-06-25 13:38:37,690 INFO [train.py:996] (2/4) Epoch 11, batch 28100, loss[loss=0.2172, simple_loss=0.2813, pruned_loss=0.07654, over 21502.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3084, pruned_loss=0.07648, over 4284652.16 frames. 
], batch size: 441, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:38:39,709 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=1998276.0, ans=0.0 2023-06-25 13:39:01,514 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.345e+02 8.283e+02 1.257e+03 1.912e+03 3.792e+03, threshold=2.513e+03, percent-clipped=5.0 2023-06-25 13:39:33,321 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:39:45,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=1998456.0, ans=0.0 2023-06-25 13:39:47,460 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=1998456.0, ans=0.0 2023-06-25 13:40:22,781 INFO [train.py:996] (2/4) Epoch 11, batch 28150, loss[loss=0.2362, simple_loss=0.2824, pruned_loss=0.09495, over 21296.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3021, pruned_loss=0.07653, over 4286756.52 frames. ], batch size: 144, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:40:53,298 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=1998636.0, ans=0.125 2023-06-25 13:41:25,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1998756.0, ans=0.0 2023-06-25 13:42:06,619 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.40 vs. limit=15.0 2023-06-25 13:42:07,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=1998816.0, ans=0.0 2023-06-25 13:42:10,813 INFO [train.py:996] (2/4) Epoch 11, batch 28200, loss[loss=0.2279, simple_loss=0.3019, pruned_loss=0.07694, over 21694.00 frames. ], tot_loss[loss=0.2269, simple_loss=0.2996, pruned_loss=0.07716, over 4274265.89 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:42:25,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=1998876.0, ans=0.2 2023-06-25 13:42:42,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.056e+02 7.864e+02 1.049e+03 1.647e+03 3.891e+03, threshold=2.099e+03, percent-clipped=11.0 2023-06-25 13:43:04,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.08 vs. limit=12.0 2023-06-25 13:43:57,335 INFO [train.py:996] (2/4) Epoch 11, batch 28250, loss[loss=0.244, simple_loss=0.3049, pruned_loss=0.09156, over 21750.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3023, pruned_loss=0.07989, over 4276429.07 frames. ], batch size: 351, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:45:34,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=1999416.0, ans=0.09899494936611666 2023-06-25 13:45:39,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=1999416.0, ans=0.1 2023-06-25 13:45:45,815 INFO [train.py:996] (2/4) Epoch 11, batch 28300, loss[loss=0.1759, simple_loss=0.2677, pruned_loss=0.04205, over 21638.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3015, pruned_loss=0.0781, over 4252061.39 frames. 
], batch size: 263, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:45:54,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=1999476.0, ans=0.125 2023-06-25 13:46:01,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=1999476.0, ans=0.0 2023-06-25 13:46:15,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=1999536.0, ans=0.0 2023-06-25 13:46:24,389 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.207e+02 7.571e+02 1.027e+03 1.599e+03 2.949e+03, threshold=2.054e+03, percent-clipped=6.0 2023-06-25 13:47:10,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=1999656.0, ans=0.125 2023-06-25 13:47:17,932 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:47:38,011 INFO [train.py:996] (2/4) Epoch 11, batch 28350, loss[loss=0.1843, simple_loss=0.3028, pruned_loss=0.03288, over 20900.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2973, pruned_loss=0.07215, over 4251850.81 frames. ], batch size: 608, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:47:50,375 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=1999776.0, ans=0.125 2023-06-25 13:49:24,421 INFO [train.py:996] (2/4) Epoch 11, batch 28400, loss[loss=0.2515, simple_loss=0.3138, pruned_loss=0.09464, over 21704.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2928, pruned_loss=0.0724, over 4259364.39 frames. ], batch size: 332, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:49:26,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2000076.0, ans=0.125 2023-06-25 13:49:57,006 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.013e+02 9.043e+02 1.474e+03 1.977e+03 3.910e+03, threshold=2.949e+03, percent-clipped=21.0 2023-06-25 13:51:09,801 INFO [train.py:996] (2/4) Epoch 11, batch 28450, loss[loss=0.272, simple_loss=0.3422, pruned_loss=0.1009, over 21880.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2987, pruned_loss=0.07587, over 4261244.93 frames. ], batch size: 118, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 13:51:45,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=2000436.0, ans=15.0 2023-06-25 13:52:01,181 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 13:52:43,715 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-25 13:52:46,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2000616.0, ans=0.0 2023-06-25 13:53:03,108 INFO [train.py:996] (2/4) Epoch 11, batch 28500, loss[loss=0.2628, simple_loss=0.3341, pruned_loss=0.09575, over 21184.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3014, pruned_loss=0.07829, over 4266198.47 frames. 
], batch size: 143, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:53:38,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2000736.0, ans=0.0 2023-06-25 13:53:46,216 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.310e+02 7.652e+02 9.900e+02 1.430e+03 3.378e+03, threshold=1.980e+03, percent-clipped=2.0 2023-06-25 13:53:51,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2000796.0, ans=0.0 2023-06-25 13:54:03,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2000796.0, ans=0.125 2023-06-25 13:54:04,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-25 13:54:27,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2000916.0, ans=0.0 2023-06-25 13:54:51,316 INFO [train.py:996] (2/4) Epoch 11, batch 28550, loss[loss=0.2692, simple_loss=0.3594, pruned_loss=0.08947, over 21269.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3089, pruned_loss=0.08084, over 4272641.07 frames. ], batch size: 176, lr: 2.60e-03, grad_scale: 4.0 2023-06-25 13:55:00,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2000976.0, ans=0.09899494936611666 2023-06-25 13:56:02,927 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.99 vs. limit=10.0 2023-06-25 13:56:44,018 INFO [train.py:996] (2/4) Epoch 11, batch 28600, loss[loss=0.2721, simple_loss=0.3433, pruned_loss=0.1005, over 21564.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3163, pruned_loss=0.08341, over 4274551.98 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 13:57:18,465 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.619e+02 9.941e+02 1.518e+03 3.528e+03, threshold=1.988e+03, percent-clipped=12.0 2023-06-25 13:57:29,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2001396.0, ans=0.125 2023-06-25 13:57:58,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.30 vs. limit=15.0 2023-06-25 13:58:28,360 INFO [train.py:996] (2/4) Epoch 11, batch 28650, loss[loss=0.1948, simple_loss=0.2565, pruned_loss=0.06657, over 21516.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3104, pruned_loss=0.08214, over 4277734.73 frames. ], batch size: 213, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 14:00:16,258 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.81 vs. limit=15.0 2023-06-25 14:00:19,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2001876.0, ans=0.0 2023-06-25 14:00:20,295 INFO [train.py:996] (2/4) Epoch 11, batch 28700, loss[loss=0.2844, simple_loss=0.3509, pruned_loss=0.1089, over 21462.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3094, pruned_loss=0.08305, over 4274785.42 frames. 
], batch size: 471, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 14:00:55,769 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.441e+02 7.291e+02 9.817e+02 1.860e+03 4.444e+03, threshold=1.963e+03, percent-clipped=16.0 2023-06-25 14:01:16,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2001996.0, ans=0.1 2023-06-25 14:01:55,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2002116.0, ans=0.125 2023-06-25 14:02:00,915 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=22.5 2023-06-25 14:02:02,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2002176.0, ans=0.125 2023-06-25 14:02:03,139 INFO [train.py:996] (2/4) Epoch 11, batch 28750, loss[loss=0.2066, simple_loss=0.3038, pruned_loss=0.05468, over 21727.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.309, pruned_loss=0.08294, over 4281012.84 frames. ], batch size: 389, lr: 2.60e-03, grad_scale: 8.0 2023-06-25 14:02:27,694 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2002236.0, ans=0.125 2023-06-25 14:03:08,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2002356.0, ans=0.0 2023-06-25 14:03:29,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer_na.min_abs, batch_count=2002356.0, ans=0.02 2023-06-25 14:03:48,787 INFO [train.py:996] (2/4) Epoch 11, batch 28800, loss[loss=0.3092, simple_loss=0.3725, pruned_loss=0.123, over 21306.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3121, pruned_loss=0.08328, over 4276796.47 frames. ], batch size: 159, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:03:57,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2002476.0, ans=0.0 2023-06-25 14:04:06,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2002476.0, ans=0.0 2023-06-25 14:04:29,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.183e+02 7.612e+02 1.083e+03 1.492e+03 3.378e+03, threshold=2.166e+03, percent-clipped=11.0 2023-06-25 14:04:29,894 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2002536.0, ans=0.0 2023-06-25 14:04:46,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2002596.0, ans=0.125 2023-06-25 14:04:48,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2002656.0, ans=0.2 2023-06-25 14:05:08,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2002656.0, ans=0.0 2023-06-25 14:05:16,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=2002716.0, ans=0.025 2023-06-25 14:05:29,399 INFO [train.py:996] (2/4) Epoch 11, batch 28850, loss[loss=0.2185, simple_loss=0.2687, pruned_loss=0.08418, over 20267.00 frames. 
], tot_loss[loss=0.241, simple_loss=0.3133, pruned_loss=0.08437, over 4278917.89 frames. ], batch size: 702, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:05:47,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2002776.0, ans=0.125 2023-06-25 14:05:54,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=22.5 2023-06-25 14:06:18,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2002896.0, ans=0.0 2023-06-25 14:06:47,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2002956.0, ans=0.125 2023-06-25 14:06:52,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2002956.0, ans=0.125 2023-06-25 14:06:52,733 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2002956.0, ans=0.125 2023-06-25 14:06:54,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2002956.0, ans=0.035 2023-06-25 14:07:12,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2003016.0, ans=0.0 2023-06-25 14:07:22,625 INFO [train.py:996] (2/4) Epoch 11, batch 28900, loss[loss=0.3415, simple_loss=0.3982, pruned_loss=0.1424, over 21575.00 frames. ], tot_loss[loss=0.2433, simple_loss=0.3155, pruned_loss=0.08559, over 4281098.39 frames. ], batch size: 508, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:07:47,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.11 vs. limit=15.0 2023-06-25 14:08:00,100 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.503e+02 7.472e+02 9.917e+02 1.436e+03 2.913e+03, threshold=1.983e+03, percent-clipped=5.0 2023-06-25 14:08:39,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2003256.0, ans=0.125 2023-06-25 14:09:17,152 INFO [train.py:996] (2/4) Epoch 11, batch 28950, loss[loss=0.2474, simple_loss=0.3292, pruned_loss=0.08279, over 21844.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3188, pruned_loss=0.08563, over 4283296.03 frames. ], batch size: 371, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:09:19,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2003376.0, ans=0.125 2023-06-25 14:09:50,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2003436.0, ans=0.2 2023-06-25 14:10:19,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. 
limit=15.0 2023-06-25 14:10:20,924 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2003556.0, ans=0.125 2023-06-25 14:10:46,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2003616.0, ans=0.035 2023-06-25 14:11:05,996 INFO [train.py:996] (2/4) Epoch 11, batch 29000, loss[loss=0.2987, simple_loss=0.3562, pruned_loss=0.1206, over 21334.00 frames. ], tot_loss[loss=0.245, simple_loss=0.322, pruned_loss=0.08404, over 4280585.37 frames. ], batch size: 507, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:11:08,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2003676.0, ans=0.2 2023-06-25 14:11:13,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2003676.0, ans=0.1 2023-06-25 14:11:20,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2003676.0, ans=0.0 2023-06-25 14:11:42,998 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-25 14:11:43,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2003736.0, ans=0.05 2023-06-25 14:11:48,360 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.370e+02 8.594e+02 1.350e+03 2.116e+03 4.440e+03, threshold=2.700e+03, percent-clipped=27.0 2023-06-25 14:11:53,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2003796.0, ans=0.2 2023-06-25 14:11:57,660 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-25 14:12:52,684 INFO [train.py:996] (2/4) Epoch 11, batch 29050, loss[loss=0.2535, simple_loss=0.3169, pruned_loss=0.09512, over 21871.00 frames. ], tot_loss[loss=0.2458, simple_loss=0.3207, pruned_loss=0.08544, over 4278062.96 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:13:01,770 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.50 vs. limit=15.0 2023-06-25 14:13:28,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2004036.0, ans=0.04949747468305833 2023-06-25 14:14:04,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2004156.0, ans=0.125 2023-06-25 14:14:21,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2004216.0, ans=0.0 2023-06-25 14:14:25,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2004216.0, ans=0.125 2023-06-25 14:14:37,977 INFO [train.py:996] (2/4) Epoch 11, batch 29100, loss[loss=0.1846, simple_loss=0.2475, pruned_loss=0.06086, over 21563.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3113, pruned_loss=0.08294, over 4280578.06 frames. 
], batch size: 231, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:14:50,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2004276.0, ans=0.0 2023-06-25 14:15:19,683 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.995e+02 7.507e+02 9.912e+02 1.574e+03 3.418e+03, threshold=1.982e+03, percent-clipped=5.0 2023-06-25 14:15:55,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2004456.0, ans=0.1 2023-06-25 14:16:09,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2004516.0, ans=0.125 2023-06-25 14:16:24,455 INFO [train.py:996] (2/4) Epoch 11, batch 29150, loss[loss=0.2315, simple_loss=0.3112, pruned_loss=0.07587, over 21522.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.31, pruned_loss=0.08165, over 4284201.42 frames. ], batch size: 230, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:17:00,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2004636.0, ans=0.0 2023-06-25 14:17:25,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.38 vs. limit=15.0 2023-06-25 14:17:53,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.48 vs. limit=15.0 2023-06-25 14:18:00,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.70 vs. limit=6.0 2023-06-25 14:18:02,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2004816.0, ans=0.125 2023-06-25 14:18:08,685 INFO [train.py:996] (2/4) Epoch 11, batch 29200, loss[loss=0.2814, simple_loss=0.3412, pruned_loss=0.1108, over 21397.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3057, pruned_loss=0.08108, over 4269960.06 frames. ], batch size: 508, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 14:18:31,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2004876.0, ans=0.125 2023-06-25 14:18:49,090 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.775e+02 8.207e+02 1.113e+03 1.658e+03 3.096e+03, threshold=2.226e+03, percent-clipped=9.0 2023-06-25 14:20:00,622 INFO [train.py:996] (2/4) Epoch 11, batch 29250, loss[loss=0.2436, simple_loss=0.3335, pruned_loss=0.07681, over 21803.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3042, pruned_loss=0.07882, over 4274276.78 frames. ], batch size: 333, lr: 2.60e-03, grad_scale: 32.0 2023-06-25 14:20:58,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2005296.0, ans=0.125 2023-06-25 14:21:14,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.75 vs. 
limit=15.0 2023-06-25 14:21:19,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2005356.0, ans=0.0 2023-06-25 14:21:26,074 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=2005416.0, ans=10.0 2023-06-25 14:21:37,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2005416.0, ans=0.125 2023-06-25 14:21:47,238 INFO [train.py:996] (2/4) Epoch 11, batch 29300, loss[loss=0.1888, simple_loss=0.2604, pruned_loss=0.05862, over 21586.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3058, pruned_loss=0.07768, over 4279876.27 frames. ], batch size: 298, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:22:25,272 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.878e+02 9.346e+02 1.272e+03 1.765e+03 3.710e+03, threshold=2.544e+03, percent-clipped=11.0 2023-06-25 14:22:25,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2005596.0, ans=0.0 2023-06-25 14:22:43,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.07 vs. limit=15.0 2023-06-25 14:22:59,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2005656.0, ans=0.125 2023-06-25 14:23:38,524 INFO [train.py:996] (2/4) Epoch 11, batch 29350, loss[loss=0.2109, simple_loss=0.2968, pruned_loss=0.06255, over 21510.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3018, pruned_loss=0.07746, over 4282791.72 frames. ], batch size: 195, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:23:57,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2005836.0, ans=0.125 2023-06-25 14:24:16,457 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:24:16,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2005896.0, ans=0.04949747468305833 2023-06-25 14:24:21,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2005896.0, ans=0.125 2023-06-25 14:25:00,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2005956.0, ans=0.125 2023-06-25 14:25:00,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2005956.0, ans=0.125 2023-06-25 14:25:26,787 INFO [train.py:996] (2/4) Epoch 11, batch 29400, loss[loss=0.1974, simple_loss=0.3032, pruned_loss=0.04574, over 20803.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.302, pruned_loss=0.07529, over 4263658.58 frames. 
], batch size: 608, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:26:03,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2006136.0, ans=0.0 2023-06-25 14:26:04,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.130e+02 8.606e+02 1.280e+03 1.886e+03 3.409e+03, threshold=2.560e+03, percent-clipped=11.0 2023-06-25 14:26:26,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2006196.0, ans=0.125 2023-06-25 14:27:14,994 INFO [train.py:996] (2/4) Epoch 11, batch 29450, loss[loss=0.2572, simple_loss=0.3274, pruned_loss=0.09356, over 21269.00 frames. ], tot_loss[loss=0.224, simple_loss=0.2997, pruned_loss=0.07415, over 4261738.56 frames. ], batch size: 159, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:28:27,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2006556.0, ans=0.1 2023-06-25 14:28:29,840 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.95 vs. limit=12.0 2023-06-25 14:28:54,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2006616.0, ans=0.125 2023-06-25 14:29:00,480 INFO [train.py:996] (2/4) Epoch 11, batch 29500, loss[loss=0.2607, simple_loss=0.3299, pruned_loss=0.09577, over 21847.00 frames. ], tot_loss[loss=0.2302, simple_loss=0.3048, pruned_loss=0.07783, over 4268265.41 frames. ], batch size: 414, lr: 2.60e-03, grad_scale: 16.0 2023-06-25 14:29:00,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2006676.0, ans=0.1 2023-06-25 14:29:11,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2006676.0, ans=0.035 2023-06-25 14:29:23,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-25 14:29:32,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2006736.0, ans=0.125 2023-06-25 14:29:44,968 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.231e+02 8.728e+02 1.187e+03 1.757e+03 3.879e+03, threshold=2.373e+03, percent-clipped=3.0 2023-06-25 14:30:01,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2006856.0, ans=0.125 2023-06-25 14:30:48,722 INFO [train.py:996] (2/4) Epoch 11, batch 29550, loss[loss=0.2408, simple_loss=0.3053, pruned_loss=0.08815, over 21342.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3046, pruned_loss=0.07929, over 4280821.72 frames. 
], batch size: 176, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 14:30:49,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2006976.0, ans=0.0 2023-06-25 14:30:50,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2006976.0, ans=0.0 2023-06-25 14:31:29,793 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:31:40,449 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.71 vs. limit=15.0 2023-06-25 14:32:42,226 INFO [train.py:996] (2/4) Epoch 11, batch 29600, loss[loss=0.3272, simple_loss=0.4184, pruned_loss=0.1181, over 21226.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3111, pruned_loss=0.08153, over 4284218.60 frames. ], batch size: 548, lr: 2.59e-03, grad_scale: 32.0 2023-06-25 14:32:51,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2007276.0, ans=0.04949747468305833 2023-06-25 14:33:22,205 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.467e+02 8.361e+02 1.294e+03 2.319e+03 6.850e+03, threshold=2.587e+03, percent-clipped=23.0 2023-06-25 14:34:03,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2007516.0, ans=0.125 2023-06-25 14:34:08,930 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:34:27,152 INFO [train.py:996] (2/4) Epoch 11, batch 29650, loss[loss=0.1856, simple_loss=0.2625, pruned_loss=0.05433, over 21764.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3073, pruned_loss=0.07788, over 4276315.17 frames. ], batch size: 247, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:35:11,922 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.37 vs. limit=15.0 2023-06-25 14:36:16,266 INFO [train.py:996] (2/4) Epoch 11, batch 29700, loss[loss=0.2137, simple_loss=0.3249, pruned_loss=0.05128, over 19861.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3094, pruned_loss=0.07758, over 4281204.34 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:36:20,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2007876.0, ans=0.1 2023-06-25 14:36:34,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.min_positive, batch_count=2007876.0, ans=0.025 2023-06-25 14:36:34,230 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2007876.0, ans=0.0 2023-06-25 14:36:56,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.35 vs. 
limit=22.5 2023-06-25 14:37:02,465 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.011e+02 9.154e+02 1.304e+03 2.529e+03 6.535e+03, threshold=2.607e+03, percent-clipped=22.0 2023-06-25 14:37:21,753 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:37:27,108 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2008056.0, ans=0.125 2023-06-25 14:37:28,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2008056.0, ans=0.0 2023-06-25 14:37:40,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2008116.0, ans=0.0 2023-06-25 14:37:50,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2008116.0, ans=0.125 2023-06-25 14:37:53,082 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.40 vs. limit=10.0 2023-06-25 14:38:01,611 INFO [train.py:996] (2/4) Epoch 11, batch 29750, loss[loss=0.2803, simple_loss=0.3614, pruned_loss=0.09957, over 21576.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3138, pruned_loss=0.07737, over 4283733.23 frames. ], batch size: 507, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:38:33,319 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.03 vs. limit=22.5 2023-06-25 14:39:44,287 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 14:39:45,386 INFO [train.py:996] (2/4) Epoch 11, batch 29800, loss[loss=0.2451, simple_loss=0.3118, pruned_loss=0.08918, over 21391.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3153, pruned_loss=0.07893, over 4286936.06 frames. ], batch size: 159, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:40:28,164 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2008596.0, ans=0.0 2023-06-25 14:40:30,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.324e+02 8.347e+02 1.266e+03 1.868e+03 3.431e+03, threshold=2.532e+03, percent-clipped=7.0 2023-06-25 14:40:42,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2008596.0, ans=0.125 2023-06-25 14:40:51,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2008656.0, ans=0.1 2023-06-25 14:40:56,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2008656.0, ans=0.07 2023-06-25 14:41:30,266 INFO [train.py:996] (2/4) Epoch 11, batch 29850, loss[loss=0.2106, simple_loss=0.2867, pruned_loss=0.06725, over 21869.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3111, pruned_loss=0.0765, over 4284089.96 frames. ], batch size: 371, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:42:12,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.73 vs. 
limit=22.5 2023-06-25 14:42:14,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=13.79 vs. limit=15.0 2023-06-25 14:42:46,463 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.77 vs. limit=15.0 2023-06-25 14:43:16,104 INFO [train.py:996] (2/4) Epoch 11, batch 29900, loss[loss=0.2482, simple_loss=0.3203, pruned_loss=0.08805, over 21326.00 frames. ], tot_loss[loss=0.2328, simple_loss=0.3101, pruned_loss=0.07778, over 4289705.68 frames. ], batch size: 143, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:43:51,946 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.00 vs. limit=15.0 2023-06-25 14:44:02,445 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.086e+02 7.684e+02 1.155e+03 1.725e+03 4.466e+03, threshold=2.311e+03, percent-clipped=10.0 2023-06-25 14:44:41,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2009256.0, ans=0.125 2023-06-25 14:44:54,817 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.88 vs. limit=6.0 2023-06-25 14:45:08,281 INFO [train.py:996] (2/4) Epoch 11, batch 29950, loss[loss=0.2459, simple_loss=0.327, pruned_loss=0.08238, over 21252.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3145, pruned_loss=0.08168, over 4286315.49 frames. ], batch size: 143, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:45:10,343 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2009376.0, ans=0.125 2023-06-25 14:46:55,261 INFO [train.py:996] (2/4) Epoch 11, batch 30000, loss[loss=0.1964, simple_loss=0.2885, pruned_loss=0.05215, over 21734.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3168, pruned_loss=0.08151, over 4283429.65 frames. ], batch size: 247, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 14:46:55,262 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 14:47:14,794 INFO [train.py:1028] (2/4) Epoch 11, validation: loss=0.2475, simple_loss=0.3451, pruned_loss=0.07497, over 1796401.00 frames. 2023-06-25 14:47:14,795 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 14:47:29,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2009676.0, ans=0.1 2023-06-25 14:47:33,492 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.17 vs. limit=15.0 2023-06-25 14:48:03,290 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.762e+02 8.420e+02 1.340e+03 1.867e+03 3.638e+03, threshold=2.681e+03, percent-clipped=9.0 2023-06-25 14:48:27,653 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2009856.0, ans=0.125 2023-06-25 14:48:34,884 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2009856.0, ans=0.0 2023-06-25 14:48:59,913 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.61 vs. 
limit=6.0 2023-06-25 14:49:02,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2009916.0, ans=0.125 2023-06-25 14:49:16,215 INFO [train.py:996] (2/4) Epoch 11, batch 30050, loss[loss=0.2683, simple_loss=0.3692, pruned_loss=0.08369, over 21750.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3183, pruned_loss=0.07839, over 4273127.82 frames. ], batch size: 332, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:49:59,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2010096.0, ans=0.07 2023-06-25 14:50:43,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2010216.0, ans=0.125 2023-06-25 14:50:45,710 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.71 vs. limit=10.0 2023-06-25 14:51:01,283 INFO [train.py:996] (2/4) Epoch 11, batch 30100, loss[loss=0.2253, simple_loss=0.2772, pruned_loss=0.08665, over 21224.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3161, pruned_loss=0.0779, over 4257862.77 frames. ], batch size: 176, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:51:44,621 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.182e+02 9.788e+02 1.555e+03 2.396e+03 5.388e+03, threshold=3.111e+03, percent-clipped=17.0 2023-06-25 14:52:41,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.14 vs. limit=12.0 2023-06-25 14:52:42,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2010516.0, ans=0.125 2023-06-25 14:52:49,214 INFO [train.py:996] (2/4) Epoch 11, batch 30150, loss[loss=0.2775, simple_loss=0.3516, pruned_loss=0.1017, over 21827.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3134, pruned_loss=0.08005, over 4246996.80 frames. ], batch size: 124, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:52:49,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2010576.0, ans=0.125 2023-06-25 14:52:51,166 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2010576.0, ans=0.125 2023-06-25 14:52:57,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2010576.0, ans=0.0 2023-06-25 14:53:11,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.42 vs. 
limit=15.0 2023-06-25 14:53:38,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2010696.0, ans=0.07 2023-06-25 14:54:12,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2010756.0, ans=0.125 2023-06-25 14:54:32,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2010816.0, ans=0.05 2023-06-25 14:54:43,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2010816.0, ans=0.2 2023-06-25 14:54:46,713 INFO [train.py:996] (2/4) Epoch 11, batch 30200, loss[loss=0.236, simple_loss=0.3097, pruned_loss=0.0811, over 21762.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3156, pruned_loss=0.07869, over 4257042.53 frames. ], batch size: 124, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:55:34,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.364e+02 8.137e+02 1.157e+03 1.769e+03 3.974e+03, threshold=2.314e+03, percent-clipped=2.0 2023-06-25 14:55:43,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2010996.0, ans=0.125 2023-06-25 14:55:58,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2011056.0, ans=0.125 2023-06-25 14:55:58,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2011056.0, ans=0.125 2023-06-25 14:56:34,976 INFO [train.py:996] (2/4) Epoch 11, batch 30250, loss[loss=0.2491, simple_loss=0.3568, pruned_loss=0.07067, over 21765.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3238, pruned_loss=0.08128, over 4264953.21 frames. ], batch size: 282, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:56:35,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2011176.0, ans=0.0 2023-06-25 14:57:34,355 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2011296.0, ans=0.125 2023-06-25 14:58:11,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.62 vs. limit=8.0 2023-06-25 14:58:20,274 INFO [train.py:996] (2/4) Epoch 11, batch 30300, loss[loss=0.1908, simple_loss=0.2575, pruned_loss=0.06205, over 21520.00 frames. ], tot_loss[loss=0.2431, simple_loss=0.3223, pruned_loss=0.08198, over 4264616.65 frames. ], batch size: 231, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 14:58:29,330 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2011476.0, ans=0.125 2023-06-25 14:58:32,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2011476.0, ans=0.2 2023-06-25 14:58:49,174 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=6.34 vs. 
limit=12.0 2023-06-25 14:59:02,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2011536.0, ans=0.0 2023-06-25 14:59:14,778 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.082e+02 1.032e+03 1.380e+03 1.875e+03 4.556e+03, threshold=2.761e+03, percent-clipped=17.0 2023-06-25 14:59:42,581 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:00:21,504 INFO [train.py:996] (2/4) Epoch 11, batch 30350, loss[loss=0.2163, simple_loss=0.2739, pruned_loss=0.07933, over 20217.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3237, pruned_loss=0.08333, over 4262525.26 frames. ], batch size: 703, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 15:01:36,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2012016.0, ans=0.1 2023-06-25 15:01:40,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2012016.0, ans=0.2 2023-06-25 15:01:43,049 INFO [train.py:996] (2/4) Epoch 11, batch 30400, loss[loss=0.1966, simple_loss=0.2555, pruned_loss=0.06883, over 20311.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3184, pruned_loss=0.08173, over 4253267.74 frames. ], batch size: 703, lr: 2.59e-03, grad_scale: 16.0 2023-06-25 15:02:04,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2012136.0, ans=0.0 2023-06-25 15:02:09,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2012136.0, ans=0.125 2023-06-25 15:02:24,036 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.345e+02 1.116e+03 1.633e+03 2.614e+03 1.022e+04, threshold=3.266e+03, percent-clipped=19.0 2023-06-25 15:03:11,919 INFO [train.py:996] (2/4) Epoch 11, batch 30450, loss[loss=0.3021, simple_loss=0.4136, pruned_loss=0.0953, over 19943.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3185, pruned_loss=0.08041, over 4196024.45 frames. ], batch size: 702, lr: 2.59e-03, grad_scale: 8.0 2023-06-25 15:03:28,770 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2012436.0, ans=0.2 2023-06-25 15:03:37,321 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:04:03,457 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2012556.0, ans=0.125 2023-06-25 15:06:14,777 INFO [train.py:996] (2/4) Epoch 12, batch 0, loss[loss=0.2193, simple_loss=0.2817, pruned_loss=0.07847, over 21395.00 frames. ], tot_loss[loss=0.2193, simple_loss=0.2817, pruned_loss=0.07847, over 21395.00 frames. ], batch size: 212, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:06:14,778 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 15:06:38,475 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.246, simple_loss=0.3509, pruned_loss=0.07057, over 1796401.00 frames. 
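Note on the optim.py:471 entries that recur throughout this section: each one summarizes gradient clipping over a recent window of batches, reporting five grad-norm statistics (apparently min, 25%, median, 75%, and max), a clipping threshold, and the percentage of batches whose norm exceeded that threshold. In every such entry here the threshold is Clipping_scale (2.0) times the reported median, e.g. 2.0 * 1.257e+03 gives the 2.513e+03 logged near the start of this section. The snippet below is a minimal illustrative sketch of that bookkeeping under those assumptions only; the window size, the function name, and the exact definition of percent-clipped are guesses for illustration and are not taken from the actual optim.py.

    import torch

    def clipping_report(recent_norms, clipping_scale=2.0):
        # recent_norms: 1-D tensor of total gradient norms from the last
        # few hundred batches (the window size here is an assumption).
        stats = torch.quantile(
            recent_norms, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
        )
        # Threshold = 2.0 * median, consistent with the numbers in this log.
        threshold = clipping_scale * stats[2]
        # Fraction of recent batches whose norm exceeded the threshold
        # (assumed meaning of "percent-clipped").
        percent_clipped = 100.0 * (recent_norms > threshold).float().mean()
        return stats, threshold, percent_clipped

    # Toy check with norms on the scale of the entries above:
    norms = torch.tensor([534.5, 828.3, 1257.0, 1912.0, 3792.0])
    stats, thr, pct = clipping_report(norms)
    print("grad-norm quartiles", [round(float(v), 1) for v in stats],
          "threshold", float(thr), "percent-clipped", float(pct))

Under that reading, a spike such as percent-clipped=51.0 in the entry at the start of epoch 12 just below simply means that roughly half of the recent batch gradient norms exceeded twice the recent median.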
2023-06-25 15:06:38,476 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 15:06:42,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2012646.0, ans=0.125 2023-06-25 15:06:45,331 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2012646.0, ans=0.125 2023-06-25 15:06:57,974 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.73 vs. limit=22.5 2023-06-25 15:07:11,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2012706.0, ans=0.2 2023-06-25 15:07:19,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2012766.0, ans=0.0 2023-06-25 15:07:29,957 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.737e+02 2.108e+03 3.291e+03 4.750e+03 1.246e+04, threshold=6.583e+03, percent-clipped=51.0 2023-06-25 15:07:38,490 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=15.0 2023-06-25 15:07:44,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2012826.0, ans=0.125 2023-06-25 15:08:17,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2012946.0, ans=0.1 2023-06-25 15:08:23,996 INFO [train.py:996] (2/4) Epoch 12, batch 50, loss[loss=0.1907, simple_loss=0.2697, pruned_loss=0.05589, over 21800.00 frames. ], tot_loss[loss=0.2388, simple_loss=0.3166, pruned_loss=0.08044, over 968875.76 frames. ], batch size: 124, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:09:22,046 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2013126.0, ans=0.5 2023-06-25 15:09:31,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2013126.0, ans=0.125 2023-06-25 15:10:07,059 INFO [train.py:996] (2/4) Epoch 12, batch 100, loss[loss=0.2547, simple_loss=0.3392, pruned_loss=0.08509, over 21770.00 frames. ], tot_loss[loss=0.248, simple_loss=0.333, pruned_loss=0.0815, over 1700423.32 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:10:15,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.12 vs. 
limit=15.0 2023-06-25 15:10:20,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2013246.0, ans=0.0 2023-06-25 15:10:39,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2013306.0, ans=0.125 2023-06-25 15:11:03,481 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.950e+02 8.744e+02 1.296e+03 2.082e+03 4.002e+03, threshold=2.593e+03, percent-clipped=0.0 2023-06-25 15:11:10,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2013426.0, ans=0.2 2023-06-25 15:11:10,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.19 vs. limit=22.5 2023-06-25 15:11:43,829 INFO [train.py:996] (2/4) Epoch 12, batch 150, loss[loss=0.2894, simple_loss=0.368, pruned_loss=0.1054, over 21738.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.3308, pruned_loss=0.08056, over 2273758.68 frames. ], batch size: 441, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:12:35,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2013666.0, ans=0.125 2023-06-25 15:12:45,438 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:13:11,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2013786.0, ans=0.1 2023-06-25 15:13:21,448 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2013786.0, ans=0.125 2023-06-25 15:13:32,355 INFO [train.py:996] (2/4) Epoch 12, batch 200, loss[loss=0.2578, simple_loss=0.3412, pruned_loss=0.0872, over 21455.00 frames. ], tot_loss[loss=0.2451, simple_loss=0.3295, pruned_loss=0.08032, over 2721876.62 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:14:31,668 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.459e+02 8.933e+02 1.301e+03 1.792e+03 3.949e+03, threshold=2.602e+03, percent-clipped=5.0 2023-06-25 15:15:20,319 INFO [train.py:996] (2/4) Epoch 12, batch 250, loss[loss=0.2794, simple_loss=0.3391, pruned_loss=0.1099, over 21407.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.3266, pruned_loss=0.08082, over 3060210.88 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:15:45,096 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-06-25 15:16:35,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2014326.0, ans=0.125 2023-06-25 15:16:52,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2014386.0, ans=0.125 2023-06-25 15:17:00,568 INFO [train.py:996] (2/4) Epoch 12, batch 300, loss[loss=0.2573, simple_loss=0.342, pruned_loss=0.08635, over 21445.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3217, pruned_loss=0.08097, over 3323351.60 frames. 
], batch size: 548, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:17:55,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2014566.0, ans=0.0 2023-06-25 15:18:01,871 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.795e+02 8.285e+02 1.102e+03 1.636e+03 4.756e+03, threshold=2.203e+03, percent-clipped=8.0 2023-06-25 15:18:13,864 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.86 vs. limit=10.0 2023-06-25 15:18:17,103 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:18:24,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2014626.0, ans=0.125 2023-06-25 15:18:47,690 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.23 vs. limit=22.5 2023-06-25 15:18:49,638 INFO [train.py:996] (2/4) Epoch 12, batch 350, loss[loss=0.2617, simple_loss=0.3511, pruned_loss=0.08617, over 21459.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3149, pruned_loss=0.07889, over 3522890.81 frames. ], batch size: 471, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:19:47,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2014866.0, ans=0.1 2023-06-25 15:20:06,061 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2014926.0, ans=0.125 2023-06-25 15:20:07,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2014926.0, ans=0.125 2023-06-25 15:20:09,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2014926.0, ans=0.2 2023-06-25 15:20:37,708 INFO [train.py:996] (2/4) Epoch 12, batch 400, loss[loss=0.2144, simple_loss=0.2971, pruned_loss=0.06578, over 21700.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3065, pruned_loss=0.07711, over 3679483.64 frames. ], batch size: 333, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:20:41,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2015046.0, ans=0.125 2023-06-25 15:21:23,706 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:21:27,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2015166.0, ans=0.0 2023-06-25 15:21:37,413 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.810e+02 9.305e+02 1.280e+03 1.857e+03 4.239e+03, threshold=2.560e+03, percent-clipped=17.0 2023-06-25 15:21:41,545 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:22:24,727 INFO [train.py:996] (2/4) Epoch 12, batch 450, loss[loss=0.1826, simple_loss=0.2561, pruned_loss=0.05453, over 21786.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3015, pruned_loss=0.07543, over 3813731.80 frames. 
], batch size: 317, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:22:25,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2015346.0, ans=0.125 2023-06-25 15:22:44,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2015346.0, ans=0.0 2023-06-25 15:23:06,054 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2015406.0, ans=0.035 2023-06-25 15:23:13,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2015466.0, ans=0.2 2023-06-25 15:23:13,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2015466.0, ans=0.125 2023-06-25 15:23:46,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2015586.0, ans=10.0 2023-06-25 15:24:15,400 INFO [train.py:996] (2/4) Epoch 12, batch 500, loss[loss=0.2027, simple_loss=0.2938, pruned_loss=0.05578, over 21613.00 frames. ], tot_loss[loss=0.224, simple_loss=0.3017, pruned_loss=0.07314, over 3914298.76 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:24:46,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.07 vs. limit=15.0 2023-06-25 15:25:14,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.486e+02 9.539e+02 1.426e+03 2.120e+03 6.298e+03, threshold=2.852e+03, percent-clipped=19.0 2023-06-25 15:25:15,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2015826.0, ans=0.125 2023-06-25 15:26:02,525 INFO [train.py:996] (2/4) Epoch 12, batch 550, loss[loss=0.2465, simple_loss=0.318, pruned_loss=0.08752, over 21796.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3056, pruned_loss=0.07303, over 3997385.81 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:26:03,681 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.58 vs. limit=10.0 2023-06-25 15:26:03,842 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=14.71 vs. limit=15.0 2023-06-25 15:26:07,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.16 vs. limit=15.0 2023-06-25 15:26:23,251 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2016006.0, ans=0.125 2023-06-25 15:27:04,718 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.36 vs. limit=12.0 2023-06-25 15:27:17,747 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.26 vs. 
limit=15.0 2023-06-25 15:27:27,512 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2016186.0, ans=0.07 2023-06-25 15:27:49,294 INFO [train.py:996] (2/4) Epoch 12, batch 600, loss[loss=0.2296, simple_loss=0.2905, pruned_loss=0.08432, over 21870.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3125, pruned_loss=0.07462, over 4067434.17 frames. ], batch size: 107, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:28:49,899 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.802e+02 1.034e+03 1.598e+03 2.241e+03 5.970e+03, threshold=3.196e+03, percent-clipped=11.0 2023-06-25 15:29:21,160 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 15:29:38,777 INFO [train.py:996] (2/4) Epoch 12, batch 650, loss[loss=0.2244, simple_loss=0.3077, pruned_loss=0.07061, over 21432.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3111, pruned_loss=0.07418, over 4118023.24 frames. ], batch size: 211, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:30:38,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2016666.0, ans=0.05 2023-06-25 15:30:50,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2016726.0, ans=0.125 2023-06-25 15:30:53,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2016726.0, ans=0.0 2023-06-25 15:31:28,092 INFO [train.py:996] (2/4) Epoch 12, batch 700, loss[loss=0.2641, simple_loss=0.3327, pruned_loss=0.09772, over 21831.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3097, pruned_loss=0.07416, over 4159193.72 frames. ], batch size: 107, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:31:39,551 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.48 vs. limit=22.5 2023-06-25 15:32:28,837 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.353e+02 8.196e+02 1.234e+03 2.056e+03 5.759e+03, threshold=2.467e+03, percent-clipped=11.0 2023-06-25 15:32:45,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-25 15:32:49,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.42 vs. limit=12.0 2023-06-25 15:32:52,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.30 vs. limit=12.0 2023-06-25 15:33:16,916 INFO [train.py:996] (2/4) Epoch 12, batch 750, loss[loss=0.1948, simple_loss=0.274, pruned_loss=0.05777, over 21896.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3126, pruned_loss=0.07671, over 4195128.18 frames. 
], batch size: 316, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:33:20,762 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2017146.0, ans=0.0 2023-06-25 15:33:27,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2017146.0, ans=0.125 2023-06-25 15:34:12,345 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2017266.0, ans=0.125 2023-06-25 15:34:35,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2017386.0, ans=0.0 2023-06-25 15:34:41,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2017386.0, ans=0.04949747468305833 2023-06-25 15:35:08,434 INFO [train.py:996] (2/4) Epoch 12, batch 800, loss[loss=0.2189, simple_loss=0.2868, pruned_loss=0.07555, over 21802.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3092, pruned_loss=0.07724, over 4211635.05 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:35:37,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2017506.0, ans=0.125 2023-06-25 15:36:04,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2017566.0, ans=0.05 2023-06-25 15:36:10,820 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.338e+02 9.265e+02 1.322e+03 1.960e+03 3.991e+03, threshold=2.645e+03, percent-clipped=16.0 2023-06-25 15:36:11,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2017626.0, ans=0.2 2023-06-25 15:36:58,030 INFO [train.py:996] (2/4) Epoch 12, batch 850, loss[loss=0.2513, simple_loss=0.3107, pruned_loss=0.09598, over 21286.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3066, pruned_loss=0.0782, over 4234419.68 frames. ], batch size: 176, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:37:49,927 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2017866.0, ans=0.125 2023-06-25 15:37:56,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2017866.0, ans=0.0 2023-06-25 15:38:46,353 INFO [train.py:996] (2/4) Epoch 12, batch 900, loss[loss=0.2124, simple_loss=0.2789, pruned_loss=0.07298, over 21629.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3058, pruned_loss=0.07863, over 4250011.45 frames. ], batch size: 391, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:38:46,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2018046.0, ans=0.125 2023-06-25 15:38:52,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2018046.0, ans=0.125 2023-06-25 15:39:05,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2018046.0, ans=0.0 2023-06-25 15:39:13,251 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.42 vs. 
limit=15.0 2023-06-25 15:39:49,737 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.898e+02 1.048e+03 1.588e+03 2.681e+03 4.714e+03, threshold=3.177e+03, percent-clipped=25.0 2023-06-25 15:40:29,979 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.82 vs. limit=15.0 2023-06-25 15:40:37,412 INFO [train.py:996] (2/4) Epoch 12, batch 950, loss[loss=0.2781, simple_loss=0.3476, pruned_loss=0.1043, over 21872.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3043, pruned_loss=0.07759, over 4256252.62 frames. ], batch size: 371, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:41:09,234 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2018406.0, ans=0.125 2023-06-25 15:42:25,956 INFO [train.py:996] (2/4) Epoch 12, batch 1000, loss[loss=0.1865, simple_loss=0.272, pruned_loss=0.05055, over 21595.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3043, pruned_loss=0.07701, over 4263012.86 frames. ], batch size: 230, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:42:38,633 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2018646.0, ans=0.125 2023-06-25 15:42:38,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2018646.0, ans=0.0 2023-06-25 15:43:16,991 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.50 vs. limit=12.0 2023-06-25 15:43:29,125 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.398e+02 9.331e+02 1.352e+03 1.940e+03 3.326e+03, threshold=2.703e+03, percent-clipped=1.0 2023-06-25 15:44:05,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2018886.0, ans=0.1 2023-06-25 15:44:21,938 INFO [train.py:996] (2/4) Epoch 12, batch 1050, loss[loss=0.3186, simple_loss=0.3752, pruned_loss=0.131, over 21579.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3044, pruned_loss=0.07671, over 4265568.10 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:44:23,949 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2018946.0, ans=0.125 2023-06-25 15:44:24,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2018946.0, ans=0.125 2023-06-25 15:44:54,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.76 vs. limit=12.0 2023-06-25 15:45:02,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2019066.0, ans=0.125 2023-06-25 15:46:01,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2019186.0, ans=0.2 2023-06-25 15:46:13,561 INFO [train.py:996] (2/4) Epoch 12, batch 1100, loss[loss=0.1881, simple_loss=0.2889, pruned_loss=0.04367, over 21773.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3042, pruned_loss=0.07551, over 4269348.99 frames. 
], batch size: 351, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:46:31,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2019246.0, ans=0.0 2023-06-25 15:46:46,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2019306.0, ans=0.0 2023-06-25 15:46:58,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2019366.0, ans=0.125 2023-06-25 15:47:12,025 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.833e+02 8.143e+02 1.241e+03 1.792e+03 5.093e+03, threshold=2.482e+03, percent-clipped=8.0 2023-06-25 15:47:13,223 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.64 vs. limit=6.0 2023-06-25 15:47:23,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2019426.0, ans=0.125 2023-06-25 15:47:25,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2019426.0, ans=0.0 2023-06-25 15:47:59,975 INFO [train.py:996] (2/4) Epoch 12, batch 1150, loss[loss=0.2245, simple_loss=0.2991, pruned_loss=0.07494, over 21517.00 frames. ], tot_loss[loss=0.227, simple_loss=0.304, pruned_loss=0.07503, over 4274419.49 frames. ], batch size: 548, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:48:13,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.99 vs. limit=10.0 2023-06-25 15:48:38,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2019666.0, ans=0.125 2023-06-25 15:49:12,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.14 vs. limit=15.0 2023-06-25 15:49:25,790 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.04 vs. limit=15.0 2023-06-25 15:49:56,068 INFO [train.py:996] (2/4) Epoch 12, batch 1200, loss[loss=0.2377, simple_loss=0.3256, pruned_loss=0.07492, over 21741.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3065, pruned_loss=0.07604, over 4276921.38 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:50:14,964 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2019906.0, ans=0.0 2023-06-25 15:50:52,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2019966.0, ans=0.125 2023-06-25 15:50:59,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2020026.0, ans=0.125 2023-06-25 15:51:00,438 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.676e+02 8.295e+02 1.207e+03 1.708e+03 3.534e+03, threshold=2.414e+03, percent-clipped=4.0 2023-06-25 15:51:14,349 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.16 vs. 
limit=15.0 2023-06-25 15:51:22,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=9.78 vs. limit=15.0 2023-06-25 15:51:47,238 INFO [train.py:996] (2/4) Epoch 12, batch 1250, loss[loss=0.2196, simple_loss=0.3213, pruned_loss=0.05899, over 21750.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3083, pruned_loss=0.07759, over 4281726.68 frames. ], batch size: 351, lr: 2.47e-03, grad_scale: 32.0 2023-06-25 15:52:28,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2020206.0, ans=0.1 2023-06-25 15:53:41,552 INFO [train.py:996] (2/4) Epoch 12, batch 1300, loss[loss=0.2551, simple_loss=0.3509, pruned_loss=0.07966, over 21889.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.31, pruned_loss=0.07846, over 4275787.60 frames. ], batch size: 372, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:53:45,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.94 vs. limit=15.0 2023-06-25 15:54:00,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2020506.0, ans=0.125 2023-06-25 15:54:12,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2020506.0, ans=0.125 2023-06-25 15:54:39,201 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2020566.0, ans=0.0 2023-06-25 15:54:46,924 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.616e+02 1.419e+03 1.872e+03 2.553e+03 5.619e+03, threshold=3.744e+03, percent-clipped=29.0 2023-06-25 15:55:18,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2020686.0, ans=0.125 2023-06-25 15:55:30,066 INFO [train.py:996] (2/4) Epoch 12, batch 1350, loss[loss=0.1976, simple_loss=0.2693, pruned_loss=0.06291, over 21236.00 frames. ], tot_loss[loss=0.2315, simple_loss=0.3083, pruned_loss=0.07733, over 4278467.42 frames. ], batch size: 608, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:55:35,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2020746.0, ans=0.0 2023-06-25 15:56:51,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2020926.0, ans=0.125 2023-06-25 15:57:12,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2020986.0, ans=0.0 2023-06-25 15:57:17,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2021046.0, ans=0.125 2023-06-25 15:57:18,216 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.68 vs. limit=15.0 2023-06-25 15:57:18,666 INFO [train.py:996] (2/4) Epoch 12, batch 1400, loss[loss=0.2258, simple_loss=0.294, pruned_loss=0.07881, over 21705.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3056, pruned_loss=0.07771, over 4277281.53 frames. 
], batch size: 230, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 15:57:19,469 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2021046.0, ans=0.0 2023-06-25 15:57:25,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2021046.0, ans=0.0 2023-06-25 15:58:00,708 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2021106.0, ans=0.015 2023-06-25 15:58:01,383 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-06-25 15:58:31,004 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 6.812e+02 9.835e+02 1.527e+03 2.832e+03, threshold=1.967e+03, percent-clipped=0.0 2023-06-25 15:58:40,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.39 vs. limit=10.0 2023-06-25 15:58:54,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2021286.0, ans=0.125 2023-06-25 15:59:07,927 INFO [train.py:996] (2/4) Epoch 12, batch 1450, loss[loss=0.2371, simple_loss=0.3144, pruned_loss=0.07985, over 21597.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3076, pruned_loss=0.07932, over 4277328.40 frames. ], batch size: 112, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 15:59:15,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2021346.0, ans=0.125 2023-06-25 15:59:20,516 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2021346.0, ans=0.95 2023-06-25 16:00:13,614 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2021466.0, ans=0.2 2023-06-25 16:00:25,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2021526.0, ans=0.125 2023-06-25 16:00:48,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2021586.0, ans=0.1 2023-06-25 16:00:56,934 INFO [train.py:996] (2/4) Epoch 12, batch 1500, loss[loss=0.2507, simple_loss=0.3312, pruned_loss=0.08512, over 21523.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3095, pruned_loss=0.08052, over 4283365.91 frames. ], batch size: 414, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:00:57,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2021646.0, ans=0.1 2023-06-25 16:01:07,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.41 vs. 
limit=15.0 2023-06-25 16:01:09,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2021646.0, ans=0.125 2023-06-25 16:01:16,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2021706.0, ans=0.125 2023-06-25 16:01:37,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2021766.0, ans=0.2 2023-06-25 16:01:41,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.15 vs. limit=15.0 2023-06-25 16:02:00,597 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.71 vs. limit=15.0 2023-06-25 16:02:04,718 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.651e+02 8.922e+02 1.288e+03 1.827e+03 4.851e+03, threshold=2.577e+03, percent-clipped=21.0 2023-06-25 16:02:43,234 INFO [train.py:996] (2/4) Epoch 12, batch 1550, loss[loss=0.2003, simple_loss=0.3045, pruned_loss=0.04803, over 19909.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.309, pruned_loss=0.07913, over 4286102.55 frames. ], batch size: 702, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:02:54,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2021946.0, ans=0.125 2023-06-25 16:02:54,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2021946.0, ans=0.125 2023-06-25 16:03:40,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2022066.0, ans=0.2 2023-06-25 16:04:13,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2022126.0, ans=0.1 2023-06-25 16:04:16,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2022126.0, ans=0.125 2023-06-25 16:04:20,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2022186.0, ans=0.2 2023-06-25 16:04:36,329 INFO [train.py:996] (2/4) Epoch 12, batch 1600, loss[loss=0.2105, simple_loss=0.3008, pruned_loss=0.06007, over 21826.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3103, pruned_loss=0.07954, over 4280336.64 frames. 
], batch size: 282, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:04:58,899 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2022246.0, ans=0.0 2023-06-25 16:05:11,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2022306.0, ans=0.1 2023-06-25 16:05:15,157 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2022306.0, ans=0.1 2023-06-25 16:06:00,527 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.153e+02 9.608e+02 1.362e+03 1.874e+03 5.231e+03, threshold=2.724e+03, percent-clipped=11.0 2023-06-25 16:06:20,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2022486.0, ans=0.125 2023-06-25 16:06:38,420 INFO [train.py:996] (2/4) Epoch 12, batch 1650, loss[loss=0.1985, simple_loss=0.2886, pruned_loss=0.05417, over 21661.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3071, pruned_loss=0.07831, over 4279264.90 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:08:31,828 INFO [train.py:996] (2/4) Epoch 12, batch 1700, loss[loss=0.2349, simple_loss=0.3095, pruned_loss=0.08009, over 21589.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3106, pruned_loss=0.07992, over 4279752.51 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:08:54,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2022846.0, ans=0.0 2023-06-25 16:09:48,452 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 9.466e+02 1.193e+03 1.792e+03 3.263e+03, threshold=2.387e+03, percent-clipped=3.0 2023-06-25 16:09:55,992 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.88 vs. limit=22.5 2023-06-25 16:10:12,642 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2023086.0, ans=0.125 2023-06-25 16:10:32,432 INFO [train.py:996] (2/4) Epoch 12, batch 1750, loss[loss=0.1836, simple_loss=0.2562, pruned_loss=0.0555, over 21375.00 frames. ], tot_loss[loss=0.2332, simple_loss=0.3097, pruned_loss=0.07839, over 4281670.52 frames. ], batch size: 194, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:10:55,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.50 vs. limit=15.0 2023-06-25 16:11:10,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2023206.0, ans=0.125 2023-06-25 16:12:29,639 INFO [train.py:996] (2/4) Epoch 12, batch 1800, loss[loss=0.2308, simple_loss=0.3239, pruned_loss=0.0689, over 21647.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3099, pruned_loss=0.07511, over 4283320.17 frames. ], batch size: 247, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:12:36,537 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.89 vs. limit=6.0 2023-06-25 16:12:56,156 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.68 vs. 
limit=22.5 2023-06-25 16:13:40,957 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.675e+02 9.606e+02 1.409e+03 2.077e+03 5.009e+03, threshold=2.818e+03, percent-clipped=17.0 2023-06-25 16:13:50,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2023626.0, ans=0.125 2023-06-25 16:13:53,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.40 vs. limit=10.0 2023-06-25 16:14:21,247 INFO [train.py:996] (2/4) Epoch 12, batch 1850, loss[loss=0.2697, simple_loss=0.3379, pruned_loss=0.1008, over 21637.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3107, pruned_loss=0.07315, over 4279444.26 frames. ], batch size: 263, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:15:48,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2023926.0, ans=0.125 2023-06-25 16:15:50,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2023926.0, ans=0.025 2023-06-25 16:16:21,266 INFO [train.py:996] (2/4) Epoch 12, batch 1900, loss[loss=0.3194, simple_loss=0.3684, pruned_loss=0.1352, over 21683.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3113, pruned_loss=0.07377, over 4280885.55 frames. ], batch size: 507, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:16:41,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-06-25 16:16:42,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2024106.0, ans=0.07 2023-06-25 16:17:13,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2024166.0, ans=0.1 2023-06-25 16:17:14,145 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.71 vs. limit=5.0 2023-06-25 16:17:30,683 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.998e+02 9.175e+02 1.448e+03 2.003e+03 3.751e+03, threshold=2.896e+03, percent-clipped=10.0 2023-06-25 16:17:31,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=2024226.0, ans=15.0 2023-06-25 16:18:12,822 INFO [train.py:996] (2/4) Epoch 12, batch 1950, loss[loss=0.2178, simple_loss=0.284, pruned_loss=0.07585, over 21833.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3078, pruned_loss=0.07443, over 4279081.85 frames. ], batch size: 118, lr: 2.47e-03, grad_scale: 8.0 2023-06-25 16:19:03,106 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2024466.0, ans=0.125 2023-06-25 16:19:25,338 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2024526.0, ans=0.0 2023-06-25 16:20:06,007 INFO [train.py:996] (2/4) Epoch 12, batch 2000, loss[loss=0.2834, simple_loss=0.375, pruned_loss=0.09591, over 20029.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.3015, pruned_loss=0.07261, over 4263546.01 frames. 
], batch size: 702, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:20:08,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2024646.0, ans=0.015 2023-06-25 16:20:47,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2024766.0, ans=0.0 2023-06-25 16:21:15,646 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 9.911e+02 1.522e+03 2.173e+03 4.229e+03, threshold=3.044e+03, percent-clipped=10.0 2023-06-25 16:21:56,997 INFO [train.py:996] (2/4) Epoch 12, batch 2050, loss[loss=0.2326, simple_loss=0.3319, pruned_loss=0.06662, over 21555.00 frames. ], tot_loss[loss=0.2238, simple_loss=0.303, pruned_loss=0.0723, over 4267849.72 frames. ], batch size: 473, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:22:32,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2025006.0, ans=0.125 2023-06-25 16:22:36,853 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.56 vs. limit=6.0 2023-06-25 16:22:44,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2025066.0, ans=0.0 2023-06-25 16:23:07,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2025126.0, ans=0.1 2023-06-25 16:23:30,095 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.24 vs. limit=15.0 2023-06-25 16:23:44,373 INFO [train.py:996] (2/4) Epoch 12, batch 2100, loss[loss=0.2282, simple_loss=0.3063, pruned_loss=0.07509, over 21252.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3073, pruned_loss=0.07435, over 4275656.19 frames. ], batch size: 549, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:23:46,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2025246.0, ans=0.125 2023-06-25 16:24:16,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2025306.0, ans=0.0 2023-06-25 16:24:56,541 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2025426.0, ans=0.125 2023-06-25 16:24:57,648 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.106e+02 8.786e+02 1.413e+03 2.170e+03 3.783e+03, threshold=2.827e+03, percent-clipped=9.0 2023-06-25 16:25:19,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2025486.0, ans=0.125 2023-06-25 16:25:38,356 INFO [train.py:996] (2/4) Epoch 12, batch 2150, loss[loss=0.2207, simple_loss=0.2775, pruned_loss=0.08192, over 21594.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3077, pruned_loss=0.07587, over 4272187.05 frames. 
], batch size: 263, lr: 2.47e-03, grad_scale: 16.0 2023-06-25 16:25:56,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2025606.0, ans=0.125 2023-06-25 16:25:59,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2025606.0, ans=0.2 2023-06-25 16:26:44,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2025726.0, ans=0.0 2023-06-25 16:26:46,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2025726.0, ans=0.0 2023-06-25 16:26:53,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2025726.0, ans=0.125 2023-06-25 16:27:31,512 INFO [train.py:996] (2/4) Epoch 12, batch 2200, loss[loss=0.2177, simple_loss=0.2783, pruned_loss=0.07858, over 21401.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3068, pruned_loss=0.07556, over 4267232.36 frames. ], batch size: 473, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:27:32,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2025846.0, ans=0.2 2023-06-25 16:27:34,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2025846.0, ans=0.125 2023-06-25 16:27:34,745 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.87 vs. limit=12.0 2023-06-25 16:27:43,243 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.96 vs. limit=22.5 2023-06-25 16:28:14,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.50 vs. limit=15.0 2023-06-25 16:28:29,871 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2025966.0, ans=0.05 2023-06-25 16:28:46,666 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.633e+02 9.347e+02 1.379e+03 2.152e+03 4.543e+03, threshold=2.758e+03, percent-clipped=14.0 2023-06-25 16:29:23,302 INFO [train.py:996] (2/4) Epoch 12, batch 2250, loss[loss=0.2748, simple_loss=0.3606, pruned_loss=0.09448, over 21479.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3063, pruned_loss=0.07484, over 4269021.57 frames. ], batch size: 471, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:29:25,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2026146.0, ans=0.1 2023-06-25 16:30:58,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2026386.0, ans=0.0 2023-06-25 16:31:10,464 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten.whitening_limit, batch_count=2026386.0, ans=22.5 2023-06-25 16:31:14,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.67 vs. limit=15.0 2023-06-25 16:31:14,995 INFO [train.py:996] (2/4) Epoch 12, batch 2300, loss[loss=0.2173, simple_loss=0.288, pruned_loss=0.07335, over 21650.00 frames. 
], tot_loss[loss=0.2255, simple_loss=0.3023, pruned_loss=0.0744, over 4267525.52 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:31:29,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2026446.0, ans=0.125 2023-06-25 16:31:29,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2026446.0, ans=0.0 2023-06-25 16:31:46,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.46 vs. limit=15.0 2023-06-25 16:31:49,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2026506.0, ans=0.95 2023-06-25 16:31:54,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2026566.0, ans=0.1 2023-06-25 16:32:15,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2026566.0, ans=0.07 2023-06-25 16:32:30,816 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-25 16:32:31,105 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.590e+02 8.645e+02 1.249e+03 1.812e+03 4.519e+03, threshold=2.497e+03, percent-clipped=11.0 2023-06-25 16:32:33,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2026626.0, ans=0.125 2023-06-25 16:32:54,861 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.17 vs. limit=15.0 2023-06-25 16:33:07,260 INFO [train.py:996] (2/4) Epoch 12, batch 2350, loss[loss=0.2165, simple_loss=0.2926, pruned_loss=0.07023, over 21611.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.2997, pruned_loss=0.075, over 4268932.67 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:34:47,175 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.04 vs. limit=12.0 2023-06-25 16:34:59,663 INFO [train.py:996] (2/4) Epoch 12, batch 2400, loss[loss=0.2659, simple_loss=0.3316, pruned_loss=0.1001, over 21863.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3056, pruned_loss=0.07788, over 4262749.35 frames. ], batch size: 118, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 16:35:00,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2027046.0, ans=0.125 2023-06-25 16:35:44,066 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.93 vs. 
limit=15.0 2023-06-25 16:35:59,661 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2027166.0, ans=0.2 2023-06-25 16:36:14,551 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2027166.0, ans=0.2 2023-06-25 16:36:28,140 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.619e+02 8.583e+02 1.300e+03 1.930e+03 5.128e+03, threshold=2.600e+03, percent-clipped=13.0 2023-06-25 16:36:32,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2027226.0, ans=0.125 2023-06-25 16:36:43,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2027286.0, ans=0.0 2023-06-25 16:37:04,199 INFO [train.py:996] (2/4) Epoch 12, batch 2450, loss[loss=0.189, simple_loss=0.2664, pruned_loss=0.05585, over 21619.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.3121, pruned_loss=0.08158, over 4272068.44 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:37:35,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2027406.0, ans=0.125 2023-06-25 16:38:27,087 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=15.42 vs. limit=22.5 2023-06-25 16:38:54,300 INFO [train.py:996] (2/4) Epoch 12, batch 2500, loss[loss=0.2013, simple_loss=0.2715, pruned_loss=0.06558, over 21727.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.3081, pruned_loss=0.07988, over 4277030.67 frames. ], batch size: 124, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:39:16,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2027706.0, ans=0.0 2023-06-25 16:39:18,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2027706.0, ans=0.2 2023-06-25 16:39:26,941 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.52 vs. limit=15.0 2023-06-25 16:40:06,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2027826.0, ans=0.125 2023-06-25 16:40:09,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2027826.0, ans=0.125 2023-06-25 16:40:11,313 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.584e+02 1.091e+03 1.592e+03 2.289e+03 5.240e+03, threshold=3.184e+03, percent-clipped=19.0 2023-06-25 16:40:44,936 INFO [train.py:996] (2/4) Epoch 12, batch 2550, loss[loss=0.2066, simple_loss=0.3317, pruned_loss=0.04078, over 20796.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3059, pruned_loss=0.07779, over 4270473.29 frames. 
], batch size: 607, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:41:07,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2028006.0, ans=0.1 2023-06-25 16:42:14,113 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2028186.0, ans=0.1 2023-06-25 16:42:26,153 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.34 vs. limit=15.0 2023-06-25 16:42:26,392 INFO [train.py:996] (2/4) Epoch 12, batch 2600, loss[loss=0.2152, simple_loss=0.2822, pruned_loss=0.07408, over 21578.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.305, pruned_loss=0.07765, over 4248005.00 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:43:12,797 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2028366.0, ans=0.125 2023-06-25 16:43:41,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2028426.0, ans=0.025 2023-06-25 16:43:44,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2028426.0, ans=0.125 2023-06-25 16:43:49,904 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.618e+02 9.833e+02 1.370e+03 2.330e+03 4.697e+03, threshold=2.739e+03, percent-clipped=12.0 2023-06-25 16:44:06,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2028486.0, ans=0.125 2023-06-25 16:44:25,802 INFO [train.py:996] (2/4) Epoch 12, batch 2650, loss[loss=0.2726, simple_loss=0.3537, pruned_loss=0.09573, over 21771.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3087, pruned_loss=0.08049, over 4260556.20 frames. ], batch size: 414, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:44:32,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2028546.0, ans=0.125 2023-06-25 16:44:44,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2028606.0, ans=0.125 2023-06-25 16:45:08,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2028666.0, ans=0.125 2023-06-25 16:45:32,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2023-06-25 16:45:35,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-25 16:45:54,102 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.min_positive, batch_count=2028786.0, ans=0.025 2023-06-25 16:46:18,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=7.72 vs. limit=10.0 2023-06-25 16:46:18,785 INFO [train.py:996] (2/4) Epoch 12, batch 2700, loss[loss=0.2336, simple_loss=0.3028, pruned_loss=0.08223, over 21830.00 frames. ], tot_loss[loss=0.2342, simple_loss=0.3086, pruned_loss=0.07984, over 4265088.94 frames. 
], batch size: 298, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:46:28,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-25 16:46:31,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.42 vs. limit=22.5 2023-06-25 16:47:04,266 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.26 vs. limit=6.0 2023-06-25 16:47:35,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.499e+02 8.174e+02 1.316e+03 1.880e+03 3.948e+03, threshold=2.631e+03, percent-clipped=11.0 2023-06-25 16:48:09,315 INFO [train.py:996] (2/4) Epoch 12, batch 2750, loss[loss=0.2129, simple_loss=0.3044, pruned_loss=0.06067, over 21787.00 frames. ], tot_loss[loss=0.2326, simple_loss=0.3067, pruned_loss=0.07922, over 4269532.66 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:48:24,184 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2029146.0, ans=0.125 2023-06-25 16:49:55,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2029386.0, ans=0.125 2023-06-25 16:50:00,327 INFO [train.py:996] (2/4) Epoch 12, batch 2800, loss[loss=0.2405, simple_loss=0.3221, pruned_loss=0.07951, over 21467.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3112, pruned_loss=0.08018, over 4274340.65 frames. ], batch size: 211, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 16:50:36,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2029506.0, ans=0.125 2023-06-25 16:50:56,128 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-25 16:51:14,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2029626.0, ans=0.0 2023-06-25 16:51:16,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2029626.0, ans=0.125 2023-06-25 16:51:26,719 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.572e+02 1.005e+03 1.362e+03 2.253e+03 4.999e+03, threshold=2.724e+03, percent-clipped=18.0 2023-06-25 16:51:53,167 INFO [train.py:996] (2/4) Epoch 12, batch 2850, loss[loss=0.178, simple_loss=0.2341, pruned_loss=0.06091, over 21277.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3111, pruned_loss=0.08146, over 4274025.35 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:52:44,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2029866.0, ans=0.05 2023-06-25 16:52:56,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2029926.0, ans=0.125 2023-06-25 16:53:22,614 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.81 vs. 
limit=22.5 2023-06-25 16:53:38,286 INFO [train.py:996] (2/4) Epoch 12, batch 2900, loss[loss=0.1952, simple_loss=0.2656, pruned_loss=0.06239, over 21830.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3064, pruned_loss=0.07984, over 4283395.62 frames. ], batch size: 282, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:54:56,898 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.061e+02 9.423e+02 1.341e+03 2.226e+03 4.607e+03, threshold=2.681e+03, percent-clipped=12.0 2023-06-25 16:55:12,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2030286.0, ans=0.2 2023-06-25 16:55:28,740 INFO [train.py:996] (2/4) Epoch 12, batch 2950, loss[loss=0.334, simple_loss=0.428, pruned_loss=0.12, over 21268.00 frames. ], tot_loss[loss=0.234, simple_loss=0.3081, pruned_loss=0.07994, over 4288011.36 frames. ], batch size: 548, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:55:37,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2030346.0, ans=0.125 2023-06-25 16:55:39,592 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2030346.0, ans=10.0 2023-06-25 16:55:48,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2030346.0, ans=0.0 2023-06-25 16:55:52,270 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 16:56:10,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2030406.0, ans=0.0 2023-06-25 16:56:11,014 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.54 vs. limit=15.0 2023-06-25 16:56:17,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2030466.0, ans=0.1 2023-06-25 16:56:31,627 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2030526.0, ans=0.125 2023-06-25 16:56:57,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2030586.0, ans=0.125 2023-06-25 16:57:04,945 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.02 vs. limit=15.0 2023-06-25 16:57:14,420 INFO [train.py:996] (2/4) Epoch 12, batch 3000, loss[loss=0.267, simple_loss=0.3465, pruned_loss=0.09381, over 21751.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3142, pruned_loss=0.08098, over 4291702.85 frames. ], batch size: 124, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:57:14,421 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 16:57:41,089 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2513, simple_loss=0.3439, pruned_loss=0.07939, over 1796401.00 frames. 
2023-06-25 16:57:41,090 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 16:57:54,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2030646.0, ans=0.125 2023-06-25 16:58:43,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.58 vs. limit=15.0 2023-06-25 16:58:51,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2030826.0, ans=0.2 2023-06-25 16:58:52,817 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.014e+02 9.167e+02 1.270e+03 1.811e+03 4.329e+03, threshold=2.541e+03, percent-clipped=6.0 2023-06-25 16:59:25,854 INFO [train.py:996] (2/4) Epoch 12, batch 3050, loss[loss=0.2259, simple_loss=0.3067, pruned_loss=0.0725, over 21782.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3138, pruned_loss=0.07958, over 4296265.32 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 16:59:35,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2030946.0, ans=0.125 2023-06-25 16:59:36,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2030946.0, ans=0.0 2023-06-25 16:59:52,458 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2031006.0, ans=0.1 2023-06-25 16:59:53,498 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.60 vs. limit=15.0 2023-06-25 16:59:56,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031006.0, ans=0.1 2023-06-25 17:00:41,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2031126.0, ans=0.1 2023-06-25 17:00:44,869 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2031126.0, ans=0.0 2023-06-25 17:01:18,219 INFO [train.py:996] (2/4) Epoch 12, batch 3100, loss[loss=0.2294, simple_loss=0.322, pruned_loss=0.06834, over 21640.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3132, pruned_loss=0.07883, over 4297566.73 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:01:36,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031306.0, ans=0.1 2023-06-25 17:01:43,658 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2031306.0, ans=0.0 2023-06-25 17:01:49,757 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.96 vs. 
limit=15.0 2023-06-25 17:01:56,759 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031306.0, ans=0.1 2023-06-25 17:02:21,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2031426.0, ans=0.0 2023-06-25 17:02:25,536 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.476e+02 8.628e+02 1.556e+03 2.270e+03 3.749e+03, threshold=3.112e+03, percent-clipped=16.0 2023-06-25 17:02:54,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2031486.0, ans=0.04949747468305833 2023-06-25 17:03:06,726 INFO [train.py:996] (2/4) Epoch 12, batch 3150, loss[loss=0.3415, simple_loss=0.3943, pruned_loss=0.1444, over 21457.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.314, pruned_loss=0.07931, over 4294582.18 frames. ], batch size: 510, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:03:08,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2031546.0, ans=0.125 2023-06-25 17:03:32,655 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:03:52,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2031666.0, ans=0.2 2023-06-25 17:03:59,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2031666.0, ans=0.125 2023-06-25 17:04:02,907 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:04:51,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-25 17:04:53,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.01 vs. limit=15.0 2023-06-25 17:04:55,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2031786.0, ans=0.1 2023-06-25 17:05:01,262 INFO [train.py:996] (2/4) Epoch 12, batch 3200, loss[loss=0.2651, simple_loss=0.3471, pruned_loss=0.09155, over 21724.00 frames. ], tot_loss[loss=0.2373, simple_loss=0.3153, pruned_loss=0.07969, over 4292992.23 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:05:27,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-25 17:05:44,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2031966.0, ans=0.1 2023-06-25 17:06:03,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2031966.0, ans=0.125 2023-06-25 17:06:21,722 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.588e+02 9.417e+02 1.305e+03 2.002e+03 3.314e+03, threshold=2.610e+03, percent-clipped=4.0 2023-06-25 17:06:45,605 INFO [train.py:996] (2/4) Epoch 12, batch 3250, loss[loss=0.248, simple_loss=0.3062, pruned_loss=0.09492, over 21222.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3164, pruned_loss=0.08086, over 4293905.34 frames. 
], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:07:13,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2032206.0, ans=0.1 2023-06-25 17:07:19,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2032206.0, ans=0.2 2023-06-25 17:07:36,492 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2032266.0, ans=0.1 2023-06-25 17:07:38,014 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2032266.0, ans=0.125 2023-06-25 17:08:13,438 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2032386.0, ans=0.125 2023-06-25 17:08:17,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2032386.0, ans=0.125 2023-06-25 17:08:24,454 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2032386.0, ans=0.09899494936611666 2023-06-25 17:08:38,545 INFO [train.py:996] (2/4) Epoch 12, batch 3300, loss[loss=0.2804, simple_loss=0.3723, pruned_loss=0.09429, over 21452.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3112, pruned_loss=0.08031, over 4290017.17 frames. ], batch size: 507, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:09:25,063 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.27 vs. limit=10.0 2023-06-25 17:09:31,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2032566.0, ans=0.0 2023-06-25 17:09:56,048 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.590e+02 8.857e+02 1.384e+03 2.131e+03 4.581e+03, threshold=2.768e+03, percent-clipped=14.0 2023-06-25 17:10:20,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2032686.0, ans=0.125 2023-06-25 17:10:28,072 INFO [train.py:996] (2/4) Epoch 12, batch 3350, loss[loss=0.2486, simple_loss=0.3212, pruned_loss=0.088, over 21330.00 frames. ], tot_loss[loss=0.2372, simple_loss=0.3127, pruned_loss=0.08089, over 4283792.99 frames. ], batch size: 176, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:10:35,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2032746.0, ans=0.025 2023-06-25 17:10:44,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2032806.0, ans=0.125 2023-06-25 17:10:50,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.13 vs. limit=10.0 2023-06-25 17:11:29,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2032866.0, ans=0.125 2023-06-25 17:11:56,950 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-25 17:12:17,403 INFO [train.py:996] (2/4) Epoch 12, batch 3400, loss[loss=0.2233, simple_loss=0.319, pruned_loss=0.06386, over 21884.00 frames. 
], tot_loss[loss=0.2394, simple_loss=0.3144, pruned_loss=0.08217, over 4282837.73 frames. ], batch size: 371, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:13:48,683 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.332e+02 9.099e+02 1.349e+03 1.796e+03 3.997e+03, threshold=2.698e+03, percent-clipped=5.0 2023-06-25 17:13:59,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2033286.0, ans=0.5 2023-06-25 17:14:13,575 INFO [train.py:996] (2/4) Epoch 12, batch 3450, loss[loss=0.2371, simple_loss=0.3088, pruned_loss=0.08268, over 21737.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3105, pruned_loss=0.08215, over 4281625.90 frames. ], batch size: 316, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:14:32,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2033346.0, ans=0.2 2023-06-25 17:14:48,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2033406.0, ans=0.1 2023-06-25 17:16:10,340 INFO [train.py:996] (2/4) Epoch 12, batch 3500, loss[loss=0.2813, simple_loss=0.3596, pruned_loss=0.1015, over 21565.00 frames. ], tot_loss[loss=0.2448, simple_loss=0.3188, pruned_loss=0.0854, over 4276510.38 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:16:38,288 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:16:47,427 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=15.0 2023-06-25 17:17:32,087 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.249e+02 1.085e+03 1.477e+03 2.105e+03 4.608e+03, threshold=2.953e+03, percent-clipped=10.0 2023-06-25 17:17:39,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2033886.0, ans=0.0 2023-06-25 17:18:17,208 INFO [train.py:996] (2/4) Epoch 12, batch 3550, loss[loss=0.2528, simple_loss=0.3147, pruned_loss=0.09547, over 21757.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.3218, pruned_loss=0.08643, over 4272944.20 frames. 
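A note on the Clipping_scale lines: the five numbers labelled grad-norm quartiles read as the min, 25th, 50th, 75th percentile and max of recently observed gradient norms, and the clipping threshold equals the clipping scale times the median, e.g. 2.0 * 1.349e+03 = 2.698e+03 in the entry above; percent-clipped is then the share of recent batches whose gradient norm exceeded the threshold. Below is a rough, self-contained sketch of that bookkeeping, not the actual optim.py implementation; class and method names are made up for illustration.

    import collections
    import numpy as np

    class GradNormClipper:
        """Track recent gradient norms and clip at clipping_scale * median."""

        def __init__(self, clipping_scale: float = 2.0, history: int = 128):
            self.clipping_scale = clipping_scale
            self.norms = collections.deque(maxlen=history)
            self.num_clipped = 0
            self.num_seen = 0

        def threshold(self) -> float:
            return self.clipping_scale * float(np.median(list(self.norms)))

        def update(self, grad_norm: float) -> float:
            """Record one gradient norm; return the factor to scale grads by."""
            self.norms.append(grad_norm)
            self.num_seen += 1
            t = self.threshold()
            if grad_norm > t:
                self.num_clipped += 1
                return t / grad_norm
            return 1.0

        def summary(self) -> str:
            q = np.percentile(list(self.norms), [0, 25, 50, 75, 100])
            pct = 100.0 * self.num_clipped / max(1, self.num_seen)
            return ("grad-norm quartiles "
                    + " ".join(f"{v:.3e}" for v in q)
                    + f", threshold={self.threshold():.3e}"
                    + f", percent-clipped={pct:.1f}")
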
], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:18:54,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2034066.0, ans=0.0 2023-06-25 17:19:09,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2034066.0, ans=0.125 2023-06-25 17:19:21,446 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2034126.0, ans=0.125 2023-06-25 17:19:37,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2034186.0, ans=0.125 2023-06-25 17:19:39,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2034186.0, ans=0.05 2023-06-25 17:20:00,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2034186.0, ans=0.125 2023-06-25 17:20:01,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2034186.0, ans=0.125 2023-06-25 17:20:05,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2034186.0, ans=0.0 2023-06-25 17:20:12,197 INFO [train.py:996] (2/4) Epoch 12, batch 3600, loss[loss=0.3426, simple_loss=0.4387, pruned_loss=0.1233, over 21608.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.317, pruned_loss=0.08544, over 4274719.26 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:20:24,692 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2034246.0, ans=0.125 2023-06-25 17:20:34,682 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-06-25 17:20:59,671 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=17.40 vs. limit=22.5 2023-06-25 17:21:04,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2034366.0, ans=0.025 2023-06-25 17:21:27,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.714e+02 9.639e+02 1.612e+03 2.381e+03 4.879e+03, threshold=3.225e+03, percent-clipped=14.0 2023-06-25 17:21:51,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2034486.0, ans=0.125 2023-06-25 17:22:02,917 INFO [train.py:996] (2/4) Epoch 12, batch 3650, loss[loss=0.2338, simple_loss=0.3438, pruned_loss=0.0619, over 20951.00 frames. ], tot_loss[loss=0.2439, simple_loss=0.3165, pruned_loss=0.08566, over 4280533.88 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:23:21,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2034786.0, ans=0.125 2023-06-25 17:23:29,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.22 vs. 
limit=15.0 2023-06-25 17:23:51,400 INFO [train.py:996] (2/4) Epoch 12, batch 3700, loss[loss=0.2245, simple_loss=0.2957, pruned_loss=0.07661, over 21840.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3137, pruned_loss=0.08411, over 4282059.50 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:23:55,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2034846.0, ans=0.0 2023-06-25 17:25:05,421 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.252e+02 8.433e+02 1.159e+03 1.645e+03 2.871e+03, threshold=2.319e+03, percent-clipped=0.0 2023-06-25 17:25:07,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2035026.0, ans=0.0 2023-06-25 17:25:39,744 INFO [train.py:996] (2/4) Epoch 12, batch 3750, loss[loss=0.219, simple_loss=0.2919, pruned_loss=0.07299, over 21823.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3129, pruned_loss=0.08348, over 4284624.59 frames. ], batch size: 112, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:25:42,278 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2035146.0, ans=0.125 2023-06-25 17:25:42,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-25 17:25:50,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2035146.0, ans=0.125 2023-06-25 17:25:59,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=22.5 2023-06-25 17:26:26,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2035266.0, ans=0.125 2023-06-25 17:27:28,859 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:27:31,426 INFO [train.py:996] (2/4) Epoch 12, batch 3800, loss[loss=0.2789, simple_loss=0.357, pruned_loss=0.1004, over 21801.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3117, pruned_loss=0.0825, over 4284133.13 frames. ], batch size: 124, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:28:34,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2035626.0, ans=0.0 2023-06-25 17:29:02,407 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.463e+02 9.532e+02 1.366e+03 2.202e+03 4.372e+03, threshold=2.732e+03, percent-clipped=24.0 2023-06-25 17:29:02,984 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2035626.0, ans=0.1 2023-06-25 17:29:23,964 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.75 vs. limit=15.0 2023-06-25 17:29:24,676 INFO [train.py:996] (2/4) Epoch 12, batch 3850, loss[loss=0.2524, simple_loss=0.318, pruned_loss=0.09336, over 21391.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3092, pruned_loss=0.08199, over 4286760.57 frames. 
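A note on the ScheduledFloat lines: values such as dropout_p, the various *_skip_rate entries and the balancer limits are not fixed hyperparameters but functions of the global batch_count, which is why the logged ans drifts as training progresses. Each schedule is essentially a piecewise-linear interpolation between a few (batch_count, value) breakpoints. The sketch below is illustrative only; the breakpoints are made up and the real ScheduledFloat in scaling.py carries extra machinery.

    class PiecewiseLinearSchedule:
        """A value defined by (batch_count, value) breakpoints, linearly interpolated."""

        def __init__(self, *points):
            self.points = sorted(points)

        def __call__(self, batch_count: float) -> float:
            pts = self.points
            if batch_count <= pts[0][0]:
                return pts[0][1]
            if batch_count >= pts[-1][0]:
                return pts[-1][1]
            for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
                if x0 <= batch_count <= x1:
                    w = (batch_count - x0) / (x1 - x0)
                    return y0 + w * (y1 - y0)

    # e.g. a dropout that anneals from 0.3 to 0.1 over the first 20k batches
    # and then stays flat (these breakpoints are hypothetical, not from the recipe):
    dropout_p = PiecewiseLinearSchedule((0, 0.3), (20000, 0.1))
    print(dropout_p(2035146.0))   # -> 0.1 at a batch_count this large
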
], batch size: 211, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:30:01,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2035806.0, ans=0.1 2023-06-25 17:30:06,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2035866.0, ans=0.07 2023-06-25 17:30:09,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2035866.0, ans=0.0 2023-06-25 17:30:29,100 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.42 vs. limit=22.5 2023-06-25 17:31:07,060 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.19 vs. limit=15.0 2023-06-25 17:31:07,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2035986.0, ans=0.125 2023-06-25 17:31:15,999 INFO [train.py:996] (2/4) Epoch 12, batch 3900, loss[loss=0.2294, simple_loss=0.2875, pruned_loss=0.08563, over 21490.00 frames. ], tot_loss[loss=0.2365, simple_loss=0.3081, pruned_loss=0.08246, over 4289334.03 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:32:41,378 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.65 vs. limit=15.0 2023-06-25 17:32:41,850 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.422e+02 7.358e+02 1.058e+03 1.629e+03 3.913e+03, threshold=2.115e+03, percent-clipped=2.0 2023-06-25 17:33:04,662 INFO [train.py:996] (2/4) Epoch 12, batch 3950, loss[loss=0.176, simple_loss=0.2604, pruned_loss=0.04578, over 21342.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3084, pruned_loss=0.08107, over 4284084.67 frames. ], batch size: 194, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:34:04,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2036466.0, ans=0.1 2023-06-25 17:34:40,259 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-25 17:34:56,653 INFO [train.py:996] (2/4) Epoch 12, batch 4000, loss[loss=0.2355, simple_loss=0.2929, pruned_loss=0.08903, over 15124.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3015, pruned_loss=0.07794, over 4272179.04 frames. ], batch size: 62, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:35:54,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.59 vs. limit=10.0 2023-06-25 17:36:05,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2036766.0, ans=0.1 2023-06-25 17:36:16,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2036826.0, ans=0.125 2023-06-25 17:36:29,331 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.50 vs. 
limit=15.0 2023-06-25 17:36:31,375 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.824e+02 9.183e+02 1.353e+03 2.384e+03 4.707e+03, threshold=2.707e+03, percent-clipped=29.0 2023-06-25 17:36:51,622 INFO [train.py:996] (2/4) Epoch 12, batch 4050, loss[loss=0.2121, simple_loss=0.293, pruned_loss=0.06559, over 21852.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3004, pruned_loss=0.07605, over 4274027.29 frames. ], batch size: 371, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:36:52,621 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=15.0 2023-06-25 17:37:14,780 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.53 vs. limit=22.5 2023-06-25 17:37:22,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2037006.0, ans=0.2 2023-06-25 17:37:22,527 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 17:37:27,876 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2037006.0, ans=0.1 2023-06-25 17:37:46,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2037066.0, ans=0.0 2023-06-25 17:38:41,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=7.48 vs. limit=15.0 2023-06-25 17:38:44,031 INFO [train.py:996] (2/4) Epoch 12, batch 4100, loss[loss=0.2222, simple_loss=0.3001, pruned_loss=0.07219, over 21266.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.303, pruned_loss=0.07698, over 4275933.31 frames. ], batch size: 159, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:39:21,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2037306.0, ans=0.125 2023-06-25 17:39:24,195 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.26 vs. limit=10.0 2023-06-25 17:40:15,661 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.136e+02 8.322e+02 1.179e+03 1.731e+03 4.243e+03, threshold=2.358e+03, percent-clipped=9.0 2023-06-25 17:40:37,178 INFO [train.py:996] (2/4) Epoch 12, batch 4150, loss[loss=0.2092, simple_loss=0.285, pruned_loss=0.06666, over 21053.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3025, pruned_loss=0.07486, over 4277253.22 frames. ], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:40:48,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2037546.0, ans=0.2 2023-06-25 17:41:26,989 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.45 vs. 
limit=15.0 2023-06-25 17:42:31,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2037786.0, ans=0.07 2023-06-25 17:42:31,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2037786.0, ans=0.1 2023-06-25 17:42:39,772 INFO [train.py:996] (2/4) Epoch 12, batch 4200, loss[loss=0.1925, simple_loss=0.2622, pruned_loss=0.0614, over 21482.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.3028, pruned_loss=0.07393, over 4275868.57 frames. ], batch size: 230, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:42:55,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2037846.0, ans=0.125 2023-06-25 17:43:47,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2038026.0, ans=0.04949747468305833 2023-06-25 17:43:54,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2038026.0, ans=0.0 2023-06-25 17:43:58,895 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.587e+02 1.298e+03 1.935e+03 2.585e+03 6.035e+03, threshold=3.870e+03, percent-clipped=37.0 2023-06-25 17:44:20,924 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.49 vs. limit=15.0 2023-06-25 17:44:21,886 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2038086.0, ans=0.125 2023-06-25 17:44:26,639 INFO [train.py:996] (2/4) Epoch 12, batch 4250, loss[loss=0.2694, simple_loss=0.3484, pruned_loss=0.0952, over 21760.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3105, pruned_loss=0.07639, over 4273421.08 frames. ], batch size: 351, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:44:47,272 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2038146.0, ans=0.125 2023-06-25 17:44:48,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2038206.0, ans=0.015 2023-06-25 17:45:01,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2038206.0, ans=0.125 2023-06-25 17:45:16,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2038266.0, ans=0.125 2023-06-25 17:45:27,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2038266.0, ans=0.1 2023-06-25 17:46:20,938 INFO [train.py:996] (2/4) Epoch 12, batch 4300, loss[loss=0.2256, simple_loss=0.3371, pruned_loss=0.05708, over 21627.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3165, pruned_loss=0.07806, over 4272200.58 frames. 
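A note on the Whitening lines: each "metric=X vs. limit=Y" entry comes from a diagnostic module that checks how far a layer's activations are from having a white (identity-like) covariance; the metric is 1.0 for perfectly white features, grows when the energy concentrates in a few directions, and a corrective penalty is applied only while it exceeds the logged limit. The sketch below shows one plausible way such a metric can be computed; it is an approximation for illustration and not necessarily the exact formula in scaling.py.

    import torch

    def whitening_metric(x: torch.Tensor) -> torch.Tensor:
        """x: (num_frames, num_channels). Returns ~1.0 when the (uncentered)
        covariance of x is a multiple of the identity; larger values mean the
        energy is concentrated in fewer directions."""
        num_frames, num_channels = x.shape
        cov = x.t() @ x / num_frames               # (C, C) covariance estimate
        mean_diag = cov.diagonal().mean()          # average per-channel power
        mean_sq = (cov ** 2).sum() / num_channels  # includes off-diagonal energy
        return mean_sq / (mean_diag ** 2 + 1e-20)

    x_white = torch.randn(10000, 256)                     # nearly white features
    x_collapsed = torch.randn(10000, 1).expand(-1, 256)   # rank-1, far from white
    print(whitening_metric(x_white))       # close to 1
    print(whitening_metric(x_collapsed))   # roughly num_channels
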
], batch size: 389, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:46:54,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2038506.0, ans=0.0 2023-06-25 17:46:58,840 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2038506.0, ans=0.95 2023-06-25 17:47:46,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2038626.0, ans=0.2 2023-06-25 17:47:49,730 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.415e+02 1.013e+03 1.560e+03 2.387e+03 5.571e+03, threshold=3.121e+03, percent-clipped=6.0 2023-06-25 17:48:21,411 INFO [train.py:996] (2/4) Epoch 12, batch 4350, loss[loss=0.2487, simple_loss=0.3233, pruned_loss=0.08708, over 21506.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3163, pruned_loss=0.07723, over 4272626.84 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:48:32,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2038746.0, ans=0.1 2023-06-25 17:48:35,893 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2038746.0, ans=0.0 2023-06-25 17:49:21,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2038866.0, ans=0.0 2023-06-25 17:49:52,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2038986.0, ans=0.1 2023-06-25 17:49:53,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2038986.0, ans=0.0 2023-06-25 17:49:56,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2038986.0, ans=0.1 2023-06-25 17:50:21,671 INFO [train.py:996] (2/4) Epoch 12, batch 4400, loss[loss=0.2739, simple_loss=0.3636, pruned_loss=0.09207, over 21574.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3129, pruned_loss=0.07691, over 4260094.11 frames. ], batch size: 441, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 17:51:17,260 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.39 vs. limit=22.5 2023-06-25 17:51:49,920 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.648e+02 9.673e+02 1.616e+03 2.348e+03 4.188e+03, threshold=3.231e+03, percent-clipped=7.0 2023-06-25 17:52:07,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2039286.0, ans=0.125 2023-06-25 17:52:17,330 INFO [train.py:996] (2/4) Epoch 12, batch 4450, loss[loss=0.2462, simple_loss=0.3344, pruned_loss=0.07898, over 21720.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.321, pruned_loss=0.07892, over 4267098.58 frames. ], batch size: 263, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:52:23,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2039346.0, ans=0.125 2023-06-25 17:52:30,623 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.94 vs. 
limit=15.0 2023-06-25 17:52:37,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2039346.0, ans=0.04949747468305833 2023-06-25 17:52:54,764 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2039406.0, ans=0.125 2023-06-25 17:53:52,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2039586.0, ans=0.0 2023-06-25 17:54:08,236 INFO [train.py:996] (2/4) Epoch 12, batch 4500, loss[loss=0.2427, simple_loss=0.3175, pruned_loss=0.08393, over 21908.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3209, pruned_loss=0.08039, over 4268165.54 frames. ], batch size: 107, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:54:11,544 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.05 vs. limit=15.0 2023-06-25 17:54:33,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2039706.0, ans=0.2 2023-06-25 17:55:07,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2039766.0, ans=0.125 2023-06-25 17:55:20,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2039826.0, ans=0.125 2023-06-25 17:55:36,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2039826.0, ans=0.1 2023-06-25 17:55:41,747 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.020e+02 9.450e+02 1.326e+03 2.168e+03 4.753e+03, threshold=2.653e+03, percent-clipped=7.0 2023-06-25 17:55:45,084 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.99 vs. limit=15.0 2023-06-25 17:56:06,449 INFO [train.py:996] (2/4) Epoch 12, batch 4550, loss[loss=0.2712, simple_loss=0.3494, pruned_loss=0.09654, over 21449.00 frames. ], tot_loss[loss=0.2426, simple_loss=0.3236, pruned_loss=0.08078, over 4275581.01 frames. ], batch size: 131, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:56:19,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2039946.0, ans=0.05 2023-06-25 17:56:24,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2040006.0, ans=10.0 2023-06-25 17:56:53,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2040066.0, ans=0.0 2023-06-25 17:57:14,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2040066.0, ans=0.035 2023-06-25 17:57:59,513 INFO [train.py:996] (2/4) Epoch 12, batch 4600, loss[loss=0.2713, simple_loss=0.3404, pruned_loss=0.1011, over 21312.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3238, pruned_loss=0.08177, over 4280785.36 frames. 
], batch size: 143, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:58:00,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2040246.0, ans=0.125 2023-06-25 17:58:34,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2040306.0, ans=0.0 2023-06-25 17:59:34,148 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.831e+02 1.036e+03 1.391e+03 1.906e+03 3.939e+03, threshold=2.783e+03, percent-clipped=3.0 2023-06-25 17:59:53,338 INFO [train.py:996] (2/4) Epoch 12, batch 4650, loss[loss=0.2117, simple_loss=0.3309, pruned_loss=0.04622, over 20914.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3192, pruned_loss=0.08047, over 4285921.50 frames. ], batch size: 608, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 17:59:57,481 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2040546.0, ans=0.1 2023-06-25 18:01:39,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=2040786.0, ans=0.95 2023-06-25 18:01:39,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2040786.0, ans=0.07 2023-06-25 18:01:47,031 INFO [train.py:996] (2/4) Epoch 12, batch 4700, loss[loss=0.2203, simple_loss=0.2812, pruned_loss=0.07966, over 21607.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3115, pruned_loss=0.07852, over 4286235.07 frames. ], batch size: 247, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:01:48,038 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2040846.0, ans=0.09899494936611666 2023-06-25 18:01:51,608 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2040846.0, ans=0.125 2023-06-25 18:02:02,877 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2040846.0, ans=10.0 2023-06-25 18:02:45,657 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.37 vs. limit=10.0 2023-06-25 18:03:01,187 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2041026.0, ans=0.0 2023-06-25 18:03:01,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2041026.0, ans=0.04949747468305833 2023-06-25 18:03:19,778 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.793e+02 9.702e+02 1.366e+03 2.356e+03 4.995e+03, threshold=2.732e+03, percent-clipped=18.0 2023-06-25 18:03:39,177 INFO [train.py:996] (2/4) Epoch 12, batch 4750, loss[loss=0.2662, simple_loss=0.4037, pruned_loss=0.06432, over 20753.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3059, pruned_loss=0.07785, over 4287951.46 frames. 
], batch size: 607, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:04:33,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2041266.0, ans=0.0 2023-06-25 18:04:37,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2041266.0, ans=0.1 2023-06-25 18:05:29,029 INFO [train.py:996] (2/4) Epoch 12, batch 4800, loss[loss=0.2102, simple_loss=0.2867, pruned_loss=0.06683, over 21319.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3042, pruned_loss=0.07783, over 4292621.97 frames. ], batch size: 159, lr: 2.46e-03, grad_scale: 32.0 2023-06-25 18:06:27,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2041566.0, ans=0.0 2023-06-25 18:06:38,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer2.prob, batch_count=2041626.0, ans=0.125 2023-06-25 18:06:38,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2041626.0, ans=0.0 2023-06-25 18:06:56,309 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.862e+02 9.240e+02 1.237e+03 1.888e+03 3.806e+03, threshold=2.475e+03, percent-clipped=7.0 2023-06-25 18:06:58,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2041686.0, ans=0.125 2023-06-25 18:07:00,265 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2041686.0, ans=0.125 2023-06-25 18:07:13,731 INFO [train.py:996] (2/4) Epoch 12, batch 4850, loss[loss=0.2253, simple_loss=0.3012, pruned_loss=0.0747, over 21835.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3055, pruned_loss=0.077, over 4297931.53 frames. ], batch size: 332, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:07:28,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2041746.0, ans=0.0 2023-06-25 18:07:29,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2041806.0, ans=0.0 2023-06-25 18:08:00,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2041866.0, ans=0.1 2023-06-25 18:08:28,980 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.82 vs. limit=15.0 2023-06-25 18:09:03,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2041986.0, ans=0.2 2023-06-25 18:09:03,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2041986.0, ans=0.125 2023-06-25 18:09:06,965 INFO [train.py:996] (2/4) Epoch 12, batch 4900, loss[loss=0.2339, simple_loss=0.3049, pruned_loss=0.08142, over 16694.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3073, pruned_loss=0.07827, over 4286966.10 frames. ], batch size: 63, lr: 2.46e-03, grad_scale: 16.0 2023-06-25 18:09:27,119 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.49 vs. 
limit=15.0 2023-06-25 18:09:38,662 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2042106.0, ans=0.1 2023-06-25 18:09:55,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2042166.0, ans=0.0 2023-06-25 18:09:56,791 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2042166.0, ans=0.1 2023-06-25 18:10:09,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2042166.0, ans=0.125 2023-06-25 18:10:35,319 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2042286.0, ans=0.0 2023-06-25 18:10:36,709 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.253e+02 9.394e+02 1.371e+03 2.232e+03 4.474e+03, threshold=2.741e+03, percent-clipped=21.0 2023-06-25 18:10:38,834 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=6.71 vs. limit=15.0 2023-06-25 18:10:55,747 INFO [train.py:996] (2/4) Epoch 12, batch 4950, loss[loss=0.1923, simple_loss=0.2735, pruned_loss=0.05556, over 20823.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3104, pruned_loss=0.07667, over 4279183.63 frames. ], batch size: 607, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:10:56,321 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2042346.0, ans=0.0 2023-06-25 18:11:18,693 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.08 vs. limit=15.0 2023-06-25 18:11:56,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2042466.0, ans=0.125 2023-06-25 18:12:19,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2042526.0, ans=0.125 2023-06-25 18:12:33,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2042586.0, ans=0.125 2023-06-25 18:12:45,030 INFO [train.py:996] (2/4) Epoch 12, batch 5000, loss[loss=0.2264, simple_loss=0.3026, pruned_loss=0.0751, over 21863.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3086, pruned_loss=0.07334, over 4281032.82 frames. ], batch size: 351, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:13:17,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2042706.0, ans=0.125 2023-06-25 18:13:45,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2042766.0, ans=0.125 2023-06-25 18:14:19,288 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 9.354e+02 1.620e+03 2.183e+03 4.157e+03, threshold=3.240e+03, percent-clipped=13.0 2023-06-25 18:14:23,658 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.72 vs. 
limit=12.0 2023-06-25 18:14:25,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2042886.0, ans=0.125 2023-06-25 18:14:34,572 INFO [train.py:996] (2/4) Epoch 12, batch 5050, loss[loss=0.2309, simple_loss=0.3038, pruned_loss=0.07902, over 21351.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.308, pruned_loss=0.07517, over 4288427.15 frames. ], batch size: 176, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:15:10,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-25 18:16:23,004 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2043246.0, ans=0.1 2023-06-25 18:16:24,408 INFO [train.py:996] (2/4) Epoch 12, batch 5100, loss[loss=0.2516, simple_loss=0.3133, pruned_loss=0.09495, over 21370.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.3068, pruned_loss=0.07626, over 4296945.47 frames. ], batch size: 143, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:16:51,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys.whitening_limit, batch_count=2043306.0, ans=6.0 2023-06-25 18:18:00,377 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.858e+02 8.212e+02 1.178e+03 1.481e+03 2.753e+03, threshold=2.355e+03, percent-clipped=0.0 2023-06-25 18:18:15,854 INFO [train.py:996] (2/4) Epoch 12, batch 5150, loss[loss=0.231, simple_loss=0.2974, pruned_loss=0.08231, over 21639.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3043, pruned_loss=0.07676, over 4299377.97 frames. ], batch size: 230, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:18:21,282 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2043546.0, ans=0.125 2023-06-25 18:19:19,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2043666.0, ans=0.125 2023-06-25 18:19:46,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2043726.0, ans=0.125 2023-06-25 18:20:07,387 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2043786.0, ans=0.1 2023-06-25 18:20:10,112 INFO [train.py:996] (2/4) Epoch 12, batch 5200, loss[loss=0.263, simple_loss=0.3638, pruned_loss=0.08108, over 21705.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.307, pruned_loss=0.07787, over 4291415.84 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:20:10,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2043846.0, ans=0.125 2023-06-25 18:20:43,068 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.61 vs. 
limit=12.0 2023-06-25 18:20:46,058 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2043906.0, ans=0.125 2023-06-25 18:21:07,183 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2043966.0, ans=0.125 2023-06-25 18:21:42,280 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2044086.0, ans=0.1 2023-06-25 18:21:42,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2044086.0, ans=0.2 2023-06-25 18:21:44,984 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.871e+02 1.084e+03 1.828e+03 2.790e+03 5.969e+03, threshold=3.657e+03, percent-clipped=36.0 2023-06-25 18:22:00,092 INFO [train.py:996] (2/4) Epoch 12, batch 5250, loss[loss=0.2111, simple_loss=0.2885, pruned_loss=0.06689, over 21874.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3091, pruned_loss=0.07576, over 4284129.03 frames. ], batch size: 107, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:23:07,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.01 vs. limit=22.5 2023-06-25 18:23:10,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2044326.0, ans=0.2 2023-06-25 18:23:39,765 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2044386.0, ans=0.125 2023-06-25 18:23:52,664 INFO [train.py:996] (2/4) Epoch 12, batch 5300, loss[loss=0.2465, simple_loss=0.3158, pruned_loss=0.08856, over 21933.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3099, pruned_loss=0.07692, over 4289364.71 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:24:20,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2044506.0, ans=0.125 2023-06-25 18:25:19,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2044686.0, ans=0.0 2023-06-25 18:25:20,852 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.228e+02 8.444e+02 1.453e+03 2.194e+03 4.490e+03, threshold=2.906e+03, percent-clipped=2.0 2023-06-25 18:25:39,050 INFO [train.py:996] (2/4) Epoch 12, batch 5350, loss[loss=0.2239, simple_loss=0.2936, pruned_loss=0.07707, over 21941.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3082, pruned_loss=0.07747, over 4294091.88 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:25:43,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2044746.0, ans=0.125 2023-06-25 18:25:52,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2044746.0, ans=0.125 2023-06-25 18:26:26,531 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2044866.0, ans=0.1 2023-06-25 18:27:08,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. 
limit=10.0 2023-06-25 18:27:28,468 INFO [train.py:996] (2/4) Epoch 12, batch 5400, loss[loss=0.2248, simple_loss=0.3049, pruned_loss=0.07235, over 21695.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3073, pruned_loss=0.07861, over 4293755.00 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:27:54,400 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:27:58,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2045106.0, ans=0.0 2023-06-25 18:28:06,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2045106.0, ans=0.015 2023-06-25 18:28:15,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2045166.0, ans=0.2 2023-06-25 18:28:39,507 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.81 vs. limit=15.0 2023-06-25 18:28:43,057 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.64 vs. limit=22.5 2023-06-25 18:29:03,878 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.530e+02 9.524e+02 1.484e+03 2.155e+03 3.165e+03, threshold=2.968e+03, percent-clipped=5.0 2023-06-25 18:29:13,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2045286.0, ans=0.125 2023-06-25 18:29:17,837 INFO [train.py:996] (2/4) Epoch 12, batch 5450, loss[loss=0.2732, simple_loss=0.3658, pruned_loss=0.0903, over 21276.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.3072, pruned_loss=0.07678, over 4297849.65 frames. ], batch size: 548, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:29:18,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2045346.0, ans=0.0 2023-06-25 18:30:06,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2045406.0, ans=0.95 2023-06-25 18:30:08,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2045466.0, ans=0.05 2023-06-25 18:30:16,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2045466.0, ans=0.2 2023-06-25 18:30:47,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=15.0 2023-06-25 18:30:48,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2045586.0, ans=0.125 2023-06-25 18:30:51,629 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2045586.0, ans=0.0 2023-06-25 18:30:53,732 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2045586.0, ans=0.125 2023-06-25 18:31:20,613 INFO [train.py:996] (2/4) Epoch 12, batch 5500, loss[loss=0.2684, simple_loss=0.365, pruned_loss=0.0859, over 21680.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3134, pruned_loss=0.07417, over 4296454.82 frames. 
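A note on the grad_scale field in the batch summaries (it moves between values like 32.0, 16.0 and 8.0 in this stretch of the log): this is the dynamic loss-scaling factor used for fp16 training. The loss is multiplied by this factor before backward so that small gradients do not underflow in half precision; the factor is cut when inf/NaN gradients are detected and grows back slowly otherwise. Below is a generic PyTorch AMP step showing the mechanism, using standard torch.cuda.amp calls rather than the project's actual training loop; model, optimizer, loss_fn and the batch tensors are placeholders.

    import torch

    def train_step(model, optimizer, scaler, features, targets, loss_fn):
        """One mixed-precision step with dynamic loss scaling (illustrative)."""
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(features), targets)
        scaler.scale(loss).backward()   # backward on the scaled loss
        scaler.step(optimizer)          # unscales grads; skips the step on inf/NaN
        scaler.update()                 # shrink scale after overflow, grow otherwise
        return loss.detach(), scaler.get_scale()

    scaler = torch.cuda.amp.GradScaler()   # get_scale() is the value logged as grad_scale
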
], batch size: 414, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:31:27,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2045646.0, ans=0.07 2023-06-25 18:31:30,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2045646.0, ans=0.2 2023-06-25 18:31:48,713 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.48 vs. limit=15.0 2023-06-25 18:31:51,649 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2045706.0, ans=0.0 2023-06-25 18:32:54,283 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.103e+02 8.816e+02 1.255e+03 2.207e+03 4.493e+03, threshold=2.511e+03, percent-clipped=9.0 2023-06-25 18:33:15,534 INFO [train.py:996] (2/4) Epoch 12, batch 5550, loss[loss=0.2018, simple_loss=0.299, pruned_loss=0.05234, over 21760.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3133, pruned_loss=0.07167, over 4290500.75 frames. ], batch size: 371, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 18:33:34,851 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2045946.0, ans=0.125 2023-06-25 18:35:15,789 INFO [train.py:996] (2/4) Epoch 12, batch 5600, loss[loss=0.2461, simple_loss=0.3461, pruned_loss=0.07306, over 21707.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.31, pruned_loss=0.06851, over 4284092.95 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:35:23,826 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2046246.0, ans=0.0 2023-06-25 18:35:30,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2046246.0, ans=0.125 2023-06-25 18:35:51,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2046306.0, ans=0.125 2023-06-25 18:36:05,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=2046366.0, ans=0.05 2023-06-25 18:36:42,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2046486.0, ans=0.125 2023-06-25 18:36:43,819 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.414e+02 9.096e+02 1.468e+03 2.250e+03 5.132e+03, threshold=2.936e+03, percent-clipped=21.0 2023-06-25 18:36:44,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2046486.0, ans=0.1 2023-06-25 18:37:03,767 INFO [train.py:996] (2/4) Epoch 12, batch 5650, loss[loss=0.2477, simple_loss=0.3048, pruned_loss=0.09525, over 21251.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3143, pruned_loss=0.07196, over 4285897.78 frames. ], batch size: 608, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:38:53,206 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=2046846.0, ans=10.0 2023-06-25 18:38:53,838 INFO [train.py:996] (2/4) Epoch 12, batch 5700, loss[loss=0.2078, simple_loss=0.3127, pruned_loss=0.05151, over 21237.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3137, pruned_loss=0.07381, over 4290888.06 frames. 
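A note on the learning rate in the batch summaries: it decays very slowly at this point (2.46e-03 earlier in the epoch, 2.45e-03 here) because the scheduler decays as a gentle inverse power of both the batch index and the epoch. The function below sketches an Eden-style schedule of that shape; the default values of lr_batches and lr_epochs are placeholders, and warm-up plus any reference-duration scaling are omitted, so treat it as an approximation rather than the exact optim.py code.

    def eden_lr(base_lr: float, batch: int, epoch: float,
                lr_batches: float = 5000.0, lr_epochs: float = 4.0) -> float:
        """Approximate Eden-style decay in both batch index and (fractional) epoch."""
        batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
        epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
        return base_lr * batch_factor * epoch_factor
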
], batch size: 548, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:39:45,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2046966.0, ans=0.1 2023-06-25 18:40:02,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2046966.0, ans=0.1 2023-06-25 18:40:32,250 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-25 18:40:34,240 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.374e+02 7.754e+02 1.090e+03 1.653e+03 4.716e+03, threshold=2.180e+03, percent-clipped=6.0 2023-06-25 18:40:48,815 INFO [train.py:996] (2/4) Epoch 12, batch 5750, loss[loss=0.1722, simple_loss=0.2693, pruned_loss=0.03753, over 21802.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3092, pruned_loss=0.07031, over 4287680.50 frames. ], batch size: 371, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:42:04,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.58 vs. limit=15.0 2023-06-25 18:42:40,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=15.0 2023-06-25 18:42:46,117 INFO [train.py:996] (2/4) Epoch 12, batch 5800, loss[loss=0.2353, simple_loss=0.3279, pruned_loss=0.07138, over 21631.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3089, pruned_loss=0.06943, over 4289679.71 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:43:24,653 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.43 vs. limit=6.0 2023-06-25 18:43:49,058 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.27 vs. limit=22.5 2023-06-25 18:44:17,450 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.217e+02 1.046e+03 1.663e+03 2.546e+03 5.272e+03, threshold=3.326e+03, percent-clipped=34.0 2023-06-25 18:44:42,285 INFO [train.py:996] (2/4) Epoch 12, batch 5850, loss[loss=0.2575, simple_loss=0.3412, pruned_loss=0.08688, over 20157.00 frames. ], tot_loss[loss=0.2202, simple_loss=0.3081, pruned_loss=0.06617, over 4282750.26 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:44:42,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2047746.0, ans=0.05 2023-06-25 18:44:44,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2047746.0, ans=0.2 2023-06-25 18:45:17,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2047806.0, ans=0.07 2023-06-25 18:46:12,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.61 vs. limit=10.0 2023-06-25 18:46:29,674 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2047986.0, ans=0.0 2023-06-25 18:46:34,296 INFO [train.py:996] (2/4) Epoch 12, batch 5900, loss[loss=0.1612, simple_loss=0.2606, pruned_loss=0.03095, over 21822.00 frames. 
], tot_loss[loss=0.2106, simple_loss=0.2998, pruned_loss=0.06065, over 4287818.37 frames. ], batch size: 316, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:46:35,091 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2048046.0, ans=0.0 2023-06-25 18:47:23,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-25 18:47:29,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2048166.0, ans=0.1 2023-06-25 18:47:42,019 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.00 vs. limit=15.0 2023-06-25 18:48:00,967 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.261e+02 8.415e+02 1.286e+03 1.722e+03 5.284e+03, threshold=2.571e+03, percent-clipped=3.0 2023-06-25 18:48:22,676 INFO [train.py:996] (2/4) Epoch 12, batch 5950, loss[loss=0.1979, simple_loss=0.2617, pruned_loss=0.06709, over 21627.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2997, pruned_loss=0.06364, over 4293479.13 frames. ], batch size: 263, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:49:02,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=15.0 2023-06-25 18:49:08,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2048466.0, ans=0.0 2023-06-25 18:49:10,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2048466.0, ans=0.125 2023-06-25 18:49:20,630 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.98 vs. limit=6.0 2023-06-25 18:49:50,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2048586.0, ans=0.07 2023-06-25 18:50:19,472 INFO [train.py:996] (2/4) Epoch 12, batch 6000, loss[loss=0.2012, simple_loss=0.2633, pruned_loss=0.0695, over 21746.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2958, pruned_loss=0.06652, over 4283419.21 frames. ], batch size: 300, lr: 2.45e-03, grad_scale: 32.0 2023-06-25 18:50:19,473 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 18:50:36,771 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2561, simple_loss=0.3516, pruned_loss=0.08031, over 1796401.00 frames. 2023-06-25 18:50:36,772 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 18:50:54,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2048646.0, ans=0.07 2023-06-25 18:51:12,940 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2048706.0, ans=0.2 2023-06-25 18:51:21,043 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.07 vs. 
limit=15.0 2023-06-25 18:51:52,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2048826.0, ans=0.2 2023-06-25 18:52:09,094 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2048886.0, ans=0.0 2023-06-25 18:52:13,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.382e+02 1.134e+03 1.580e+03 2.203e+03 4.281e+03, threshold=3.160e+03, percent-clipped=13.0 2023-06-25 18:52:25,730 INFO [train.py:996] (2/4) Epoch 12, batch 6050, loss[loss=0.1984, simple_loss=0.2654, pruned_loss=0.06564, over 22025.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2914, pruned_loss=0.06798, over 4280108.45 frames. ], batch size: 103, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:52:28,545 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2048946.0, ans=0.125 2023-06-25 18:53:27,812 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 18:53:51,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2049126.0, ans=0.1 2023-06-25 18:54:14,006 INFO [train.py:996] (2/4) Epoch 12, batch 6100, loss[loss=0.2008, simple_loss=0.2846, pruned_loss=0.05849, over 21735.00 frames. ], tot_loss[loss=0.2129, simple_loss=0.2912, pruned_loss=0.06734, over 4270388.93 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:54:28,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2049246.0, ans=0.125 2023-06-25 18:54:39,119 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2049306.0, ans=0.05 2023-06-25 18:54:53,877 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.07 vs. limit=12.0 2023-06-25 18:55:09,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2049366.0, ans=0.125 2023-06-25 18:55:36,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2049426.0, ans=0.125 2023-06-25 18:55:51,217 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.240e+02 1.094e+03 1.641e+03 2.565e+03 7.490e+03, threshold=3.281e+03, percent-clipped=16.0 2023-06-25 18:56:03,314 INFO [train.py:996] (2/4) Epoch 12, batch 6150, loss[loss=0.2222, simple_loss=0.2898, pruned_loss=0.07725, over 21901.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2929, pruned_loss=0.0696, over 4284019.92 frames. ], batch size: 98, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:57:16,444 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2049726.0, ans=0.2 2023-06-25 18:57:56,203 INFO [train.py:996] (2/4) Epoch 12, batch 6200, loss[loss=0.2054, simple_loss=0.2962, pruned_loss=0.05731, over 21696.00 frames. ], tot_loss[loss=0.2187, simple_loss=0.296, pruned_loss=0.07074, over 4284174.79 frames. 
], batch size: 247, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:58:17,414 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2049906.0, ans=0.125 2023-06-25 18:58:50,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2049966.0, ans=0.0 2023-06-25 18:59:22,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2050086.0, ans=0.1 2023-06-25 18:59:28,923 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 1.039e+03 1.350e+03 2.072e+03 4.321e+03, threshold=2.700e+03, percent-clipped=2.0 2023-06-25 18:59:36,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2050086.0, ans=0.0 2023-06-25 18:59:45,868 INFO [train.py:996] (2/4) Epoch 12, batch 6250, loss[loss=0.2704, simple_loss=0.3713, pruned_loss=0.08472, over 21523.00 frames. ], tot_loss[loss=0.2223, simple_loss=0.3021, pruned_loss=0.07119, over 4282488.78 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 18:59:46,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2050146.0, ans=0.125 2023-06-25 19:00:34,503 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.41 vs. limit=15.0 2023-06-25 19:01:00,215 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2050326.0, ans=0.1 2023-06-25 19:01:29,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=12.0 2023-06-25 19:01:34,300 INFO [train.py:996] (2/4) Epoch 12, batch 6300, loss[loss=0.2177, simple_loss=0.2884, pruned_loss=0.07354, over 20002.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3064, pruned_loss=0.06957, over 4281028.71 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:01:40,172 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2050446.0, ans=0.0 2023-06-25 19:02:59,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2050626.0, ans=0.2 2023-06-25 19:03:04,426 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2050686.0, ans=0.1 2023-06-25 19:03:12,172 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.168e+02 9.742e+02 1.547e+03 2.074e+03 3.988e+03, threshold=3.093e+03, percent-clipped=9.0 2023-06-25 19:03:22,548 INFO [train.py:996] (2/4) Epoch 12, batch 6350, loss[loss=0.2936, simple_loss=0.3595, pruned_loss=0.1139, over 21504.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3089, pruned_loss=0.07328, over 4291382.74 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:04:05,557 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2050806.0, ans=0.2 2023-06-25 19:04:32,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2050926.0, ans=0.0 2023-06-25 19:05:15,824 INFO [train.py:996] (2/4) Epoch 12, batch 6400, loss[loss=0.2591, simple_loss=0.3464, pruned_loss=0.08592, over 21794.00 frames. 
], tot_loss[loss=0.2358, simple_loss=0.3152, pruned_loss=0.0782, over 4291520.21 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:05:16,249 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2051046.0, ans=0.125 2023-06-25 19:05:29,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2051046.0, ans=0.125 2023-06-25 19:06:41,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2051226.0, ans=0.125 2023-06-25 19:06:45,400 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.97 vs. limit=12.0 2023-06-25 19:06:54,049 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.078e+02 9.080e+02 1.264e+03 1.614e+03 4.055e+03, threshold=2.529e+03, percent-clipped=6.0 2023-06-25 19:06:54,925 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2051286.0, ans=0.1 2023-06-25 19:07:09,054 INFO [train.py:996] (2/4) Epoch 12, batch 6450, loss[loss=0.2126, simple_loss=0.3032, pruned_loss=0.06098, over 21561.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3167, pruned_loss=0.07752, over 4282228.61 frames. ], batch size: 389, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:08:35,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2051586.0, ans=0.125 2023-06-25 19:08:59,128 INFO [train.py:996] (2/4) Epoch 12, batch 6500, loss[loss=0.2466, simple_loss=0.3016, pruned_loss=0.09584, over 21247.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3089, pruned_loss=0.07643, over 4277693.38 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:09:01,395 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2051646.0, ans=0.05 2023-06-25 19:09:56,909 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2051766.0, ans=0.0 2023-06-25 19:10:35,314 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.489e+02 7.750e+02 1.055e+03 1.738e+03 4.061e+03, threshold=2.109e+03, percent-clipped=10.0 2023-06-25 19:10:36,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=12.0 2023-06-25 19:10:52,982 INFO [train.py:996] (2/4) Epoch 12, batch 6550, loss[loss=0.2494, simple_loss=0.3203, pruned_loss=0.08927, over 21723.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3074, pruned_loss=0.07614, over 4275567.94 frames. 
], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:10:53,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2051946.0, ans=0.125 2023-06-25 19:11:16,532 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2052006.0, ans=0.1 2023-06-25 19:11:39,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2052066.0, ans=0.125 2023-06-25 19:11:41,865 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2052066.0, ans=0.2 2023-06-25 19:11:55,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2052126.0, ans=0.125 2023-06-25 19:12:05,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2052126.0, ans=0.0 2023-06-25 19:12:12,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2052186.0, ans=0.125 2023-06-25 19:12:42,078 INFO [train.py:996] (2/4) Epoch 12, batch 6600, loss[loss=0.1795, simple_loss=0.2443, pruned_loss=0.0573, over 21642.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3015, pruned_loss=0.07567, over 4277616.26 frames. ], batch size: 247, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:13:19,511 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-25 19:13:37,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2052366.0, ans=0.125 2023-06-25 19:13:44,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2052426.0, ans=0.0 2023-06-25 19:13:46,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2052426.0, ans=0.125 2023-06-25 19:14:22,503 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.983e+02 6.750e+02 1.048e+03 1.422e+03 4.566e+03, threshold=2.096e+03, percent-clipped=9.0 2023-06-25 19:14:26,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2052486.0, ans=0.025 2023-06-25 19:14:36,289 INFO [train.py:996] (2/4) Epoch 12, batch 6650, loss[loss=0.1813, simple_loss=0.2725, pruned_loss=0.04508, over 21727.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2938, pruned_loss=0.0722, over 4275605.14 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:15:06,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2052606.0, ans=0.0 2023-06-25 19:16:24,132 INFO [train.py:996] (2/4) Epoch 12, batch 6700, loss[loss=0.2245, simple_loss=0.3131, pruned_loss=0.06794, over 21099.00 frames. ], tot_loss[loss=0.2169, simple_loss=0.2901, pruned_loss=0.07189, over 4279556.84 frames. 
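The ScheduledFloat entries above report, for a named hyper-parameter, the value ("ans") in effect at the current batch_count. A minimal sketch of such a batch-count-keyed schedule is given below, assuming piecewise-linear interpolation between breakpoints; the breakpoint values are made up for illustration and are not the schedules configured in scaling.py.

# A float hyper-parameter scheduled on the global batch count (assumed behaviour:
# piecewise-linear interpolation between (batch_count, value) breakpoints, clamped
# at both ends).
import bisect

class ScheduledValue:
    def __init__(self, *points):
        # points: (batch_count, value) pairs in increasing batch_count order
        self.xs = [p[0] for p in points]
        self.ys = [p[1] for p in points]

    def value_at(self, batch_count):
        if batch_count <= self.xs[0]:
            return self.ys[0]
        if batch_count >= self.xs[-1]:
            return self.ys[-1]
        i = bisect.bisect_right(self.xs, batch_count)
        x0, x1 = self.xs[i - 1], self.xs[i]
        y0, y1 = self.ys[i - 1], self.ys[i]
        return y0 + (y1 - y0) * (batch_count - x0) / (x1 - x0)

# Example: a skip-rate that decays early in training and then stays at its final
# value; by batch_count ~2.05e6 (as above) such a schedule has long since flattened,
# which is why most "ans=" fields in this part of the log are constants like 0.0 or 0.125.
conv_skip_rate = ScheduledValue((0.0, 0.5), (20000.0, 0.05), (50000.0, 0.0))
print(conv_skip_rate.value_at(2052966.0))   # -> 0.0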
], batch size: 608, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:17:04,628 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2052966.0, ans=0.125 2023-06-25 19:17:05,425 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.80 vs. limit=10.0 2023-06-25 19:17:59,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.379e+02 9.118e+02 1.454e+03 2.169e+03 5.767e+03, threshold=2.907e+03, percent-clipped=27.0 2023-06-25 19:18:13,682 INFO [train.py:996] (2/4) Epoch 12, batch 6750, loss[loss=0.2035, simple_loss=0.2738, pruned_loss=0.06664, over 21805.00 frames. ], tot_loss[loss=0.2171, simple_loss=0.2893, pruned_loss=0.07243, over 4287098.72 frames. ], batch size: 102, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:18:58,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2053266.0, ans=0.125 2023-06-25 19:19:19,200 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2053326.0, ans=0.125 2023-06-25 19:20:02,654 INFO [train.py:996] (2/4) Epoch 12, batch 6800, loss[loss=0.2371, simple_loss=0.2976, pruned_loss=0.08835, over 21731.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.292, pruned_loss=0.07455, over 4278111.32 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:20:08,252 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2053446.0, ans=0.1 2023-06-25 19:20:16,414 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:20:29,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2053506.0, ans=0.1 2023-06-25 19:20:59,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2053626.0, ans=0.125 2023-06-25 19:20:59,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2053626.0, ans=0.2 2023-06-25 19:21:21,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2053686.0, ans=0.2 2023-06-25 19:21:27,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2053686.0, ans=0.125 2023-06-25 19:21:28,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2053686.0, ans=0.2 2023-06-25 19:21:29,042 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.142e+02 1.053e+03 1.480e+03 2.142e+03 3.474e+03, threshold=2.960e+03, percent-clipped=10.0 2023-06-25 19:21:42,875 INFO [train.py:996] (2/4) Epoch 12, batch 6850, loss[loss=0.2512, simple_loss=0.312, pruned_loss=0.09518, over 21427.00 frames. ], tot_loss[loss=0.221, simple_loss=0.2903, pruned_loss=0.07582, over 4277626.57 frames. 
], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:21:50,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2053746.0, ans=0.0 2023-06-25 19:22:41,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2053866.0, ans=0.0 2023-06-25 19:22:45,425 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2053926.0, ans=0.125 2023-06-25 19:22:53,450 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=6.16 vs. limit=15.0 2023-06-25 19:23:39,689 INFO [train.py:996] (2/4) Epoch 12, batch 6900, loss[loss=0.2302, simple_loss=0.3065, pruned_loss=0.0769, over 21787.00 frames. ], tot_loss[loss=0.2205, simple_loss=0.2903, pruned_loss=0.07537, over 4283238.81 frames. ], batch size: 441, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:23:47,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2054046.0, ans=0.1 2023-06-25 19:24:10,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2054106.0, ans=0.125 2023-06-25 19:24:15,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2054106.0, ans=0.07 2023-06-25 19:25:03,908 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.99 vs. limit=15.0 2023-06-25 19:25:14,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2054286.0, ans=0.125 2023-06-25 19:25:17,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.66 vs. limit=5.0 2023-06-25 19:25:20,642 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.318e+02 8.019e+02 1.199e+03 1.523e+03 3.693e+03, threshold=2.398e+03, percent-clipped=1.0 2023-06-25 19:25:28,995 INFO [train.py:996] (2/4) Epoch 12, batch 6950, loss[loss=0.2306, simple_loss=0.3046, pruned_loss=0.07833, over 21211.00 frames. ], tot_loss[loss=0.2181, simple_loss=0.2919, pruned_loss=0.07213, over 4272954.19 frames. ], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:25:50,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2054406.0, ans=0.0 2023-06-25 19:26:04,180 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:27:17,490 INFO [train.py:996] (2/4) Epoch 12, batch 7000, loss[loss=0.2222, simple_loss=0.276, pruned_loss=0.08426, over 21241.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.295, pruned_loss=0.0741, over 4278498.75 frames. 
], batch size: 159, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:27:19,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2054646.0, ans=0.125 2023-06-25 19:27:37,902 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2054706.0, ans=0.1 2023-06-25 19:27:55,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2054766.0, ans=0.1 2023-06-25 19:28:06,262 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.85 vs. limit=15.0 2023-06-25 19:28:47,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.66 vs. limit=15.0 2023-06-25 19:28:53,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2054886.0, ans=0.0 2023-06-25 19:28:56,677 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.600e+02 9.157e+02 1.451e+03 1.908e+03 5.399e+03, threshold=2.901e+03, percent-clipped=14.0 2023-06-25 19:29:05,357 INFO [train.py:996] (2/4) Epoch 12, batch 7050, loss[loss=0.2452, simple_loss=0.3328, pruned_loss=0.07876, over 21447.00 frames. ], tot_loss[loss=0.2197, simple_loss=0.2934, pruned_loss=0.07303, over 4275192.27 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:30:02,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2055066.0, ans=0.2 2023-06-25 19:30:27,394 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2055126.0, ans=0.125 2023-06-25 19:30:27,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2055126.0, ans=0.0 2023-06-25 19:30:54,085 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.09 vs. limit=15.0 2023-06-25 19:30:58,059 INFO [train.py:996] (2/4) Epoch 12, batch 7100, loss[loss=0.2354, simple_loss=0.3125, pruned_loss=0.07918, over 21776.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2974, pruned_loss=0.0741, over 4276102.79 frames. ], batch size: 124, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:31:50,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2055366.0, ans=0.1 2023-06-25 19:32:32,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.001e+02 9.554e+02 1.238e+03 1.810e+03 4.265e+03, threshold=2.476e+03, percent-clipped=5.0 2023-06-25 19:32:44,680 INFO [train.py:996] (2/4) Epoch 12, batch 7150, loss[loss=0.211, simple_loss=0.2919, pruned_loss=0.06498, over 21820.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2947, pruned_loss=0.07153, over 4274945.46 frames. ], batch size: 282, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:32:49,240 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.88 vs. limit=15.0 2023-06-25 19:32:49,386 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.74 vs. 
limit=15.0 2023-06-25 19:34:05,438 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-25 19:34:30,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2055846.0, ans=0.125 2023-06-25 19:34:31,906 INFO [train.py:996] (2/4) Epoch 12, batch 7200, loss[loss=0.1947, simple_loss=0.2668, pruned_loss=0.06127, over 21630.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3014, pruned_loss=0.07528, over 4268004.27 frames. ], batch size: 298, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:35:40,124 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2056026.0, ans=0.1 2023-06-25 19:35:42,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2056026.0, ans=0.125 2023-06-25 19:36:16,736 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-25 19:36:17,065 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.460e+02 1.008e+03 1.554e+03 2.511e+03 5.348e+03, threshold=3.107e+03, percent-clipped=25.0 2023-06-25 19:36:20,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2056146.0, ans=10.0 2023-06-25 19:36:21,735 INFO [train.py:996] (2/4) Epoch 12, batch 7250, loss[loss=0.1841, simple_loss=0.2424, pruned_loss=0.06292, over 21330.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2966, pruned_loss=0.07566, over 4264223.49 frames. ], batch size: 551, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:37:03,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2056206.0, ans=0.125 2023-06-25 19:37:25,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2056266.0, ans=0.125 2023-06-25 19:37:55,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2056386.0, ans=0.0 2023-06-25 19:38:09,782 INFO [train.py:996] (2/4) Epoch 12, batch 7300, loss[loss=0.1785, simple_loss=0.2459, pruned_loss=0.05557, over 21460.00 frames. ], tot_loss[loss=0.2214, simple_loss=0.2917, pruned_loss=0.07556, over 4268791.23 frames. ], batch size: 195, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:38:15,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2056446.0, ans=10.0 2023-06-25 19:38:47,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2056506.0, ans=10.0 2023-06-25 19:39:54,694 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.105e+02 8.292e+02 1.311e+03 1.863e+03 3.675e+03, threshold=2.622e+03, percent-clipped=4.0 2023-06-25 19:39:59,538 INFO [train.py:996] (2/4) Epoch 12, batch 7350, loss[loss=0.2594, simple_loss=0.3258, pruned_loss=0.09643, over 21710.00 frames. ], tot_loss[loss=0.2208, simple_loss=0.2895, pruned_loss=0.076, over 4270543.38 frames. 
], batch size: 332, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:40:30,782 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.54 vs. limit=12.0 2023-06-25 19:41:03,681 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:41:20,622 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-25 19:41:25,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2056926.0, ans=0.125 2023-06-25 19:42:01,538 INFO [train.py:996] (2/4) Epoch 12, batch 7400, loss[loss=0.2576, simple_loss=0.3481, pruned_loss=0.08354, over 21458.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2965, pruned_loss=0.07757, over 4269283.31 frames. ], batch size: 471, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:42:15,656 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2057046.0, ans=0.2 2023-06-25 19:43:02,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2057226.0, ans=0.035 2023-06-25 19:43:30,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2057286.0, ans=0.0 2023-06-25 19:43:43,651 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.378e+02 8.354e+02 1.413e+03 2.372e+03 4.608e+03, threshold=2.826e+03, percent-clipped=17.0 2023-06-25 19:43:48,734 INFO [train.py:996] (2/4) Epoch 12, batch 7450, loss[loss=0.1999, simple_loss=0.2765, pruned_loss=0.06166, over 21669.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.2974, pruned_loss=0.07712, over 4270889.85 frames. ], batch size: 333, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:44:19,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2057406.0, ans=0.125 2023-06-25 19:44:54,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2057526.0, ans=0.125 2023-06-25 19:45:39,199 INFO [train.py:996] (2/4) Epoch 12, batch 7500, loss[loss=0.2973, simple_loss=0.3936, pruned_loss=0.1005, over 21640.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3016, pruned_loss=0.07824, over 4269761.78 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:45:46,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2057646.0, ans=0.1 2023-06-25 19:47:23,858 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.023e+02 9.552e+02 1.495e+03 2.068e+03 4.249e+03, threshold=2.990e+03, percent-clipped=12.0 2023-06-25 19:47:36,215 INFO [train.py:996] (2/4) Epoch 12, batch 7550, loss[loss=0.2068, simple_loss=0.313, pruned_loss=0.05036, over 21656.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3081, pruned_loss=0.07657, over 4272415.69 frames. ], batch size: 414, lr: 2.45e-03, grad_scale: 8.0 2023-06-25 19:48:56,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2058126.0, ans=0.125 2023-06-25 19:49:17,683 INFO [train.py:996] (2/4) Epoch 12, batch 7600, loss[loss=0.2273, simple_loss=0.3064, pruned_loss=0.07416, over 22070.00 frames. 
], tot_loss[loss=0.2273, simple_loss=0.3049, pruned_loss=0.0748, over 4277075.81 frames. ], batch size: 119, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:49:18,136 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2058246.0, ans=0.1 2023-06-25 19:49:52,966 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2058306.0, ans=0.0 2023-06-25 19:50:12,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2058366.0, ans=10.0 2023-06-25 19:50:24,610 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2058426.0, ans=0.0 2023-06-25 19:51:01,884 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.519e+02 8.115e+02 1.091e+03 1.750e+03 3.922e+03, threshold=2.181e+03, percent-clipped=5.0 2023-06-25 19:51:12,221 INFO [train.py:996] (2/4) Epoch 12, batch 7650, loss[loss=0.2088, simple_loss=0.2779, pruned_loss=0.06979, over 22022.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3046, pruned_loss=0.07634, over 4280577.98 frames. ], batch size: 300, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:51:14,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2058546.0, ans=0.125 2023-06-25 19:51:53,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2058666.0, ans=0.0 2023-06-25 19:53:03,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2058846.0, ans=0.07 2023-06-25 19:53:04,309 INFO [train.py:996] (2/4) Epoch 12, batch 7700, loss[loss=0.2116, simple_loss=0.2738, pruned_loss=0.07468, over 20173.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3072, pruned_loss=0.07909, over 4287027.37 frames. ], batch size: 702, lr: 2.45e-03, grad_scale: 16.0 2023-06-25 19:53:21,103 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 19:53:28,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2058906.0, ans=0.125 2023-06-25 19:53:28,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2058906.0, ans=0.125 2023-06-25 19:54:52,162 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.071e+02 8.849e+02 1.361e+03 2.010e+03 4.988e+03, threshold=2.722e+03, percent-clipped=19.0 2023-06-25 19:55:00,908 INFO [train.py:996] (2/4) Epoch 12, batch 7750, loss[loss=0.254, simple_loss=0.3586, pruned_loss=0.07463, over 21739.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3131, pruned_loss=0.07937, over 4287207.23 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:55:18,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2059206.0, ans=0.125 2023-06-25 19:56:08,780 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2059326.0, ans=0.0 2023-06-25 19:56:50,896 INFO [train.py:996] (2/4) Epoch 12, batch 7800, loss[loss=0.243, simple_loss=0.3733, pruned_loss=0.05632, over 19792.00 frames. 
], tot_loss[loss=0.2368, simple_loss=0.3146, pruned_loss=0.0795, over 4274547.76 frames. ], batch size: 702, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:57:06,013 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.28 vs. limit=22.5 2023-06-25 19:57:12,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2059506.0, ans=0.125 2023-06-25 19:57:48,259 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten.whitening_limit, batch_count=2059566.0, ans=15.0 2023-06-25 19:58:09,125 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.53 vs. limit=15.0 2023-06-25 19:58:10,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.min_positive, batch_count=2059626.0, ans=0.025 2023-06-25 19:58:11,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2059626.0, ans=0.125 2023-06-25 19:58:37,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2059686.0, ans=0.125 2023-06-25 19:58:38,502 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.766e+02 1.116e+03 1.579e+03 2.542e+03 4.639e+03, threshold=3.158e+03, percent-clipped=18.0 2023-06-25 19:58:42,205 INFO [train.py:996] (2/4) Epoch 12, batch 7850, loss[loss=0.2244, simple_loss=0.29, pruned_loss=0.07942, over 21557.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3101, pruned_loss=0.07921, over 4263747.91 frames. ], batch size: 391, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 19:58:45,972 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2059746.0, ans=0.125 2023-06-25 19:59:50,414 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:00:01,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2059926.0, ans=0.125 2023-06-25 20:00:26,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2059986.0, ans=0.0 2023-06-25 20:00:32,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2060046.0, ans=0.125 2023-06-25 20:00:33,863 INFO [train.py:996] (2/4) Epoch 12, batch 7900, loss[loss=0.1975, simple_loss=0.2637, pruned_loss=0.06564, over 21460.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3052, pruned_loss=0.07791, over 4262768.34 frames. 
], batch size: 212, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:01:15,037 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2060106.0, ans=0.125 2023-06-25 20:01:47,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2060226.0, ans=0.0 2023-06-25 20:02:23,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2060286.0, ans=0.0 2023-06-25 20:02:23,986 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.116e+02 9.466e+02 1.589e+03 2.565e+03 7.022e+03, threshold=3.178e+03, percent-clipped=11.0 2023-06-25 20:02:27,439 INFO [train.py:996] (2/4) Epoch 12, batch 7950, loss[loss=0.2414, simple_loss=0.3205, pruned_loss=0.08115, over 21813.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3092, pruned_loss=0.07796, over 4266094.69 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:02:46,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2060346.0, ans=0.0 2023-06-25 20:03:03,162 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-25 20:03:46,465 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2060526.0, ans=0.09899494936611666 2023-06-25 20:04:21,422 INFO [train.py:996] (2/4) Epoch 12, batch 8000, loss[loss=0.2552, simple_loss=0.3513, pruned_loss=0.07952, over 21642.00 frames. ], tot_loss[loss=0.2385, simple_loss=0.3151, pruned_loss=0.08089, over 4265533.04 frames. ], batch size: 414, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:04:39,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2060646.0, ans=0.2 2023-06-25 20:04:51,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2060706.0, ans=0.0 2023-06-25 20:06:16,540 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.154e+02 1.106e+03 1.585e+03 2.816e+03 6.535e+03, threshold=3.170e+03, percent-clipped=17.0 2023-06-25 20:06:25,762 INFO [train.py:996] (2/4) Epoch 12, batch 8050, loss[loss=0.1904, simple_loss=0.2494, pruned_loss=0.06565, over 21430.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3176, pruned_loss=0.08044, over 4261697.40 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:06:31,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2060946.0, ans=0.09899494936611666 2023-06-25 20:06:53,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2061006.0, ans=0.125 2023-06-25 20:06:54,390 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.92 vs. 
limit=10.0 2023-06-25 20:07:51,327 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2061186.0, ans=0.1 2023-06-25 20:08:01,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2061186.0, ans=0.0 2023-06-25 20:08:15,659 INFO [train.py:996] (2/4) Epoch 12, batch 8100, loss[loss=0.2281, simple_loss=0.2987, pruned_loss=0.07876, over 21929.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3148, pruned_loss=0.08057, over 4267462.88 frames. ], batch size: 316, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:08:16,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2061246.0, ans=0.125 2023-06-25 20:08:41,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2061306.0, ans=0.125 2023-06-25 20:08:51,048 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2061306.0, ans=0.09899494936611666 2023-06-25 20:09:22,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.08 vs. limit=12.0 2023-06-25 20:09:31,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2061426.0, ans=0.2 2023-06-25 20:10:11,167 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.662e+02 1.217e+03 1.664e+03 2.613e+03 6.278e+03, threshold=3.327e+03, percent-clipped=14.0 2023-06-25 20:10:14,535 INFO [train.py:996] (2/4) Epoch 12, batch 8150, loss[loss=0.2851, simple_loss=0.3915, pruned_loss=0.08941, over 21572.00 frames. ], tot_loss[loss=0.2444, simple_loss=0.3252, pruned_loss=0.08175, over 4269419.96 frames. ], batch size: 441, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:11:16,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.96 vs. limit=12.0 2023-06-25 20:11:28,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2061726.0, ans=0.0 2023-06-25 20:12:03,278 INFO [train.py:996] (2/4) Epoch 12, batch 8200, loss[loss=0.1843, simple_loss=0.2547, pruned_loss=0.05701, over 21364.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3161, pruned_loss=0.07946, over 4257425.91 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:12:17,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2061846.0, ans=0.125 2023-06-25 20:13:50,890 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.075e+02 8.315e+02 1.225e+03 2.435e+03 5.897e+03, threshold=2.449e+03, percent-clipped=16.0 2023-06-25 20:13:54,410 INFO [train.py:996] (2/4) Epoch 12, batch 8250, loss[loss=0.2309, simple_loss=0.3229, pruned_loss=0.06952, over 21382.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3151, pruned_loss=0.07879, over 4263134.20 frames. 
], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:14:08,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2062146.0, ans=0.125 2023-06-25 20:14:34,687 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2062206.0, ans=0.125 2023-06-25 20:14:35,212 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.25 vs. limit=15.0 2023-06-25 20:14:43,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2062266.0, ans=0.0 2023-06-25 20:14:43,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2062266.0, ans=15.0 2023-06-25 20:14:48,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2062266.0, ans=0.125 2023-06-25 20:15:34,633 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.85 vs. limit=22.5 2023-06-25 20:15:44,047 INFO [train.py:996] (2/4) Epoch 12, batch 8300, loss[loss=0.2181, simple_loss=0.3062, pruned_loss=0.06497, over 21812.00 frames. ], tot_loss[loss=0.233, simple_loss=0.3129, pruned_loss=0.07649, over 4263953.31 frames. ], batch size: 371, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:15:45,525 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.37 vs. limit=6.0 2023-06-25 20:17:28,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2062686.0, ans=0.0 2023-06-25 20:17:29,298 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.578e+02 9.193e+02 1.531e+03 2.133e+03 6.656e+03, threshold=3.063e+03, percent-clipped=18.0 2023-06-25 20:17:32,752 INFO [train.py:996] (2/4) Epoch 12, batch 8350, loss[loss=0.2217, simple_loss=0.3078, pruned_loss=0.06778, over 20027.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3134, pruned_loss=0.07489, over 4261761.73 frames. ], batch size: 703, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:17:36,673 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer_ff3.min_abs, batch_count=2062746.0, ans=0.2 2023-06-25 20:17:46,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-06-25 20:17:58,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2062806.0, ans=0.07 2023-06-25 20:17:58,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2062806.0, ans=0.1 2023-06-25 20:18:20,041 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2062866.0, ans=0.2 2023-06-25 20:19:14,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2062986.0, ans=0.125 2023-06-25 20:19:26,209 INFO [train.py:996] (2/4) Epoch 12, batch 8400, loss[loss=0.1827, simple_loss=0.2643, pruned_loss=0.05059, over 21432.00 frames. 
], tot_loss[loss=0.2287, simple_loss=0.3109, pruned_loss=0.07322, over 4249169.00 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 32.0 2023-06-25 20:20:03,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2063106.0, ans=0.07 2023-06-25 20:20:09,982 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2063166.0, ans=0.125 2023-06-25 20:20:33,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2063226.0, ans=0.0 2023-06-25 20:20:36,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2063226.0, ans=0.125 2023-06-25 20:20:49,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.02 vs. limit=12.0 2023-06-25 20:20:59,896 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=12.81 vs. limit=15.0 2023-06-25 20:21:14,092 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.481e+02 1.044e+03 1.681e+03 2.727e+03 5.790e+03, threshold=3.363e+03, percent-clipped=19.0 2023-06-25 20:21:14,122 INFO [train.py:996] (2/4) Epoch 12, batch 8450, loss[loss=0.2028, simple_loss=0.2769, pruned_loss=0.06432, over 21755.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.3074, pruned_loss=0.07187, over 4254624.42 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:21:38,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.05 vs. limit=15.0 2023-06-25 20:22:06,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2063466.0, ans=0.125 2023-06-25 20:22:33,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2063526.0, ans=0.2 2023-06-25 20:22:54,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2063586.0, ans=0.1 2023-06-25 20:23:01,380 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2063586.0, ans=0.125 2023-06-25 20:23:01,420 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:23:04,270 INFO [train.py:996] (2/4) Epoch 12, batch 8500, loss[loss=0.1981, simple_loss=0.2694, pruned_loss=0.0634, over 21760.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.3022, pruned_loss=0.0725, over 4264395.13 frames. 
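The Whitening lines compare a per-module "metric" against a scheduled "limit" (e.g. metric=12.81 vs. limit=15.0 above). A plausible interpretation, sketched below, is a measure of how far the channel covariance of an activation is from a multiple of the identity: the value is 1.0 for perfectly white activations and grows as they become more anisotropic, and a penalty is applied only while it exceeds the limit. The exact formula in scaling.py may differ; this is an assumed, self-contained version.

# Anisotropy of the channel covariance: mean squared eigenvalue / squared mean
# eigenvalue, computed via trace identities so no eigendecomposition is needed.
import torch

def whitening_metric(x: torch.Tensor) -> torch.Tensor:
    """x: (num_frames, num_channels) activations for one group."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.t() @ x) / x.shape[0]                  # (C, C) channel covariance
    d = cov.shape[0]
    return d * torch.trace(cov @ cov) / (torch.trace(cov) ** 2 + 1e-20)

white = torch.randn(1000, 256)                           # roughly isotropic activations
collapsed = torch.randn(1000, 1) * torch.randn(1, 256)   # nearly rank-1, far from white
print(whitening_metric(white))       # close to 1 (up to sampling noise)
print(whitening_metric(collapsed))   # large; would exceed a limit such as 15.0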
], batch size: 316, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:23:21,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2063646.0, ans=0.125 2023-06-25 20:23:35,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2063706.0, ans=0.125 2023-06-25 20:23:50,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2063766.0, ans=0.035 2023-06-25 20:24:25,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2063826.0, ans=0.0 2023-06-25 20:24:36,480 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2063826.0, ans=0.1 2023-06-25 20:24:58,601 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.271e+02 9.731e+02 1.377e+03 2.109e+03 5.965e+03, threshold=2.755e+03, percent-clipped=8.0 2023-06-25 20:24:58,626 INFO [train.py:996] (2/4) Epoch 12, batch 8550, loss[loss=0.2788, simple_loss=0.3662, pruned_loss=0.09565, over 21626.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3077, pruned_loss=0.07565, over 4266412.63 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:25:15,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2063946.0, ans=0.125 2023-06-25 20:25:44,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-25 20:25:47,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2064066.0, ans=0.0 2023-06-25 20:26:10,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2064066.0, ans=0.125 2023-06-25 20:26:21,455 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2064126.0, ans=0.125 2023-06-25 20:26:57,766 INFO [train.py:996] (2/4) Epoch 12, batch 8600, loss[loss=0.287, simple_loss=0.3532, pruned_loss=0.1104, over 21711.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.314, pruned_loss=0.07771, over 4268717.82 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:27:01,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2064246.0, ans=0.0 2023-06-25 20:27:49,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2064366.0, ans=0.2 2023-06-25 20:28:04,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2064366.0, ans=0.125 2023-06-25 20:28:30,856 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2064486.0, ans=0.125 2023-06-25 20:28:53,064 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.484e+02 8.709e+02 1.194e+03 1.940e+03 4.638e+03, threshold=2.389e+03, percent-clipped=12.0 2023-06-25 20:28:53,095 INFO [train.py:996] (2/4) Epoch 12, batch 8650, loss[loss=0.2966, simple_loss=0.3814, pruned_loss=0.1059, over 21469.00 frames. 
], tot_loss[loss=0.2384, simple_loss=0.3189, pruned_loss=0.07898, over 4269940.80 frames. ], batch size: 507, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:29:08,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2064606.0, ans=0.1 2023-06-25 20:29:39,214 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2064666.0, ans=0.04949747468305833 2023-06-25 20:29:51,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2064666.0, ans=0.125 2023-06-25 20:30:40,446 INFO [train.py:996] (2/4) Epoch 12, batch 8700, loss[loss=0.2194, simple_loss=0.2831, pruned_loss=0.07788, over 21867.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3111, pruned_loss=0.07616, over 4270204.65 frames. ], batch size: 107, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:31:30,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2064966.0, ans=0.125 2023-06-25 20:31:35,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2064966.0, ans=0.0 2023-06-25 20:31:35,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2064966.0, ans=0.1 2023-06-25 20:32:05,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2065086.0, ans=0.125 2023-06-25 20:32:25,097 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.287e+02 9.432e+02 1.371e+03 2.158e+03 4.053e+03, threshold=2.743e+03, percent-clipped=19.0 2023-06-25 20:32:25,131 INFO [train.py:996] (2/4) Epoch 12, batch 8750, loss[loss=0.2253, simple_loss=0.2912, pruned_loss=0.07974, over 21986.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3063, pruned_loss=0.07628, over 4274680.65 frames. ], batch size: 103, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:32:50,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2065206.0, ans=0.125 2023-06-25 20:33:21,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.82 vs. limit=8.0 2023-06-25 20:33:58,550 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.73 vs. limit=15.0 2023-06-25 20:34:01,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2065386.0, ans=0.125 2023-06-25 20:34:21,907 INFO [train.py:996] (2/4) Epoch 12, batch 8800, loss[loss=0.2555, simple_loss=0.3343, pruned_loss=0.0884, over 21537.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3138, pruned_loss=0.07902, over 4280165.65 frames. 
], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:34:56,297 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2065506.0, ans=0.2 2023-06-25 20:34:57,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2065506.0, ans=0.2 2023-06-25 20:35:01,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2065506.0, ans=0.0 2023-06-25 20:35:13,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2065566.0, ans=0.0 2023-06-25 20:35:14,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2065566.0, ans=0.09899494936611666 2023-06-25 20:35:26,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.max_abs, batch_count=2065626.0, ans=10.0 2023-06-25 20:36:12,848 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.737e+02 1.070e+03 1.502e+03 2.029e+03 4.867e+03, threshold=3.004e+03, percent-clipped=9.0 2023-06-25 20:36:12,878 INFO [train.py:996] (2/4) Epoch 12, batch 8850, loss[loss=0.2835, simple_loss=0.3452, pruned_loss=0.1109, over 21348.00 frames. ], tot_loss[loss=0.2397, simple_loss=0.3188, pruned_loss=0.08035, over 4267941.85 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:36:17,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2065746.0, ans=0.1 2023-06-25 20:36:33,419 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2065746.0, ans=0.0 2023-06-25 20:36:35,312 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2065746.0, ans=0.0 2023-06-25 20:38:14,918 INFO [train.py:996] (2/4) Epoch 12, batch 8900, loss[loss=0.2081, simple_loss=0.279, pruned_loss=0.06864, over 21594.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3137, pruned_loss=0.07951, over 4270205.07 frames. ], batch size: 415, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:39:00,072 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.69 vs. limit=15.0 2023-06-25 20:40:02,771 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2066286.0, ans=0.125 2023-06-25 20:40:05,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.51 vs. limit=10.0 2023-06-25 20:40:09,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.332e+02 9.490e+02 1.490e+03 1.972e+03 6.536e+03, threshold=2.979e+03, percent-clipped=12.0 2023-06-25 20:40:09,315 INFO [train.py:996] (2/4) Epoch 12, batch 8950, loss[loss=0.2407, simple_loss=0.2994, pruned_loss=0.09098, over 21834.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.3152, pruned_loss=0.07832, over 4275200.67 frames. 
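The grad_scale field moves between 8.0, 16.0 and 32.0 across this section, a pattern consistent with standard dynamic loss scaling in mixed-precision training: halve the scale after a step whose gradients overflow, and double it again after a long enough run of clean steps. The sketch below is a generic version of that rule, not the scaler train.py actually uses, and the growth_interval value is an assumption.

# Generic dynamic loss-scale bookkeeping for fp16 training.
class DynamicGradScale:
    def __init__(self, init_scale=16.0, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_inf: bool):
        if found_inf:
            self.scale *= 0.5          # back off after an overflowing step
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.growth_interval:
                self.scale *= 2.0      # grow again after enough clean steps
                self.good_steps = 0
        return self.scale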
], batch size: 98, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:40:33,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2066406.0, ans=0.125 2023-06-25 20:40:43,615 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:41:04,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2066466.0, ans=0.125 2023-06-25 20:41:09,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2066466.0, ans=0.125 2023-06-25 20:41:30,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2066526.0, ans=0.0 2023-06-25 20:41:38,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2066586.0, ans=0.1 2023-06-25 20:41:57,042 INFO [train.py:996] (2/4) Epoch 12, batch 9000, loss[loss=0.2145, simple_loss=0.2772, pruned_loss=0.07595, over 21672.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3095, pruned_loss=0.07811, over 4271261.72 frames. ], batch size: 248, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:41:57,043 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 20:42:15,060 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2658, simple_loss=0.3589, pruned_loss=0.08634, over 1796401.00 frames. 2023-06-25 20:42:15,061 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 20:42:40,968 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.81 vs. limit=15.0 2023-06-25 20:42:46,317 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.75 vs. limit=15.0 2023-06-25 20:42:55,171 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-25 20:43:05,002 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.22 vs. limit=15.0 2023-06-25 20:43:56,543 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.03 vs. limit=12.0 2023-06-25 20:44:02,542 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.409e+02 9.189e+02 1.168e+03 1.657e+03 3.184e+03, threshold=2.336e+03, percent-clipped=2.0 2023-06-25 20:44:02,573 INFO [train.py:996] (2/4) Epoch 12, batch 9050, loss[loss=0.2445, simple_loss=0.3153, pruned_loss=0.08683, over 21584.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3052, pruned_loss=0.07457, over 4276231.30 frames. ], batch size: 263, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:44:48,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2067066.0, ans=0.125 2023-06-25 20:45:43,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.47 vs. limit=22.5 2023-06-25 20:45:59,884 INFO [train.py:996] (2/4) Epoch 12, batch 9100, loss[loss=0.212, simple_loss=0.3162, pruned_loss=0.05389, over 21682.00 frames. 
], tot_loss[loss=0.2328, simple_loss=0.3112, pruned_loss=0.07716, over 4286329.31 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:46:24,975 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2067306.0, ans=0.09899494936611666 2023-06-25 20:47:03,746 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2067366.0, ans=0.125 2023-06-25 20:47:12,827 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 20:47:56,027 INFO [train.py:996] (2/4) Epoch 12, batch 9150, loss[loss=0.2215, simple_loss=0.3069, pruned_loss=0.06807, over 21736.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3172, pruned_loss=0.07578, over 4286059.76 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 20:47:57,729 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.614e+02 8.747e+02 1.495e+03 2.212e+03 5.275e+03, threshold=2.990e+03, percent-clipped=21.0 2023-06-25 20:49:28,045 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2067786.0, ans=0.0 2023-06-25 20:49:44,681 INFO [train.py:996] (2/4) Epoch 12, batch 9200, loss[loss=0.3782, simple_loss=0.4216, pruned_loss=0.1674, over 21354.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3193, pruned_loss=0.07537, over 4287847.01 frames. ], batch size: 507, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:50:43,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2067966.0, ans=0.0 2023-06-25 20:51:08,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=2068086.0, ans=6.0 2023-06-25 20:51:23,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2068086.0, ans=0.0 2023-06-25 20:51:31,761 INFO [train.py:996] (2/4) Epoch 12, batch 9250, loss[loss=0.1942, simple_loss=0.2799, pruned_loss=0.05428, over 20767.00 frames. ], tot_loss[loss=0.2376, simple_loss=0.32, pruned_loss=0.07759, over 4289191.13 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:51:33,445 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.008e+02 1.206e+03 1.769e+03 2.365e+03 5.380e+03, threshold=3.537e+03, percent-clipped=9.0 2023-06-25 20:51:54,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2068206.0, ans=0.2 2023-06-25 20:52:00,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2068206.0, ans=0.07 2023-06-25 20:53:22,068 INFO [train.py:996] (2/4) Epoch 12, batch 9300, loss[loss=0.3068, simple_loss=0.3577, pruned_loss=0.128, over 21275.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3164, pruned_loss=0.07719, over 4273845.36 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:54:32,095 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2068626.0, ans=0.1 2023-06-25 20:54:56,030 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.99 vs. 
limit=15.0 2023-06-25 20:55:18,008 INFO [train.py:996] (2/4) Epoch 12, batch 9350, loss[loss=0.2446, simple_loss=0.3267, pruned_loss=0.08121, over 21428.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3208, pruned_loss=0.07789, over 4277367.23 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:55:19,928 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.869e+02 1.391e+03 2.110e+03 3.228e+03 6.570e+03, threshold=4.220e+03, percent-clipped=18.0 2023-06-25 20:55:31,246 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2068746.0, ans=0.0 2023-06-25 20:55:41,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2068806.0, ans=0.0 2023-06-25 20:55:57,620 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2068866.0, ans=0.125 2023-06-25 20:56:45,198 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2068986.0, ans=0.125 2023-06-25 20:56:56,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=3.99 vs. limit=15.0 2023-06-25 20:57:09,382 INFO [train.py:996] (2/4) Epoch 12, batch 9400, loss[loss=0.2359, simple_loss=0.3019, pruned_loss=0.08496, over 21764.00 frames. ], tot_loss[loss=0.2381, simple_loss=0.3204, pruned_loss=0.0779, over 4277520.21 frames. ], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:57:55,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2069166.0, ans=0.125 2023-06-25 20:58:22,020 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.07 vs. limit=15.0 2023-06-25 20:58:34,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2069286.0, ans=0.07 2023-06-25 20:58:38,059 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=13.78 vs. limit=15.0 2023-06-25 20:58:57,596 INFO [train.py:996] (2/4) Epoch 12, batch 9450, loss[loss=0.2009, simple_loss=0.2721, pruned_loss=0.06482, over 21756.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3145, pruned_loss=0.07749, over 4255738.80 frames. 
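The optim.py entries scattered through this region ("Clipping_scale=2.0, grad-norm quartiles ... threshold=... percent-clipped=...") summarise the recent distribution of gradient norms and how often they were clipped. The class below is a rough, self-contained illustration of median-based clipping with quartile reporting; it is not the optimizer used in this run, and GradNormTracker is an invented name.

import torch
from collections import deque

class GradNormTracker:
    def __init__(self, window=200, clipping_scale=2.0):
        self.norms = deque(maxlen=window)   # recent total gradient norms
        self.clipping_scale = clipping_scale
        self.clipped = 0
        self.seen = 0

    def clip_(self, model):
        params = [p for p in model.parameters() if p.grad is not None]
        total = float(torch.norm(torch.stack([p.grad.detach().norm() for p in params])))
        self.norms.append(total)
        # Assumed rule: clip against clipping_scale times the recent median norm.
        threshold = self.clipping_scale * sorted(self.norms)[len(self.norms) // 2]
        self.seen += 1
        if total > threshold:
            self.clipped += 1
            for p in params:
                p.grad.mul_(threshold / (total + 1e-20))
        return total, threshold

    def summary(self):
        t = torch.tensor(sorted(self.norms))
        quartiles = torch.quantile(t, torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0]))
        return quartiles.tolist(), 100.0 * self.clipped / max(self.seen, 1)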
], batch size: 118, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 20:58:59,290 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.200e+02 9.646e+02 1.392e+03 2.055e+03 4.300e+03, threshold=2.785e+03, percent-clipped=2.0 2023-06-25 20:59:03,223 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2069346.0, ans=0.125 2023-06-25 20:59:03,293 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2069346.0, ans=0.04949747468305833 2023-06-25 20:59:41,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2069466.0, ans=0.07 2023-06-25 20:59:41,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2069466.0, ans=0.0 2023-06-25 20:59:55,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2069466.0, ans=0.5 2023-06-25 21:00:01,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2069526.0, ans=0.0 2023-06-25 21:00:25,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2069586.0, ans=0.0 2023-06-25 21:00:45,742 INFO [train.py:996] (2/4) Epoch 12, batch 9500, loss[loss=0.2113, simple_loss=0.274, pruned_loss=0.07435, over 21768.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3069, pruned_loss=0.07593, over 4264741.85 frames. ], batch size: 317, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:01:01,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2069646.0, ans=0.0 2023-06-25 21:01:13,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.62 vs. limit=15.0 2023-06-25 21:01:51,007 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.66 vs. limit=10.0 2023-06-25 21:02:06,669 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2069826.0, ans=0.125 2023-06-25 21:02:27,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2069886.0, ans=0.1 2023-06-25 21:02:37,425 INFO [train.py:996] (2/4) Epoch 12, batch 9550, loss[loss=0.2788, simple_loss=0.3571, pruned_loss=0.1003, over 21820.00 frames. ], tot_loss[loss=0.2322, simple_loss=0.3092, pruned_loss=0.07762, over 4260490.21 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:02:40,616 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.839e+02 1.229e+03 2.098e+03 2.991e+03 5.309e+03, threshold=4.197e+03, percent-clipped=32.0 2023-06-25 21:02:41,105 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2069946.0, ans=0.125 2023-06-25 21:03:17,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2070006.0, ans=0.1 2023-06-25 21:03:29,114 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=5.51 vs. 
limit=12.0 2023-06-25 21:04:23,440 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2070186.0, ans=0.125 2023-06-25 21:04:29,467 INFO [train.py:996] (2/4) Epoch 12, batch 9600, loss[loss=0.2331, simple_loss=0.3051, pruned_loss=0.08053, over 21775.00 frames. ], tot_loss[loss=0.2359, simple_loss=0.312, pruned_loss=0.07984, over 4272846.33 frames. ], batch size: 389, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:04:59,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.45 vs. limit=22.5 2023-06-25 21:05:12,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2070306.0, ans=0.125 2023-06-25 21:05:16,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2070366.0, ans=0.0 2023-06-25 21:05:17,888 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2070366.0, ans=0.1 2023-06-25 21:05:23,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.87 vs. limit=15.0 2023-06-25 21:06:11,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2070486.0, ans=0.125 2023-06-25 21:06:19,450 INFO [train.py:996] (2/4) Epoch 12, batch 9650, loss[loss=0.2668, simple_loss=0.3479, pruned_loss=0.09285, over 21804.00 frames. ], tot_loss[loss=0.237, simple_loss=0.3131, pruned_loss=0.08045, over 4277642.80 frames. ], batch size: 124, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:06:23,088 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.038e+02 1.081e+03 1.726e+03 2.542e+03 4.912e+03, threshold=3.453e+03, percent-clipped=2.0 2023-06-25 21:06:54,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2070606.0, ans=0.0 2023-06-25 21:07:58,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.95 vs. limit=15.0 2023-06-25 21:08:05,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2070786.0, ans=0.5 2023-06-25 21:08:13,740 INFO [train.py:996] (2/4) Epoch 12, batch 9700, loss[loss=0.1926, simple_loss=0.2761, pruned_loss=0.0546, over 21719.00 frames. ], tot_loss[loss=0.2389, simple_loss=0.3161, pruned_loss=0.08082, over 4274083.54 frames. ], batch size: 247, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:08:37,538 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.16 vs. 
limit=15.0 2023-06-25 21:08:38,494 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2070906.0, ans=0.2 2023-06-25 21:08:47,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2070906.0, ans=0.2 2023-06-25 21:09:10,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2070966.0, ans=0.5 2023-06-25 21:09:48,498 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2071086.0, ans=0.07 2023-06-25 21:10:01,558 INFO [train.py:996] (2/4) Epoch 12, batch 9750, loss[loss=0.2225, simple_loss=0.2934, pruned_loss=0.0758, over 21390.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3115, pruned_loss=0.07948, over 4277090.48 frames. ], batch size: 131, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:10:04,507 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.823e+02 1.223e+03 1.776e+03 2.509e+03 4.467e+03, threshold=3.552e+03, percent-clipped=5.0 2023-06-25 21:10:05,785 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-25 21:10:10,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2071146.0, ans=0.0 2023-06-25 21:10:51,695 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:11:33,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.64 vs. limit=15.0 2023-06-25 21:11:43,867 INFO [train.py:996] (2/4) Epoch 12, batch 9800, loss[loss=0.2012, simple_loss=0.25, pruned_loss=0.07623, over 20755.00 frames. ], tot_loss[loss=0.2337, simple_loss=0.3091, pruned_loss=0.07912, over 4281090.64 frames. ], batch size: 609, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:12:00,313 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.72 vs. limit=22.5 2023-06-25 21:12:21,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2071506.0, ans=0.1 2023-06-25 21:12:31,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2071566.0, ans=0.1 2023-06-25 21:12:39,293 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.13 vs. limit=15.0 2023-06-25 21:13:10,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2071626.0, ans=0.2 2023-06-25 21:13:32,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2071686.0, ans=0.125 2023-06-25 21:13:34,213 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2071746.0, ans=0.0 2023-06-25 21:13:35,343 INFO [train.py:996] (2/4) Epoch 12, batch 9850, loss[loss=0.2209, simple_loss=0.2866, pruned_loss=0.07763, over 21801.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3051, pruned_loss=0.07869, over 4281066.56 frames. 
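Most scaling.py:182 entries print a ScheduledFloat: a scalar hyperparameter (dropout probability, skip rate, minimum bypass scale, ...) whose current value ("ans") depends on batch_count. One plausible reading is a piecewise-linear schedule over batch count, sketched below; scheduled_float and the example breakpoints are illustrative, not the schedules actually attached to the named parameters.

def scheduled_float(batch_count, points):
    # points: sorted list of (batch_count, value) breakpoints; values are held
    # constant outside the range and linearly interpolated inside it.
    points = sorted(points)
    if batch_count <= points[0][0]:
        return points[0][1]
    if batch_count >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= batch_count <= x1:
            frac = (batch_count - x0) / (x1 - x0)
            return y0 + frac * (y1 - y0)
    return points[-1][1]

# e.g. a skip rate decaying from 0.3 to 0.0 over the first 20k batches
# (illustrative breakpoints only):
rate = scheduled_float(15000.0, [(0.0, 0.3), (20000.0, 0.0)])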
], batch size: 351, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:13:44,008 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.466e+02 9.389e+02 1.315e+03 1.721e+03 3.595e+03, threshold=2.631e+03, percent-clipped=2.0 2023-06-25 21:14:13,974 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2071806.0, ans=0.0 2023-06-25 21:14:56,698 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-25 21:15:05,013 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2071926.0, ans=0.07 2023-06-25 21:15:21,985 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:15:31,661 INFO [train.py:996] (2/4) Epoch 12, batch 9900, loss[loss=0.2713, simple_loss=0.3452, pruned_loss=0.0987, over 21561.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3026, pruned_loss=0.07848, over 4275567.21 frames. ], batch size: 414, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:15:55,883 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.85 vs. limit=15.0 2023-06-25 21:17:14,123 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2072286.0, ans=0.125 2023-06-25 21:17:16,869 INFO [train.py:996] (2/4) Epoch 12, batch 9950, loss[loss=0.2004, simple_loss=0.2638, pruned_loss=0.06849, over 21579.00 frames. ], tot_loss[loss=0.2321, simple_loss=0.3034, pruned_loss=0.08037, over 4254200.92 frames. ], batch size: 231, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:17:25,542 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.617e+02 8.355e+02 1.275e+03 2.214e+03 4.972e+03, threshold=2.550e+03, percent-clipped=15.0 2023-06-25 21:18:26,529 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2072466.0, ans=0.1 2023-06-25 21:19:16,258 INFO [train.py:996] (2/4) Epoch 12, batch 10000, loss[loss=0.2501, simple_loss=0.3047, pruned_loss=0.09777, over 21528.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3008, pruned_loss=0.07967, over 4264946.09 frames. ], batch size: 441, lr: 2.44e-03, grad_scale: 32.0 2023-06-25 21:19:16,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2072646.0, ans=0.0 2023-06-25 21:19:39,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2072706.0, ans=0.2 2023-06-25 21:19:53,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2072706.0, ans=0.0 2023-06-25 21:19:55,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2072766.0, ans=0.125 2023-06-25 21:19:57,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2072766.0, ans=0.0 2023-06-25 21:20:00,040 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.99 vs. 
limit=6.0 2023-06-25 21:20:08,314 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2072766.0, ans=0.2 2023-06-25 21:20:34,703 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2072826.0, ans=0.0 2023-06-25 21:20:55,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2072886.0, ans=0.125 2023-06-25 21:20:55,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.91 vs. limit=22.5 2023-06-25 21:21:03,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.min_positive, batch_count=2072946.0, ans=0.05 2023-06-25 21:21:04,423 INFO [train.py:996] (2/4) Epoch 12, batch 10050, loss[loss=0.1786, simple_loss=0.2535, pruned_loss=0.05191, over 21335.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3031, pruned_loss=0.0804, over 4264901.88 frames. ], batch size: 194, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:21:16,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.942e+02 8.782e+02 1.169e+03 1.851e+03 4.390e+03, threshold=2.338e+03, percent-clipped=10.0 2023-06-25 21:21:22,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2072946.0, ans=0.1 2023-06-25 21:21:24,827 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.85 vs. limit=15.0 2023-06-25 21:21:32,567 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2073006.0, ans=0.0 2023-06-25 21:22:18,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2073126.0, ans=0.0 2023-06-25 21:22:45,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2073186.0, ans=0.1 2023-06-25 21:23:02,197 INFO [train.py:996] (2/4) Epoch 12, batch 10100, loss[loss=0.1534, simple_loss=0.2172, pruned_loss=0.04476, over 15671.00 frames. ], tot_loss[loss=0.2286, simple_loss=0.3014, pruned_loss=0.07787, over 4265567.70 frames. ], batch size: 60, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:23:44,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=3.43 vs. limit=12.0 2023-06-25 21:24:48,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2073486.0, ans=0.125 2023-06-25 21:24:49,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2073486.0, ans=0.2 2023-06-25 21:24:52,521 INFO [train.py:996] (2/4) Epoch 12, batch 10150, loss[loss=0.237, simple_loss=0.3035, pruned_loss=0.08524, over 21790.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.306, pruned_loss=0.07952, over 4260235.87 frames. 
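The grad_scale value printed in each batch summary (8.0, 16.0, 32.0, ...) is the dynamic loss scale used for fp16 training: it grows while gradients stay finite and is cut back when an overflow is detected. A minimal sketch of one training step with torch.cuda.amp is below; train_step, the placeholder loss and the clipping value are illustrative, not the script's actual step.

import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # dynamically adjusts the loss scale, as in the logged grad_scale values

def train_step(model, optimizer, batch, device):
    optimizer.zero_grad()
    with autocast():
        loss = model(batch.to(device)).mean()   # placeholder loss computation
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                  # so clipping sees unscaled gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # illustrative value
    scaler.step(optimizer)                      # skipped if inf/nan gradients were found
    scaler.update()                             # grows or shrinks the scale (e.g. 8 -> 16 -> 32)
    return loss.detach(), scaler.get_scale()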
], batch size: 371, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:24:58,887 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.161e+02 9.851e+02 1.656e+03 2.564e+03 6.129e+03, threshold=3.312e+03, percent-clipped=27.0 2023-06-25 21:25:05,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2073546.0, ans=0.1 2023-06-25 21:25:07,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2073606.0, ans=0.125 2023-06-25 21:25:39,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2073666.0, ans=0.125 2023-06-25 21:25:55,456 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2073726.0, ans=0.0 2023-06-25 21:25:56,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2073726.0, ans=0.125 2023-06-25 21:26:41,526 INFO [train.py:996] (2/4) Epoch 12, batch 10200, loss[loss=0.2574, simple_loss=0.3381, pruned_loss=0.08836, over 21564.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3063, pruned_loss=0.07743, over 4263664.33 frames. ], batch size: 441, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:26:42,562 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.38 vs. limit=15.0 2023-06-25 21:26:47,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2073846.0, ans=0.125 2023-06-25 21:26:54,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2073846.0, ans=0.125 2023-06-25 21:27:41,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2073966.0, ans=0.125 2023-06-25 21:27:44,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2074026.0, ans=0.125 2023-06-25 21:28:28,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=22.5 2023-06-25 21:28:31,305 INFO [train.py:996] (2/4) Epoch 12, batch 10250, loss[loss=0.1835, simple_loss=0.2765, pruned_loss=0.04523, over 21657.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.3003, pruned_loss=0.0715, over 4263559.80 frames. 
], batch size: 414, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:28:33,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2074146.0, ans=10.0 2023-06-25 21:28:38,278 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.344e+02 7.759e+02 1.163e+03 1.703e+03 3.224e+03, threshold=2.326e+03, percent-clipped=0.0 2023-06-25 21:28:45,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2074146.0, ans=0.125 2023-06-25 21:28:58,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2074206.0, ans=0.0 2023-06-25 21:29:15,012 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2074206.0, ans=0.1 2023-06-25 21:29:15,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2074206.0, ans=0.0 2023-06-25 21:29:52,142 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_positive, batch_count=2074326.0, ans=0.05 2023-06-25 21:30:02,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-25 21:30:07,367 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.95 vs. limit=15.0 2023-06-25 21:30:24,259 INFO [train.py:996] (2/4) Epoch 12, batch 10300, loss[loss=0.2351, simple_loss=0.3358, pruned_loss=0.06716, over 21784.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3016, pruned_loss=0.07215, over 4268197.04 frames. ], batch size: 332, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:30:25,325 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.93 vs. limit=22.5 2023-06-25 21:31:40,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=13.36 vs. limit=15.0 2023-06-25 21:31:47,229 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.58 vs. limit=15.0 2023-06-25 21:32:00,443 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.33 vs. limit=22.5 2023-06-25 21:32:12,140 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2074686.0, ans=0.0 2023-06-25 21:32:24,452 INFO [train.py:996] (2/4) Epoch 12, batch 10350, loss[loss=0.2997, simple_loss=0.3803, pruned_loss=0.1095, over 21422.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3053, pruned_loss=0.07323, over 4269000.84 frames. 
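The learning rate in the batch summaries (about 2.44e-03 here) decays slowly with both batch count and epoch. The sketch below shows an Eden-style schedule of the kind used in this family of recipes; the exact functional form and the constants base_lr, lr_batches and lr_epochs are placeholders and will not reproduce the logged values exactly.

def eden_lr(base_lr, batch, epoch, lr_batches=7500.0, lr_epochs=1.5):
    # Smooth decay in both the global batch index and the (fractional) epoch.
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# Illustrative call only; inputs are not taken from this run:
lr = eden_lr(base_lr=0.05, batch=100000, epoch=11.5)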
], batch size: 507, lr: 2.44e-03, grad_scale: 8.0 2023-06-25 21:32:31,429 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.446e+02 9.833e+02 1.578e+03 2.772e+03 4.867e+03, threshold=3.157e+03, percent-clipped=30.0 2023-06-25 21:32:41,433 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2074806.0, ans=0.2 2023-06-25 21:32:46,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2074806.0, ans=0.125 2023-06-25 21:33:30,819 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2074926.0, ans=0.2 2023-06-25 21:33:34,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.03 vs. limit=15.0 2023-06-25 21:34:19,099 INFO [train.py:996] (2/4) Epoch 12, batch 10400, loss[loss=0.1709, simple_loss=0.243, pruned_loss=0.04942, over 21657.00 frames. ], tot_loss[loss=0.2212, simple_loss=0.2986, pruned_loss=0.0719, over 4266365.26 frames. ], batch size: 230, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:34:46,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2075106.0, ans=0.125 2023-06-25 21:34:50,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2075106.0, ans=0.0 2023-06-25 21:35:31,340 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2075226.0, ans=0.125 2023-06-25 21:36:01,620 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.88 vs. limit=15.0 2023-06-25 21:36:14,871 INFO [train.py:996] (2/4) Epoch 12, batch 10450, loss[loss=0.2387, simple_loss=0.3379, pruned_loss=0.06971, over 20769.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3039, pruned_loss=0.07528, over 4271184.89 frames. ], batch size: 607, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:36:22,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.406e+02 1.055e+03 1.797e+03 3.015e+03 7.446e+03, threshold=3.594e+03, percent-clipped=22.0 2023-06-25 21:36:40,588 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2075406.0, ans=0.2 2023-06-25 21:37:05,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.11 vs. limit=15.0 2023-06-25 21:38:00,972 INFO [train.py:996] (2/4) Epoch 12, batch 10500, loss[loss=0.2257, simple_loss=0.2903, pruned_loss=0.08049, over 21646.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3054, pruned_loss=0.07403, over 4270082.13 frames. ], batch size: 298, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:38:25,173 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-25 21:38:56,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=5.70 vs. 
limit=15.0 2023-06-25 21:38:59,228 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2075766.0, ans=0.125 2023-06-25 21:39:45,028 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2075886.0, ans=0.0 2023-06-25 21:39:53,025 INFO [train.py:996] (2/4) Epoch 12, batch 10550, loss[loss=0.239, simple_loss=0.2959, pruned_loss=0.09109, over 21256.00 frames. ], tot_loss[loss=0.223, simple_loss=0.2998, pruned_loss=0.07314, over 4246945.09 frames. ], batch size: 471, lr: 2.44e-03, grad_scale: 16.0 2023-06-25 21:40:04,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2075946.0, ans=0.125 2023-06-25 21:40:05,046 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.958e+02 1.039e+03 1.480e+03 2.237e+03 5.696e+03, threshold=2.960e+03, percent-clipped=5.0 2023-06-25 21:40:44,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2076066.0, ans=0.0 2023-06-25 21:41:13,402 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2076126.0, ans=0.125 2023-06-25 21:41:44,556 INFO [train.py:996] (2/4) Epoch 12, batch 10600, loss[loss=0.1677, simple_loss=0.2801, pruned_loss=0.02766, over 19676.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.295, pruned_loss=0.07144, over 4248875.01 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:41:45,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2076246.0, ans=0.125 2023-06-25 21:41:56,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.34 vs. limit=10.0 2023-06-25 21:42:01,415 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 21:43:41,761 INFO [train.py:996] (2/4) Epoch 12, batch 10650, loss[loss=0.1879, simple_loss=0.2728, pruned_loss=0.05148, over 21704.00 frames. ], tot_loss[loss=0.2185, simple_loss=0.2965, pruned_loss=0.07026, over 4257904.58 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:43:48,913 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.385e+02 8.599e+02 1.360e+03 2.205e+03 4.639e+03, threshold=2.719e+03, percent-clipped=13.0 2023-06-25 21:45:17,420 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2076786.0, ans=0.125 2023-06-25 21:45:27,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2076786.0, ans=0.0 2023-06-25 21:45:31,632 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2076846.0, ans=0.2 2023-06-25 21:45:32,602 INFO [train.py:996] (2/4) Epoch 12, batch 10700, loss[loss=0.2746, simple_loss=0.3478, pruned_loss=0.1007, over 21361.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.2946, pruned_loss=0.07024, over 4254752.17 frames. 
], batch size: 549, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:46:04,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2076906.0, ans=0.0 2023-06-25 21:46:29,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2076966.0, ans=0.04949747468305833 2023-06-25 21:46:46,130 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.99 vs. limit=10.0 2023-06-25 21:46:49,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2077026.0, ans=0.125 2023-06-25 21:46:55,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2077026.0, ans=0.05 2023-06-25 21:47:27,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.94 vs. limit=22.5 2023-06-25 21:47:29,867 INFO [train.py:996] (2/4) Epoch 12, batch 10750, loss[loss=0.2921, simple_loss=0.3877, pruned_loss=0.09824, over 21641.00 frames. ], tot_loss[loss=0.2259, simple_loss=0.3042, pruned_loss=0.07382, over 4259518.81 frames. ], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:47:36,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.154e+02 8.331e+02 1.141e+03 1.485e+03 5.663e+03, threshold=2.282e+03, percent-clipped=4.0 2023-06-25 21:48:23,867 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.55 vs. limit=22.5 2023-06-25 21:48:56,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2077386.0, ans=0.125 2023-06-25 21:49:20,800 INFO [train.py:996] (2/4) Epoch 12, batch 10800, loss[loss=0.2334, simple_loss=0.3096, pruned_loss=0.07862, over 21381.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3092, pruned_loss=0.07443, over 4260066.87 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 21:49:41,915 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2077446.0, ans=0.2 2023-06-25 21:49:47,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=2077446.0, ans=15.0 2023-06-25 21:50:23,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.63 vs. limit=12.0 2023-06-25 21:50:31,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.32 vs. limit=10.0 2023-06-25 21:51:03,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2077686.0, ans=0.2 2023-06-25 21:51:15,785 INFO [train.py:996] (2/4) Epoch 12, batch 10850, loss[loss=0.2026, simple_loss=0.2726, pruned_loss=0.06635, over 21802.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3107, pruned_loss=0.07551, over 4260105.39 frames. 
], batch size: 317, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 21:51:31,521 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.495e+02 1.105e+03 1.666e+03 2.535e+03 5.598e+03, threshold=3.333e+03, percent-clipped=30.0 2023-06-25 21:51:35,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2077746.0, ans=0.125 2023-06-25 21:51:39,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2077806.0, ans=0.125 2023-06-25 21:51:45,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2077806.0, ans=0.125 2023-06-25 21:52:27,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2077926.0, ans=0.125 2023-06-25 21:53:05,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.83 vs. limit=15.0 2023-06-25 21:53:16,014 INFO [train.py:996] (2/4) Epoch 12, batch 10900, loss[loss=0.2025, simple_loss=0.2846, pruned_loss=0.06019, over 21375.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3051, pruned_loss=0.07463, over 4260298.26 frames. ], batch size: 194, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:54:04,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2078166.0, ans=0.1 2023-06-25 21:54:15,173 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2078226.0, ans=0.1 2023-06-25 21:55:00,805 INFO [train.py:996] (2/4) Epoch 12, batch 10950, loss[loss=0.3005, simple_loss=0.4316, pruned_loss=0.0847, over 19688.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3023, pruned_loss=0.07304, over 4259874.59 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:55:18,495 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.435e+02 8.397e+02 1.288e+03 2.017e+03 5.203e+03, threshold=2.576e+03, percent-clipped=6.0 2023-06-25 21:55:19,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2078346.0, ans=0.0 2023-06-25 21:55:24,742 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2078406.0, ans=0.1 2023-06-25 21:55:50,920 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2078466.0, ans=0.0 2023-06-25 21:55:54,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2078466.0, ans=0.125 2023-06-25 21:56:51,920 INFO [train.py:996] (2/4) Epoch 12, batch 11000, loss[loss=0.2375, simple_loss=0.3035, pruned_loss=0.08573, over 21836.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.3015, pruned_loss=0.07416, over 4252345.32 frames. 
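The reported batch size swings from under 100 to over 700 utterances because batches are assembled by total audio duration rather than by a fixed sentence count, so buckets of short utterances yield very large batches. A generic duration-capped batching sketch is below; duration_batches and the default cap are illustrative, not the sampler actually used in this run.

def duration_batches(utts, max_duration=900.0):
    # utts: iterable of (utt_id, seconds); yields lists of ids whose total
    # duration stays under max_duration (a placeholder cap, in seconds).
    batch, total = [], 0.0
    for utt_id, seconds in utts:
        if batch and total + seconds > max_duration:
            yield batch
            batch, total = [], 0.0
        batch.append(utt_id)
        total += seconds
    if batch:
        yield batch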
], batch size: 298, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:57:13,403 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2078706.0, ans=0.125 2023-06-25 21:57:25,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2078706.0, ans=0.125 2023-06-25 21:57:58,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2078826.0, ans=0.125 2023-06-25 21:58:38,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2078946.0, ans=0.0 2023-06-25 21:58:39,784 INFO [train.py:996] (2/4) Epoch 12, batch 11050, loss[loss=0.2147, simple_loss=0.274, pruned_loss=0.07768, over 21845.00 frames. ], tot_loss[loss=0.226, simple_loss=0.2999, pruned_loss=0.07605, over 4261505.49 frames. ], batch size: 98, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 21:58:42,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2078946.0, ans=0.125 2023-06-25 21:58:57,595 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.227e+02 8.641e+02 1.192e+03 1.708e+03 4.413e+03, threshold=2.383e+03, percent-clipped=10.0 2023-06-25 21:59:22,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2079006.0, ans=0.0 2023-06-25 21:59:49,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2079126.0, ans=0.0 2023-06-25 21:59:53,942 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:00:30,676 INFO [train.py:996] (2/4) Epoch 12, batch 11100, loss[loss=0.2286, simple_loss=0.3099, pruned_loss=0.07362, over 21572.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2975, pruned_loss=0.07614, over 4253447.47 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:00:49,167 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2079246.0, ans=0.125 2023-06-25 22:01:21,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2079366.0, ans=0.0 2023-06-25 22:02:07,789 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2079486.0, ans=0.0 2023-06-25 22:02:18,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2079486.0, ans=0.0 2023-06-25 22:02:21,207 INFO [train.py:996] (2/4) Epoch 12, batch 11150, loss[loss=0.1839, simple_loss=0.256, pruned_loss=0.05587, over 21009.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2945, pruned_loss=0.07598, over 4250384.98 frames. ], batch size: 608, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:02:38,402 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.021e+02 8.746e+02 1.220e+03 1.810e+03 4.073e+03, threshold=2.441e+03, percent-clipped=6.0 2023-06-25 22:03:55,471 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.attention_skip_rate, batch_count=2079786.0, ans=0.0 2023-06-25 22:04:09,075 INFO [train.py:996] (2/4) Epoch 12, batch 11200, loss[loss=0.2187, simple_loss=0.2796, pruned_loss=0.07893, over 21280.00 frames. 
], tot_loss[loss=0.2216, simple_loss=0.2932, pruned_loss=0.07497, over 4253675.53 frames. ], batch size: 144, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:04:24,279 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2079846.0, ans=0.1 2023-06-25 22:04:39,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2079906.0, ans=0.125 2023-06-25 22:04:56,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2079966.0, ans=0.05 2023-06-25 22:05:22,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2080026.0, ans=0.125 2023-06-25 22:05:24,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2080026.0, ans=0.125 2023-06-25 22:05:48,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2080086.0, ans=0.125 2023-06-25 22:05:56,520 INFO [train.py:996] (2/4) Epoch 12, batch 11250, loss[loss=0.2439, simple_loss=0.324, pruned_loss=0.08194, over 21811.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2916, pruned_loss=0.07512, over 4259110.19 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:06:14,393 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.198e+02 9.018e+02 1.214e+03 1.830e+03 3.568e+03, threshold=2.429e+03, percent-clipped=9.0 2023-06-25 22:06:23,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.min_abs, batch_count=2080206.0, ans=0.5 2023-06-25 22:07:09,654 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.82 vs. limit=15.0 2023-06-25 22:07:46,502 INFO [train.py:996] (2/4) Epoch 12, batch 11300, loss[loss=0.2109, simple_loss=0.2858, pruned_loss=0.06805, over 21820.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2933, pruned_loss=0.07517, over 4262995.13 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:08:14,897 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2080506.0, ans=0.125 2023-06-25 22:08:37,919 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2080566.0, ans=0.07 2023-06-25 22:08:57,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2080626.0, ans=0.0 2023-06-25 22:09:25,080 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2080686.0, ans=0.0 2023-06-25 22:09:41,350 INFO [train.py:996] (2/4) Epoch 12, batch 11350, loss[loss=0.2542, simple_loss=0.3412, pruned_loss=0.08359, over 21253.00 frames. ], tot_loss[loss=0.2217, simple_loss=0.2941, pruned_loss=0.07462, over 4255023.03 frames. 
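The scaling.py:962 entries compare a per-module "whitening" metric against a limit (e.g. "metric=5.82 vs. limit=15.0"), i.e. they measure how far the channel covariance of an activation is from being white. The function below computes one reasonable such measure (1.0 for a perfectly white covariance, larger when variance concentrates in a few directions); it is a guess at the underlying idea, not the script's actual metric.

import torch

def whiteness_metric(feats):
    # feats: (num_frames, num_channels) activations for one module.
    x = feats - feats.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / x.shape[0]
    eigs = torch.linalg.eigvalsh(cov)
    # Ratio of mean squared eigenvalue to squared mean eigenvalue:
    # 1.0 when all eigenvalues are equal (white), larger otherwise.
    return float((eigs ** 2).mean() / eigs.mean() ** 2)

# A whitening penalty would then be applied only when this metric exceeds the
# logged limit, which appears to be what the "metric=... vs. limit=..." lines check.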
], batch size: 548, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:09:54,130 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.747e+02 9.372e+02 1.256e+03 1.751e+03 3.011e+03, threshold=2.512e+03, percent-clipped=9.0 2023-06-25 22:10:01,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2080806.0, ans=0.07 2023-06-25 22:11:36,906 INFO [train.py:996] (2/4) Epoch 12, batch 11400, loss[loss=0.266, simple_loss=0.3477, pruned_loss=0.09219, over 21653.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3019, pruned_loss=0.07755, over 4262108.90 frames. ], batch size: 415, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:11:46,161 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2081046.0, ans=0.125 2023-06-25 22:11:53,148 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2081106.0, ans=0.1 2023-06-25 22:11:54,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2081106.0, ans=0.1 2023-06-25 22:11:54,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2081106.0, ans=0.0 2023-06-25 22:12:10,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2081106.0, ans=0.125 2023-06-25 22:12:26,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2081166.0, ans=0.1 2023-06-25 22:13:26,751 INFO [train.py:996] (2/4) Epoch 12, batch 11450, loss[loss=0.192, simple_loss=0.2735, pruned_loss=0.05525, over 20094.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3028, pruned_loss=0.07619, over 4251371.29 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:13:46,244 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.691e+02 8.654e+02 1.291e+03 1.959e+03 4.523e+03, threshold=2.583e+03, percent-clipped=10.0 2023-06-25 22:14:06,654 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2081406.0, ans=0.125 2023-06-25 22:14:16,217 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=6.96 vs. limit=15.0 2023-06-25 22:14:18,844 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2081466.0, ans=0.125 2023-06-25 22:14:18,855 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2081466.0, ans=0.125 2023-06-25 22:14:40,321 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=5.69 vs. 
limit=15.0 2023-06-25 22:14:50,430 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2081526.0, ans=0.125 2023-06-25 22:14:51,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2081526.0, ans=0.0 2023-06-25 22:14:58,751 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2081586.0, ans=0.125 2023-06-25 22:15:08,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=11.90 vs. limit=22.5 2023-06-25 22:15:15,769 INFO [train.py:996] (2/4) Epoch 12, batch 11500, loss[loss=0.2948, simple_loss=0.3532, pruned_loss=0.1182, over 21265.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3066, pruned_loss=0.07747, over 4257754.48 frames. ], batch size: 143, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:15:23,363 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2081646.0, ans=0.0 2023-06-25 22:15:23,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2081646.0, ans=0.0 2023-06-25 22:15:58,258 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:16:19,182 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2081766.0, ans=0.1 2023-06-25 22:16:21,295 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.15 vs. limit=15.0 2023-06-25 22:16:42,348 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2081826.0, ans=0.125 2023-06-25 22:16:52,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2081886.0, ans=0.2 2023-06-25 22:16:56,821 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=22.5 2023-06-25 22:17:10,042 INFO [train.py:996] (2/4) Epoch 12, batch 11550, loss[loss=0.2723, simple_loss=0.381, pruned_loss=0.08179, over 21279.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3117, pruned_loss=0.07721, over 4260818.16 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 22:17:10,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2081946.0, ans=0.125 2023-06-25 22:17:30,024 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.895e+02 1.016e+03 1.404e+03 2.173e+03 5.477e+03, threshold=2.808e+03, percent-clipped=17.0 2023-06-25 22:19:08,988 INFO [train.py:996] (2/4) Epoch 12, batch 11600, loss[loss=0.2304, simple_loss=0.3246, pruned_loss=0.06809, over 21398.00 frames. ], tot_loss[loss=0.2416, simple_loss=0.3263, pruned_loss=0.07848, over 4269361.38 frames. 
], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:19:44,640 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2082306.0, ans=0.0 2023-06-25 22:19:58,226 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.26 vs. limit=12.0 2023-06-25 22:20:02,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2082366.0, ans=0.0 2023-06-25 22:20:06,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2082366.0, ans=0.2 2023-06-25 22:20:12,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2082426.0, ans=0.125 2023-06-25 22:20:58,723 INFO [train.py:996] (2/4) Epoch 12, batch 11650, loss[loss=0.2251, simple_loss=0.3051, pruned_loss=0.07259, over 21737.00 frames. ], tot_loss[loss=0.2441, simple_loss=0.331, pruned_loss=0.07857, over 4262708.10 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:21:03,466 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.05 vs. limit=15.0 2023-06-25 22:21:17,678 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.996e+02 1.003e+03 1.550e+03 2.140e+03 4.406e+03, threshold=3.101e+03, percent-clipped=13.0 2023-06-25 22:21:59,637 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.70 vs. limit=15.0 2023-06-25 22:22:30,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2082786.0, ans=0.0 2023-06-25 22:22:47,749 INFO [train.py:996] (2/4) Epoch 12, batch 11700, loss[loss=0.2828, simple_loss=0.3409, pruned_loss=0.1124, over 21300.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.3245, pruned_loss=0.07855, over 4246993.85 frames. ], batch size: 471, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:23:49,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2082966.0, ans=0.125 2023-06-25 22:24:30,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.66 vs. limit=10.0 2023-06-25 22:24:38,045 INFO [train.py:996] (2/4) Epoch 12, batch 11750, loss[loss=0.2566, simple_loss=0.3225, pruned_loss=0.09535, over 21435.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3159, pruned_loss=0.07755, over 4257923.90 frames. 
], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:24:38,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2083146.0, ans=0.125 2023-06-25 22:24:57,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.375e+02 1.055e+03 1.810e+03 2.598e+03 5.314e+03, threshold=3.620e+03, percent-clipped=16.0 2023-06-25 22:24:59,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2083206.0, ans=0.125 2023-06-25 22:25:42,193 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2083326.0, ans=0.125 2023-06-25 22:26:24,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2083386.0, ans=0.2 2023-06-25 22:26:32,971 INFO [train.py:996] (2/4) Epoch 12, batch 11800, loss[loss=0.2868, simple_loss=0.384, pruned_loss=0.09476, over 21371.00 frames. ], tot_loss[loss=0.2378, simple_loss=0.3171, pruned_loss=0.07924, over 4261080.57 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:27:34,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2083626.0, ans=0.1 2023-06-25 22:28:23,937 INFO [train.py:996] (2/4) Epoch 12, batch 11850, loss[loss=0.2248, simple_loss=0.3194, pruned_loss=0.06504, over 21850.00 frames. ], tot_loss[loss=0.2357, simple_loss=0.3159, pruned_loss=0.07779, over 4263950.22 frames. ], batch size: 371, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:28:31,881 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.66 vs. limit=15.0 2023-06-25 22:28:37,199 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.904e+02 9.109e+02 1.402e+03 2.284e+03 4.807e+03, threshold=2.803e+03, percent-clipped=4.0 2023-06-25 22:29:32,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2083926.0, ans=0.125 2023-06-25 22:29:47,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2083926.0, ans=0.0 2023-06-25 22:30:13,540 INFO [train.py:996] (2/4) Epoch 12, batch 11900, loss[loss=0.2524, simple_loss=0.3346, pruned_loss=0.0851, over 21616.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3152, pruned_loss=0.07527, over 4265317.71 frames. ], batch size: 414, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:31:29,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.max_abs, batch_count=2084226.0, ans=10.0 2023-06-25 22:32:05,347 INFO [train.py:996] (2/4) Epoch 12, batch 11950, loss[loss=0.2005, simple_loss=0.2946, pruned_loss=0.05317, over 21666.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3159, pruned_loss=0.07278, over 4266669.76 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:32:24,562 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.025e+02 8.288e+02 1.263e+03 1.955e+03 4.768e+03, threshold=2.526e+03, percent-clipped=7.0 2023-06-25 22:32:43,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2084406.0, ans=0.035 2023-06-25 22:33:54,636 INFO [train.py:996] (2/4) Epoch 12, batch 12000, loss[loss=0.237, simple_loss=0.2927, pruned_loss=0.09066, over 21723.00 frames. 
], tot_loss[loss=0.2258, simple_loss=0.3092, pruned_loss=0.07119, over 4260433.98 frames. ], batch size: 112, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 22:33:54,637 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-25 22:34:17,937 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2583, simple_loss=0.3504, pruned_loss=0.08306, over 1796401.00 frames. 2023-06-25 22:34:17,938 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-25 22:34:29,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2084646.0, ans=0.0 2023-06-25 22:35:08,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2084766.0, ans=0.1 2023-06-25 22:35:15,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2084766.0, ans=0.125 2023-06-25 22:35:49,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2084886.0, ans=0.0 2023-06-25 22:36:03,690 INFO [train.py:996] (2/4) Epoch 12, batch 12050, loss[loss=0.2158, simple_loss=0.2821, pruned_loss=0.07478, over 21811.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3075, pruned_loss=0.07334, over 4259250.52 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:36:06,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2084946.0, ans=0.1 2023-06-25 22:36:18,976 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.769e+02 8.043e+02 1.290e+03 1.841e+03 5.321e+03, threshold=2.580e+03, percent-clipped=14.0 2023-06-25 22:36:38,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2085006.0, ans=0.2 2023-06-25 22:37:10,554 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2085126.0, ans=0.125 2023-06-25 22:37:14,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2085126.0, ans=0.0 2023-06-25 22:37:32,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2085186.0, ans=0.1 2023-06-25 22:37:43,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2085186.0, ans=0.1 2023-06-25 22:37:47,020 INFO [train.py:996] (2/4) Epoch 12, batch 12100, loss[loss=0.2527, simple_loss=0.3372, pruned_loss=0.08407, over 21437.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.313, pruned_loss=0.07817, over 4264538.33 frames. ], batch size: 131, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:37:50,405 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.55 vs. 
limit=15.0 2023-06-25 22:38:21,990 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2085306.0, ans=0.125 2023-06-25 22:38:47,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2085366.0, ans=0.125 2023-06-25 22:39:33,608 INFO [train.py:996] (2/4) Epoch 12, batch 12150, loss[loss=0.1286, simple_loss=0.1676, pruned_loss=0.04485, over 17164.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3154, pruned_loss=0.07769, over 4257377.55 frames. ], batch size: 62, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:39:40,958 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2085546.0, ans=0.125 2023-06-25 22:39:50,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2085546.0, ans=0.2 2023-06-25 22:39:52,144 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2085546.0, ans=0.125 2023-06-25 22:39:59,224 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2085546.0, ans=0.0 2023-06-25 22:40:00,401 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.100e+02 9.375e+02 1.398e+03 2.052e+03 3.851e+03, threshold=2.797e+03, percent-clipped=12.0 2023-06-25 22:40:16,562 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2085606.0, ans=0.125 2023-06-25 22:41:26,890 INFO [train.py:996] (2/4) Epoch 12, batch 12200, loss[loss=0.1966, simple_loss=0.2677, pruned_loss=0.06275, over 21434.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3128, pruned_loss=0.07698, over 4263523.23 frames. ], batch size: 389, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:41:34,253 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2085846.0, ans=0.125 2023-06-25 22:41:42,572 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2085906.0, ans=0.125 2023-06-25 22:43:08,637 INFO [train.py:996] (2/4) Epoch 12, batch 12250, loss[loss=0.1844, simple_loss=0.2568, pruned_loss=0.05601, over 21539.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3047, pruned_loss=0.07476, over 4265763.78 frames. ], batch size: 195, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:43:09,906 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.95 vs. limit=15.0 2023-06-25 22:43:24,139 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.778e+02 9.164e+02 1.305e+03 1.876e+03 4.319e+03, threshold=2.609e+03, percent-clipped=7.0 2023-06-25 22:44:43,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2086386.0, ans=0.125 2023-06-25 22:44:56,869 INFO [train.py:996] (2/4) Epoch 12, batch 12300, loss[loss=0.1704, simple_loss=0.2648, pruned_loss=0.03802, over 21755.00 frames. ], tot_loss[loss=0.2172, simple_loss=0.2965, pruned_loss=0.06899, over 4264044.48 frames. 
], batch size: 351, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:45:50,989 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2086566.0, ans=0.1 2023-06-25 22:46:39,949 INFO [train.py:996] (2/4) Epoch 12, batch 12350, loss[loss=0.2144, simple_loss=0.298, pruned_loss=0.06539, over 21782.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.3, pruned_loss=0.0688, over 4269432.56 frames. ], batch size: 298, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:46:54,513 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2086746.0, ans=0.0 2023-06-25 22:47:02,272 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.943e+02 9.460e+02 1.568e+03 2.170e+03 4.986e+03, threshold=3.136e+03, percent-clipped=17.0 2023-06-25 22:47:14,647 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.68 vs. limit=15.0 2023-06-25 22:47:24,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2086866.0, ans=0.0 2023-06-25 22:47:32,740 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2086866.0, ans=0.125 2023-06-25 22:47:55,122 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 22:48:21,624 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2086986.0, ans=0.0 2023-06-25 22:48:33,558 INFO [train.py:996] (2/4) Epoch 12, batch 12400, loss[loss=0.2703, simple_loss=0.3243, pruned_loss=0.1081, over 21802.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.3028, pruned_loss=0.07251, over 4281588.88 frames. ], batch size: 508, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:48:37,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2087046.0, ans=0.125 2023-06-25 22:48:54,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2087046.0, ans=0.1 2023-06-25 22:49:34,712 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.25 vs. limit=6.0 2023-06-25 22:50:17,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2087286.0, ans=0.04949747468305833 2023-06-25 22:50:21,728 INFO [train.py:996] (2/4) Epoch 12, batch 12450, loss[loss=0.2455, simple_loss=0.3532, pruned_loss=0.06889, over 19638.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3092, pruned_loss=0.07618, over 4281995.27 frames. ], batch size: 702, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:50:37,833 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=22.5 2023-06-25 22:50:45,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.188e+02 9.401e+02 1.561e+03 2.264e+03 4.297e+03, threshold=3.121e+03, percent-clipped=10.0 2023-06-25 22:52:16,035 INFO [train.py:996] (2/4) Epoch 12, batch 12500, loss[loss=0.2804, simple_loss=0.3714, pruned_loss=0.09468, over 21602.00 frames. 
], tot_loss[loss=0.2397, simple_loss=0.319, pruned_loss=0.08018, over 4279729.13 frames. ], batch size: 230, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:53:28,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2087826.0, ans=0.0 2023-06-25 22:53:54,958 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.38 vs. limit=22.5 2023-06-25 22:54:06,708 INFO [train.py:996] (2/4) Epoch 12, batch 12550, loss[loss=0.316, simple_loss=0.3833, pruned_loss=0.1243, over 21338.00 frames. ], tot_loss[loss=0.2459, simple_loss=0.325, pruned_loss=0.08337, over 4279980.15 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:54:32,466 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.421e+02 9.265e+02 1.193e+03 1.894e+03 3.118e+03, threshold=2.386e+03, percent-clipped=0.0 2023-06-25 22:54:52,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2088006.0, ans=0.0 2023-06-25 22:54:52,261 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2088006.0, ans=0.125 2023-06-25 22:56:05,550 INFO [train.py:996] (2/4) Epoch 12, batch 12600, loss[loss=0.1966, simple_loss=0.2864, pruned_loss=0.05339, over 21676.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3249, pruned_loss=0.08121, over 4279614.20 frames. ], batch size: 247, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:56:51,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=12.0 2023-06-25 22:57:51,804 INFO [train.py:996] (2/4) Epoch 12, batch 12650, loss[loss=0.2044, simple_loss=0.2779, pruned_loss=0.06539, over 21870.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3162, pruned_loss=0.07657, over 4282075.21 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:58:08,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.676e+02 8.577e+02 1.170e+03 1.835e+03 4.573e+03, threshold=2.341e+03, percent-clipped=16.0 2023-06-25 22:58:54,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn2.whiten.whitening_limit, batch_count=2088726.0, ans=22.5 2023-06-25 22:59:40,849 INFO [train.py:996] (2/4) Epoch 12, batch 12700, loss[loss=0.342, simple_loss=0.3855, pruned_loss=0.1492, over 21403.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3147, pruned_loss=0.07846, over 4289405.51 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 22:59:46,883 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:00:32,242 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:00:44,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2088966.0, ans=0.1 2023-06-25 23:00:50,275 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.75 vs. 
limit=22.5 2023-06-25 23:01:04,712 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2089026.0, ans=0.1 2023-06-25 23:01:30,770 INFO [train.py:996] (2/4) Epoch 12, batch 12750, loss[loss=0.284, simple_loss=0.3493, pruned_loss=0.1093, over 21589.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3155, pruned_loss=0.07861, over 4287117.74 frames. ], batch size: 507, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:01:51,851 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.914e+02 9.146e+02 1.295e+03 2.167e+03 3.972e+03, threshold=2.590e+03, percent-clipped=20.0 2023-06-25 23:03:18,697 INFO [train.py:996] (2/4) Epoch 12, batch 12800, loss[loss=0.209, simple_loss=0.2797, pruned_loss=0.06917, over 21557.00 frames. ], tot_loss[loss=0.2353, simple_loss=0.3131, pruned_loss=0.07878, over 4282178.01 frames. ], batch size: 548, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 23:03:23,324 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.28 vs. limit=15.0 2023-06-25 23:04:28,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2089626.0, ans=0.0 2023-06-25 23:04:35,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2089626.0, ans=0.1 2023-06-25 23:04:35,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2089626.0, ans=0.2 2023-06-25 23:05:01,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2089686.0, ans=0.0 2023-06-25 23:05:11,189 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2089686.0, ans=0.125 2023-06-25 23:05:12,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2089686.0, ans=0.125 2023-06-25 23:05:15,818 INFO [train.py:996] (2/4) Epoch 12, batch 12850, loss[loss=0.2272, simple_loss=0.3023, pruned_loss=0.07602, over 21373.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3153, pruned_loss=0.08057, over 4286171.09 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:05:41,771 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.120e+02 9.253e+02 1.228e+03 1.682e+03 4.174e+03, threshold=2.456e+03, percent-clipped=9.0 2023-06-25 23:06:27,250 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2089926.0, ans=0.125 2023-06-25 23:06:51,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2089986.0, ans=0.125 2023-06-25 23:07:15,448 INFO [train.py:996] (2/4) Epoch 12, batch 12900, loss[loss=0.2709, simple_loss=0.3679, pruned_loss=0.08698, over 21168.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.313, pruned_loss=0.0769, over 4279233.92 frames. 
], batch size: 548, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:07:21,040 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2090046.0, ans=0.2 2023-06-25 23:07:46,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2090106.0, ans=0.1 2023-06-25 23:08:33,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2090226.0, ans=0.0 2023-06-25 23:08:41,767 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2090286.0, ans=0.125 2023-06-25 23:08:58,874 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2090286.0, ans=0.1 2023-06-25 23:09:04,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2090346.0, ans=0.0 2023-06-25 23:09:05,318 INFO [train.py:996] (2/4) Epoch 12, batch 12950, loss[loss=0.243, simple_loss=0.3219, pruned_loss=0.08203, over 21592.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3108, pruned_loss=0.07471, over 4283367.51 frames. ], batch size: 414, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:09:27,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2090406.0, ans=0.125 2023-06-25 23:09:30,341 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.722e+02 9.455e+02 1.379e+03 2.078e+03 4.999e+03, threshold=2.758e+03, percent-clipped=14.0 2023-06-25 23:09:39,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2090406.0, ans=0.125 2023-06-25 23:10:29,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2090586.0, ans=0.125 2023-06-25 23:10:43,435 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.04 vs. limit=10.0 2023-06-25 23:11:00,086 INFO [train.py:996] (2/4) Epoch 12, batch 13000, loss[loss=0.1448, simple_loss=0.2083, pruned_loss=0.04065, over 21808.00 frames. ], tot_loss[loss=0.2325, simple_loss=0.3127, pruned_loss=0.0761, over 4270849.65 frames. ], batch size: 98, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:11:41,726 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2090766.0, ans=0.0 2023-06-25 23:12:09,635 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.89 vs. limit=5.0 2023-06-25 23:12:12,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2090826.0, ans=0.125 2023-06-25 23:12:41,973 INFO [train.py:996] (2/4) Epoch 12, batch 13050, loss[loss=0.2692, simple_loss=0.3269, pruned_loss=0.1058, over 21668.00 frames. ], tot_loss[loss=0.2277, simple_loss=0.3076, pruned_loss=0.07396, over 4270517.25 frames. ], batch size: 471, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:12:52,022 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. 
limit=22.5 2023-06-25 23:13:03,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2091006.0, ans=0.2 2023-06-25 23:13:06,374 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.723e+02 9.569e+02 1.242e+03 1.863e+03 3.264e+03, threshold=2.484e+03, percent-clipped=2.0 2023-06-25 23:13:35,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.45 vs. limit=15.0 2023-06-25 23:14:38,567 INFO [train.py:996] (2/4) Epoch 12, batch 13100, loss[loss=0.2694, simple_loss=0.3444, pruned_loss=0.09716, over 21842.00 frames. ], tot_loss[loss=0.2273, simple_loss=0.3073, pruned_loss=0.07362, over 4278696.13 frames. ], batch size: 124, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:14:44,462 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2091246.0, ans=0.125 2023-06-25 23:16:30,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2091546.0, ans=0.125 2023-06-25 23:16:31,076 INFO [train.py:996] (2/4) Epoch 12, batch 13150, loss[loss=0.2122, simple_loss=0.2876, pruned_loss=0.06841, over 21566.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3104, pruned_loss=0.07684, over 4275683.90 frames. ], batch size: 263, lr: 2.43e-03, grad_scale: 16.0 2023-06-25 23:16:55,956 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.201e+02 8.745e+02 1.410e+03 2.124e+03 5.219e+03, threshold=2.820e+03, percent-clipped=16.0 2023-06-25 23:16:56,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2091606.0, ans=0.2 2023-06-25 23:17:30,415 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2091666.0, ans=0.1 2023-06-25 23:18:24,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.51 vs. limit=6.0 2023-06-25 23:18:28,223 INFO [train.py:996] (2/4) Epoch 12, batch 13200, loss[loss=0.2556, simple_loss=0.3281, pruned_loss=0.09153, over 21803.00 frames. ], tot_loss[loss=0.2311, simple_loss=0.3094, pruned_loss=0.07645, over 4275976.13 frames. ], batch size: 441, lr: 2.43e-03, grad_scale: 32.0 2023-06-25 23:18:52,374 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2091906.0, ans=0.125 2023-06-25 23:19:20,869 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.71 vs. limit=22.5 2023-06-25 23:19:57,039 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.63 vs. limit=15.0 2023-06-25 23:20:16,894 INFO [train.py:996] (2/4) Epoch 12, batch 13250, loss[loss=0.2542, simple_loss=0.3256, pruned_loss=0.09143, over 21864.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.309, pruned_loss=0.07786, over 4280083.65 frames. ], batch size: 107, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:20:35,779 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.67 vs. 
limit=10.0 2023-06-25 23:20:45,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.712e+02 9.106e+02 1.656e+03 2.561e+03 5.361e+03, threshold=3.313e+03, percent-clipped=20.0 2023-06-25 23:22:09,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2092386.0, ans=0.125 2023-06-25 23:22:12,991 INFO [train.py:996] (2/4) Epoch 12, batch 13300, loss[loss=0.2275, simple_loss=0.3, pruned_loss=0.07745, over 21261.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3138, pruned_loss=0.07749, over 4282167.11 frames. ], batch size: 159, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:22:23,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2092446.0, ans=0.1 2023-06-25 23:22:29,437 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten.whitening_limit, batch_count=2092506.0, ans=15.0 2023-06-25 23:23:00,302 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2092566.0, ans=0.125 2023-06-25 23:23:33,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2092626.0, ans=0.0 2023-06-25 23:24:01,799 INFO [train.py:996] (2/4) Epoch 12, batch 13350, loss[loss=0.2698, simple_loss=0.3438, pruned_loss=0.09788, over 21461.00 frames. ], tot_loss[loss=0.2411, simple_loss=0.3206, pruned_loss=0.08076, over 4281300.70 frames. ], batch size: 211, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:24:37,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.381e+02 8.931e+02 1.432e+03 2.029e+03 4.000e+03, threshold=2.864e+03, percent-clipped=9.0 2023-06-25 23:25:21,227 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2092926.0, ans=0.0 2023-06-25 23:25:35,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2092986.0, ans=0.07 2023-06-25 23:25:37,652 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1.whitening_limit, batch_count=2092986.0, ans=10.0 2023-06-25 23:25:39,281 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=6.98 vs. limit=15.0 2023-06-25 23:25:51,917 INFO [train.py:996] (2/4) Epoch 12, batch 13400, loss[loss=0.2296, simple_loss=0.2964, pruned_loss=0.08137, over 21339.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.321, pruned_loss=0.08273, over 4281730.15 frames. 
], batch size: 143, lr: 2.43e-03, grad_scale: 8.0 2023-06-25 23:25:54,504 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2093046.0, ans=0.125 2023-06-25 23:25:54,570 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2093046.0, ans=0.2 2023-06-25 23:26:07,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2093046.0, ans=0.0 2023-06-25 23:26:19,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2093046.0, ans=0.125 2023-06-25 23:27:12,970 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2093226.0, ans=0.0 2023-06-25 23:27:16,062 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2093226.0, ans=0.125 2023-06-25 23:27:25,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.77 vs. limit=15.0 2023-06-25 23:27:50,016 INFO [train.py:996] (2/4) Epoch 12, batch 13450, loss[loss=0.2083, simple_loss=0.2882, pruned_loss=0.06426, over 21678.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3222, pruned_loss=0.08532, over 4282923.67 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:27:56,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2093346.0, ans=0.0 2023-06-25 23:28:07,344 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.92 vs. limit=12.0 2023-06-25 23:28:12,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2093406.0, ans=0.0 2023-06-25 23:28:18,590 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.861e+02 9.284e+02 1.216e+03 1.797e+03 3.595e+03, threshold=2.431e+03, percent-clipped=4.0 2023-06-25 23:28:19,668 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-06-25 23:28:27,999 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2093406.0, ans=0.1 2023-06-25 23:28:31,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.12 vs. limit=15.0 2023-06-25 23:29:00,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2093526.0, ans=0.0 2023-06-25 23:29:02,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-25 23:29:47,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.80 vs. limit=15.0 2023-06-25 23:29:47,941 INFO [train.py:996] (2/4) Epoch 12, batch 13500, loss[loss=0.2723, simple_loss=0.3467, pruned_loss=0.09893, over 21700.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3121, pruned_loss=0.0819, over 4275636.54 frames. 
], batch size: 391, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:30:40,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2093766.0, ans=0.0 2023-06-25 23:30:52,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2093826.0, ans=0.0 2023-06-25 23:30:56,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2093826.0, ans=0.1 2023-06-25 23:31:25,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2093886.0, ans=0.0 2023-06-25 23:31:38,024 INFO [train.py:996] (2/4) Epoch 12, batch 13550, loss[loss=0.2963, simple_loss=0.3881, pruned_loss=0.1023, over 21794.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3173, pruned_loss=0.08062, over 4275597.76 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:31:51,009 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2093946.0, ans=0.2 2023-06-25 23:32:07,687 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.160e+02 9.861e+02 1.411e+03 2.332e+03 4.219e+03, threshold=2.822e+03, percent-clipped=19.0 2023-06-25 23:32:39,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2094066.0, ans=0.0 2023-06-25 23:33:15,071 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2094186.0, ans=0.125 2023-06-25 23:33:28,568 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=5.213e-03 2023-06-25 23:33:29,551 INFO [train.py:996] (2/4) Epoch 12, batch 13600, loss[loss=0.2186, simple_loss=0.3069, pruned_loss=0.06511, over 21815.00 frames. ], tot_loss[loss=0.2396, simple_loss=0.3185, pruned_loss=0.08034, over 4278811.01 frames. ], batch size: 414, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:33:33,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2094246.0, ans=0.0 2023-06-25 23:34:31,168 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2094366.0, ans=0.125 2023-06-25 23:34:43,054 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.38 vs. limit=10.0 2023-06-25 23:35:06,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.whiten.whitening_limit, batch_count=2094486.0, ans=12.0 2023-06-25 23:35:20,542 INFO [train.py:996] (2/4) Epoch 12, batch 13650, loss[loss=0.2273, simple_loss=0.29, pruned_loss=0.08231, over 21844.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3136, pruned_loss=0.07763, over 4282115.59 frames. ], batch size: 98, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:35:48,255 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.01 vs. 
limit=15.0 2023-06-25 23:35:50,286 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.803e+02 6.699e+02 9.977e+02 1.659e+03 4.040e+03, threshold=1.995e+03, percent-clipped=8.0 2023-06-25 23:35:54,467 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2094606.0, ans=0.2 2023-06-25 23:36:22,720 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2094726.0, ans=0.125 2023-06-25 23:36:52,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2094786.0, ans=0.125 2023-06-25 23:36:53,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-25 23:37:04,045 INFO [train.py:996] (2/4) Epoch 12, batch 13700, loss[loss=0.19, simple_loss=0.2498, pruned_loss=0.06511, over 15425.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3069, pruned_loss=0.07748, over 4273968.72 frames. ], batch size: 60, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:37:10,056 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2094846.0, ans=0.0 2023-06-25 23:37:24,032 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2094846.0, ans=0.2 2023-06-25 23:37:30,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=15.0 2023-06-25 23:38:21,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2095026.0, ans=0.125 2023-06-25 23:38:39,269 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.30 vs. limit=15.0 2023-06-25 23:38:42,412 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2095086.0, ans=0.09899494936611666 2023-06-25 23:39:04,380 INFO [train.py:996] (2/4) Epoch 12, batch 13750, loss[loss=0.1562, simple_loss=0.2196, pruned_loss=0.04637, over 21733.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3054, pruned_loss=0.07715, over 4269795.15 frames. ], batch size: 124, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:39:33,881 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.606e+02 1.009e+03 1.585e+03 2.865e+03 5.412e+03, threshold=3.169e+03, percent-clipped=34.0 2023-06-25 23:40:16,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2095326.0, ans=0.125 2023-06-25 23:40:54,907 INFO [train.py:996] (2/4) Epoch 12, batch 13800, loss[loss=0.2618, simple_loss=0.3853, pruned_loss=0.06915, over 19746.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3126, pruned_loss=0.07607, over 4262288.21 frames. 
], batch size: 702, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:40:57,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2095446.0, ans=0.125 2023-06-25 23:41:10,308 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2095446.0, ans=0.125 2023-06-25 23:41:39,936 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.52 vs. limit=15.0 2023-06-25 23:41:44,640 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.39 vs. limit=22.5 2023-06-25 23:42:29,307 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.45 vs. limit=15.0 2023-06-25 23:42:52,756 INFO [train.py:996] (2/4) Epoch 12, batch 13850, loss[loss=0.2732, simple_loss=0.3443, pruned_loss=0.1011, over 21803.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3192, pruned_loss=0.07678, over 4256747.38 frames. ], batch size: 282, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:43:19,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2095746.0, ans=0.5 2023-06-25 23:43:28,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.539e+02 1.083e+03 1.524e+03 2.080e+03 5.261e+03, threshold=3.047e+03, percent-clipped=9.0 2023-06-25 23:44:13,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2095926.0, ans=0.125 2023-06-25 23:44:31,880 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.20 vs. limit=12.0 2023-06-25 23:44:34,853 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2095986.0, ans=0.1 2023-06-25 23:44:42,286 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=9.31 vs. limit=15.0 2023-06-25 23:44:49,827 INFO [train.py:996] (2/4) Epoch 12, batch 13900, loss[loss=0.2532, simple_loss=0.3236, pruned_loss=0.0914, over 21676.00 frames. ], tot_loss[loss=0.2394, simple_loss=0.3211, pruned_loss=0.07881, over 4261417.66 frames. ], batch size: 389, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:45:11,705 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.41 vs. limit=15.0 2023-06-25 23:46:32,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2096286.0, ans=0.125 2023-06-25 23:46:35,822 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:46:38,611 INFO [train.py:996] (2/4) Epoch 12, batch 13950, loss[loss=0.2459, simple_loss=0.3163, pruned_loss=0.08769, over 21635.00 frames. ], tot_loss[loss=0.2415, simple_loss=0.3208, pruned_loss=0.08113, over 4268039.19 frames. 
], batch size: 230, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:46:44,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2096346.0, ans=0.2 2023-06-25 23:47:08,273 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.165e+02 9.219e+02 1.176e+03 2.067e+03 4.872e+03, threshold=2.352e+03, percent-clipped=8.0 2023-06-25 23:47:49,489 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2096526.0, ans=0.0 2023-06-25 23:47:50,343 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.81 vs. limit=15.0 2023-06-25 23:48:08,118 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2096586.0, ans=0.1 2023-06-25 23:48:25,189 INFO [train.py:996] (2/4) Epoch 12, batch 14000, loss[loss=0.1941, simple_loss=0.2851, pruned_loss=0.05162, over 21644.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3174, pruned_loss=0.07914, over 4262581.70 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:48:37,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2096646.0, ans=0.1 2023-06-25 23:48:43,286 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2096646.0, ans=0.125 2023-06-25 23:48:43,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2096646.0, ans=0.0 2023-06-25 23:49:29,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2096826.0, ans=0.2 2023-06-25 23:49:32,797 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=12.23 vs. limit=22.5 2023-06-25 23:50:12,365 INFO [train.py:996] (2/4) Epoch 12, batch 14050, loss[loss=0.2048, simple_loss=0.2743, pruned_loss=0.06765, over 21809.00 frames. ], tot_loss[loss=0.2305, simple_loss=0.3111, pruned_loss=0.07492, over 4266223.53 frames. ], batch size: 118, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:50:13,401 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.28 vs. limit=15.0 2023-06-25 23:50:39,213 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.20 vs. limit=10.0 2023-06-25 23:50:41,379 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.215e+02 8.595e+02 1.187e+03 1.606e+03 3.647e+03, threshold=2.374e+03, percent-clipped=9.0 2023-06-25 23:51:53,046 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-25 23:52:05,849 INFO [train.py:996] (2/4) Epoch 12, batch 14100, loss[loss=0.2209, simple_loss=0.2905, pruned_loss=0.07567, over 21644.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.3027, pruned_loss=0.07434, over 4262584.71 frames. 
], batch size: 298, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:52:12,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2097246.0, ans=0.125 2023-06-25 23:52:54,025 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2097366.0, ans=0.125 2023-06-25 23:53:25,147 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2097486.0, ans=0.125 2023-06-25 23:53:28,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2097486.0, ans=0.125 2023-06-25 23:53:40,202 INFO [train.py:996] (2/4) Epoch 12, batch 14150, loss[loss=0.235, simple_loss=0.3183, pruned_loss=0.07588, over 21419.00 frames. ], tot_loss[loss=0.2291, simple_loss=0.3067, pruned_loss=0.07574, over 4273824.66 frames. ], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:53:49,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2097546.0, ans=0.2 2023-06-25 23:54:11,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.56 vs. limit=15.0 2023-06-25 23:54:17,201 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.654e+02 9.098e+02 1.301e+03 1.898e+03 3.994e+03, threshold=2.602e+03, percent-clipped=15.0 2023-06-25 23:54:19,865 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.13 vs. limit=12.0 2023-06-25 23:54:33,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2097666.0, ans=0.1 2023-06-25 23:54:56,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2097726.0, ans=0.0 2023-06-25 23:54:56,160 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2097726.0, ans=0.0 2023-06-25 23:55:25,907 INFO [train.py:996] (2/4) Epoch 12, batch 14200, loss[loss=0.2114, simple_loss=0.2887, pruned_loss=0.0671, over 21331.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3068, pruned_loss=0.075, over 4265243.11 frames. ], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:56:14,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=11.83 vs. limit=15.0 2023-06-25 23:56:21,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2097966.0, ans=0.2 2023-06-25 23:56:32,875 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2098026.0, ans=0.1 2023-06-25 23:57:10,465 INFO [train.py:996] (2/4) Epoch 12, batch 14250, loss[loss=0.1903, simple_loss=0.2628, pruned_loss=0.05883, over 21569.00 frames. ], tot_loss[loss=0.2243, simple_loss=0.2997, pruned_loss=0.0744, over 4261074.64 frames. 
], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-25 23:57:53,449 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.950e+02 7.589e+02 1.025e+03 1.608e+03 3.154e+03, threshold=2.050e+03, percent-clipped=3.0 2023-06-25 23:58:03,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.85 vs. limit=15.0 2023-06-25 23:58:09,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2098266.0, ans=0.1 2023-06-25 23:58:20,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2098326.0, ans=0.2 2023-06-25 23:58:39,248 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=14.60 vs. limit=15.0 2023-06-25 23:59:01,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2098386.0, ans=0.2 2023-06-25 23:59:06,531 INFO [train.py:996] (2/4) Epoch 12, batch 14300, loss[loss=0.2688, simple_loss=0.3589, pruned_loss=0.08935, over 21617.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.3011, pruned_loss=0.07237, over 4260570.07 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 8.0 2023-06-25 23:59:22,641 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=6.54 vs. limit=10.0 2023-06-25 23:59:43,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2098506.0, ans=0.125 2023-06-26 00:00:25,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.32 vs. limit=22.5 2023-06-26 00:00:25,364 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.05 vs. limit=10.0 2023-06-26 00:00:57,835 INFO [train.py:996] (2/4) Epoch 12, batch 14350, loss[loss=0.212, simple_loss=0.2862, pruned_loss=0.06885, over 21458.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3086, pruned_loss=0.07383, over 4241605.67 frames. 
], batch size: 194, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:01:35,786 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.907e+02 9.524e+02 1.577e+03 2.595e+03 6.111e+03, threshold=3.154e+03, percent-clipped=35.0 2023-06-26 00:01:43,502 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2098866.0, ans=0.125 2023-06-26 00:02:02,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2098926.0, ans=0.125 2023-06-26 00:02:22,373 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2098926.0, ans=0.2 2023-06-26 00:02:33,589 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2098986.0, ans=0.05 2023-06-26 00:02:35,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2098986.0, ans=0.125 2023-06-26 00:02:52,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2099046.0, ans=0.125 2023-06-26 00:02:53,133 INFO [train.py:996] (2/4) Epoch 12, batch 14400, loss[loss=0.2256, simple_loss=0.2976, pruned_loss=0.07683, over 21820.00 frames. ], tot_loss[loss=0.2294, simple_loss=0.3077, pruned_loss=0.07552, over 4253464.14 frames. ], batch size: 316, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:02:53,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2099046.0, ans=0.125 2023-06-26 00:02:56,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2099046.0, ans=0.125 2023-06-26 00:03:17,493 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.87 vs. limit=15.0 2023-06-26 00:04:30,952 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.02 vs. limit=15.0 2023-06-26 00:04:38,676 INFO [train.py:996] (2/4) Epoch 12, batch 14450, loss[loss=0.2028, simple_loss=0.2646, pruned_loss=0.07049, over 21528.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3022, pruned_loss=0.07647, over 4263350.08 frames. ], batch size: 212, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:04:47,802 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2099346.0, ans=0.125 2023-06-26 00:05:06,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2099406.0, ans=0.125 2023-06-26 00:05:10,384 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.479e+02 8.096e+02 1.121e+03 1.806e+03 4.464e+03, threshold=2.243e+03, percent-clipped=9.0 2023-06-26 00:05:50,147 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.35 vs. limit=22.5 2023-06-26 00:05:58,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2099586.0, ans=0.0 2023-06-26 00:06:30,374 INFO [train.py:996] (2/4) Epoch 12, batch 14500, loss[loss=0.2072, simple_loss=0.284, pruned_loss=0.06519, over 21773.00 frames. 
], tot_loss[loss=0.225, simple_loss=0.2979, pruned_loss=0.07602, over 4272998.34 frames. ], batch size: 98, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:07:08,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2099706.0, ans=0.04949747468305833 2023-06-26 00:07:22,121 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2099766.0, ans=0.0 2023-06-26 00:07:45,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2099826.0, ans=0.1 2023-06-26 00:07:48,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2099826.0, ans=0.125 2023-06-26 00:08:10,079 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.56 vs. limit=15.0 2023-06-26 00:08:26,785 INFO [train.py:996] (2/4) Epoch 12, batch 14550, loss[loss=0.2929, simple_loss=0.359, pruned_loss=0.1134, over 21434.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3016, pruned_loss=0.07741, over 4274957.44 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:08:34,794 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.44 vs. limit=6.0 2023-06-26 00:08:41,612 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2099946.0, ans=0.125 2023-06-26 00:08:46,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2100006.0, ans=0.0 2023-06-26 00:08:52,864 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.442e+02 9.003e+02 1.476e+03 2.430e+03 4.476e+03, threshold=2.953e+03, percent-clipped=26.0 2023-06-26 00:09:28,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2100126.0, ans=0.125 2023-06-26 00:10:17,051 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.41 vs. limit=22.5 2023-06-26 00:10:17,564 INFO [train.py:996] (2/4) Epoch 12, batch 14600, loss[loss=0.2637, simple_loss=0.3454, pruned_loss=0.09097, over 21744.00 frames. ], tot_loss[loss=0.2371, simple_loss=0.3105, pruned_loss=0.08189, over 4279122.05 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:10:43,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2100306.0, ans=0.1 2023-06-26 00:10:53,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2100366.0, ans=0.0 2023-06-26 00:12:06,926 INFO [train.py:996] (2/4) Epoch 12, batch 14650, loss[loss=0.2249, simple_loss=0.3008, pruned_loss=0.07448, over 21264.00 frames. ], tot_loss[loss=0.2387, simple_loss=0.3143, pruned_loss=0.08161, over 4261391.17 frames. 
], batch size: 159, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:12:07,560 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2100546.0, ans=0.125 2023-06-26 00:12:32,488 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.972e+02 9.038e+02 1.259e+03 1.802e+03 4.365e+03, threshold=2.519e+03, percent-clipped=6.0 2023-06-26 00:12:38,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2100606.0, ans=0.125 2023-06-26 00:12:48,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2100666.0, ans=0.125 2023-06-26 00:13:47,228 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.46 vs. limit=6.0 2023-06-26 00:13:57,525 INFO [train.py:996] (2/4) Epoch 12, batch 14700, loss[loss=0.2105, simple_loss=0.2894, pruned_loss=0.06582, over 21274.00 frames. ], tot_loss[loss=0.2306, simple_loss=0.3091, pruned_loss=0.07609, over 4247531.68 frames. ], batch size: 176, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:14:06,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2100846.0, ans=0.125 2023-06-26 00:14:14,258 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2100906.0, ans=0.125 2023-06-26 00:14:48,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2100966.0, ans=0.125 2023-06-26 00:14:48,983 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.04 vs. limit=12.0 2023-06-26 00:15:49,413 INFO [train.py:996] (2/4) Epoch 12, batch 14750, loss[loss=0.3321, simple_loss=0.3991, pruned_loss=0.1326, over 21482.00 frames. ], tot_loss[loss=0.2356, simple_loss=0.3144, pruned_loss=0.07836, over 4250689.66 frames. ], batch size: 471, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:15:51,834 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2101146.0, ans=0.125 2023-06-26 00:16:07,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2101206.0, ans=0.0 2023-06-26 00:16:21,758 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.018e+02 8.627e+02 1.543e+03 2.178e+03 4.695e+03, threshold=3.085e+03, percent-clipped=15.0 2023-06-26 00:16:41,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2101266.0, ans=0.125 2023-06-26 00:16:50,044 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.71 vs. limit=15.0 2023-06-26 00:16:55,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2101266.0, ans=0.1 2023-06-26 00:17:17,400 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2101326.0, ans=0.125 2023-06-26 00:17:26,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.04 vs. 
limit=6.0 2023-06-26 00:17:28,148 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.17 vs. limit=22.5 2023-06-26 00:17:28,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2101386.0, ans=0.125 2023-06-26 00:17:40,011 INFO [train.py:996] (2/4) Epoch 12, batch 14800, loss[loss=0.2179, simple_loss=0.283, pruned_loss=0.07638, over 21117.00 frames. ], tot_loss[loss=0.2474, simple_loss=0.3271, pruned_loss=0.08389, over 4256808.52 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 32.0 2023-06-26 00:17:47,376 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2101446.0, ans=0.0 2023-06-26 00:17:55,517 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2101506.0, ans=0.0 2023-06-26 00:17:56,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.63 vs. limit=10.0 2023-06-26 00:18:40,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2101566.0, ans=0.0 2023-06-26 00:19:07,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2101626.0, ans=0.125 2023-06-26 00:19:18,669 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=9.88 vs. limit=10.0 2023-06-26 00:19:32,068 INFO [train.py:996] (2/4) Epoch 12, batch 14850, loss[loss=0.2119, simple_loss=0.2744, pruned_loss=0.07469, over 21102.00 frames. ], tot_loss[loss=0.2435, simple_loss=0.3205, pruned_loss=0.0832, over 4250542.24 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:19:59,186 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2101806.0, ans=0.1 2023-06-26 00:20:16,291 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.961e+02 9.941e+02 1.371e+03 2.279e+03 6.206e+03, threshold=2.743e+03, percent-clipped=9.0 2023-06-26 00:20:30,271 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2101866.0, ans=0.125 2023-06-26 00:20:54,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2101926.0, ans=0.125 2023-06-26 00:20:58,424 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2101926.0, ans=0.125 2023-06-26 00:21:19,485 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2101986.0, ans=0.125 2023-06-26 00:21:23,446 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=5.07 vs. limit=8.0 2023-06-26 00:21:32,280 INFO [train.py:996] (2/4) Epoch 12, batch 14900, loss[loss=0.2274, simple_loss=0.3033, pruned_loss=0.07576, over 21523.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3211, pruned_loss=0.08344, over 4256162.77 frames. 
], batch size: 194, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:22:17,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2102166.0, ans=0.0 2023-06-26 00:23:09,362 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.32 vs. limit=6.0 2023-06-26 00:23:29,196 INFO [train.py:996] (2/4) Epoch 12, batch 14950, loss[loss=0.2035, simple_loss=0.2959, pruned_loss=0.05557, over 21891.00 frames. ], tot_loss[loss=0.2443, simple_loss=0.3219, pruned_loss=0.08334, over 4262473.07 frames. ], batch size: 317, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:23:30,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2102346.0, ans=0.2 2023-06-26 00:23:56,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2102406.0, ans=0.0 2023-06-26 00:24:01,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2102406.0, ans=0.1 2023-06-26 00:24:02,299 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.432e+02 8.420e+02 1.194e+03 1.609e+03 3.792e+03, threshold=2.388e+03, percent-clipped=5.0 2023-06-26 00:24:15,499 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=12.0 2023-06-26 00:24:26,998 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2102526.0, ans=0.0 2023-06-26 00:24:58,613 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=12.0 2023-06-26 00:25:04,822 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2102586.0, ans=0.2 2023-06-26 00:25:16,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2102586.0, ans=0.1 2023-06-26 00:25:18,775 INFO [train.py:996] (2/4) Epoch 12, batch 15000, loss[loss=0.2097, simple_loss=0.2834, pruned_loss=0.06801, over 21674.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3226, pruned_loss=0.08399, over 4267290.86 frames. ], batch size: 263, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:25:18,775 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 00:25:42,271 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2582, simple_loss=0.348, pruned_loss=0.08425, over 1796401.00 frames. 
2023-06-26 00:25:42,272 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-26 00:26:01,362 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer_ff3.min_abs, batch_count=2102706.0, ans=0.2 2023-06-26 00:26:11,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2102706.0, ans=0.125 2023-06-26 00:26:13,352 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2102706.0, ans=0.125 2023-06-26 00:26:22,104 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2102766.0, ans=0.0 2023-06-26 00:26:22,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2102766.0, ans=0.1 2023-06-26 00:27:29,932 INFO [train.py:996] (2/4) Epoch 12, batch 15050, loss[loss=0.2354, simple_loss=0.3193, pruned_loss=0.07569, over 21594.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3219, pruned_loss=0.08426, over 4266171.52 frames. ], batch size: 230, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:27:51,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2103006.0, ans=0.0 2023-06-26 00:27:56,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.042e+02 9.488e+02 1.374e+03 1.981e+03 5.080e+03, threshold=2.749e+03, percent-clipped=16.0 2023-06-26 00:29:16,837 INFO [train.py:996] (2/4) Epoch 12, batch 15100, loss[loss=0.2415, simple_loss=0.3352, pruned_loss=0.07387, over 19897.00 frames. ], tot_loss[loss=0.2462, simple_loss=0.3244, pruned_loss=0.08398, over 4265566.86 frames. ], batch size: 703, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:29:22,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2103246.0, ans=0.04949747468305833 2023-06-26 00:30:51,764 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:31:07,973 INFO [train.py:996] (2/4) Epoch 12, batch 15150, loss[loss=0.238, simple_loss=0.2972, pruned_loss=0.08946, over 21814.00 frames. ], tot_loss[loss=0.2438, simple_loss=0.3199, pruned_loss=0.08391, over 4257380.86 frames. ], batch size: 372, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:31:22,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2103546.0, ans=0.125 2023-06-26 00:31:32,382 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.67 vs. limit=15.0 2023-06-26 00:31:44,815 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.420e+02 7.677e+02 1.025e+03 1.452e+03 2.770e+03, threshold=2.050e+03, percent-clipped=1.0 2023-06-26 00:32:18,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2103726.0, ans=0.125 2023-06-26 00:32:42,981 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 00:32:57,473 INFO [train.py:996] (2/4) Epoch 12, batch 15200, loss[loss=0.1965, simple_loss=0.2717, pruned_loss=0.06066, over 21145.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3103, pruned_loss=0.07989, over 4262984.94 frames. 
], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:33:06,059 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2103846.0, ans=0.125 2023-06-26 00:33:07,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2103846.0, ans=0.2 2023-06-26 00:34:45,777 INFO [train.py:996] (2/4) Epoch 12, batch 15250, loss[loss=0.2456, simple_loss=0.3075, pruned_loss=0.09183, over 21187.00 frames. ], tot_loss[loss=0.2312, simple_loss=0.3052, pruned_loss=0.07857, over 4255386.45 frames. ], batch size: 143, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:35:06,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2104206.0, ans=0.1 2023-06-26 00:35:12,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2104206.0, ans=0.0 2023-06-26 00:35:33,161 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.968e+02 8.853e+02 1.364e+03 1.967e+03 5.293e+03, threshold=2.727e+03, percent-clipped=20.0 2023-06-26 00:35:59,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2104266.0, ans=0.1 2023-06-26 00:36:10,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2104326.0, ans=0.0 2023-06-26 00:36:20,154 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2104386.0, ans=0.125 2023-06-26 00:36:35,473 INFO [train.py:996] (2/4) Epoch 12, batch 15300, loss[loss=0.2273, simple_loss=0.3097, pruned_loss=0.07247, over 21656.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3075, pruned_loss=0.08108, over 4261768.45 frames. ], batch size: 351, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:37:25,052 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2104566.0, ans=0.1 2023-06-26 00:37:49,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2104626.0, ans=0.125 2023-06-26 00:37:51,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2104626.0, ans=0.2 2023-06-26 00:38:22,891 INFO [train.py:996] (2/4) Epoch 12, batch 15350, loss[loss=0.2409, simple_loss=0.3327, pruned_loss=0.07456, over 21449.00 frames. ], tot_loss[loss=0.24, simple_loss=0.3131, pruned_loss=0.08346, over 4263821.13 frames. ], batch size: 194, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:38:39,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.89 vs. limit=12.0 2023-06-26 00:39:07,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2104806.0, ans=10.0 2023-06-26 00:39:10,398 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.965e+02 8.644e+02 1.167e+03 1.665e+03 4.882e+03, threshold=2.334e+03, percent-clipped=9.0 2023-06-26 00:39:16,602 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.64 vs. 
limit=12.0 2023-06-26 00:39:25,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2104866.0, ans=0.0 2023-06-26 00:39:40,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer1.prob, batch_count=2104926.0, ans=0.125 2023-06-26 00:39:59,892 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2104986.0, ans=0.0 2023-06-26 00:40:09,325 INFO [train.py:996] (2/4) Epoch 12, batch 15400, loss[loss=0.1911, simple_loss=0.2803, pruned_loss=0.05099, over 21918.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3154, pruned_loss=0.08254, over 4271626.68 frames. ], batch size: 316, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:41:15,716 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2105166.0, ans=0.0 2023-06-26 00:41:58,377 INFO [train.py:996] (2/4) Epoch 12, batch 15450, loss[loss=0.2125, simple_loss=0.3008, pruned_loss=0.06214, over 21820.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3127, pruned_loss=0.08159, over 4268488.52 frames. ], batch size: 332, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:42:27,336 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=2105406.0, ans=15.0 2023-06-26 00:42:40,787 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.072e+02 7.775e+02 1.085e+03 1.677e+03 3.243e+03, threshold=2.170e+03, percent-clipped=5.0 2023-06-26 00:42:41,270 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2105406.0, ans=0.125 2023-06-26 00:43:08,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2105466.0, ans=0.04949747468305833 2023-06-26 00:43:38,002 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.77 vs. limit=6.0 2023-06-26 00:43:47,776 INFO [train.py:996] (2/4) Epoch 12, batch 15500, loss[loss=0.3049, simple_loss=0.3751, pruned_loss=0.1173, over 21803.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3155, pruned_loss=0.08119, over 4263998.25 frames. ], batch size: 441, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:44:14,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2105706.0, ans=0.07 2023-06-26 00:45:00,795 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2105766.0, ans=0.0 2023-06-26 00:45:03,968 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2105826.0, ans=0.015 2023-06-26 00:45:23,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2105886.0, ans=0.125 2023-06-26 00:45:25,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2105886.0, ans=0.0 2023-06-26 00:45:39,680 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.60 vs. limit=10.0 2023-06-26 00:45:43,191 INFO [train.py:996] (2/4) Epoch 12, batch 15550, loss[loss=0.2407, simple_loss=0.3511, pruned_loss=0.06515, over 19772.00 frames. 
], tot_loss[loss=0.2368, simple_loss=0.3144, pruned_loss=0.07967, over 4270756.22 frames. ], batch size: 703, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:46:08,535 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2106006.0, ans=0.125 2023-06-26 00:46:33,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2106006.0, ans=0.0 2023-06-26 00:46:34,336 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.226e+02 1.052e+03 1.309e+03 1.870e+03 4.327e+03, threshold=2.618e+03, percent-clipped=17.0 2023-06-26 00:46:36,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2106066.0, ans=0.125 2023-06-26 00:47:41,698 INFO [train.py:996] (2/4) Epoch 12, batch 15600, loss[loss=0.1931, simple_loss=0.2626, pruned_loss=0.06184, over 21611.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3084, pruned_loss=0.07783, over 4270622.52 frames. ], batch size: 298, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:48:31,889 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.33 vs. limit=15.0 2023-06-26 00:48:42,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2106426.0, ans=0.125 2023-06-26 00:48:56,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.08 vs. limit=10.0 2023-06-26 00:48:57,615 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2106486.0, ans=0.125 2023-06-26 00:48:57,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2106486.0, ans=0.0 2023-06-26 00:49:02,207 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2106486.0, ans=0.2 2023-06-26 00:49:20,601 INFO [train.py:996] (2/4) Epoch 12, batch 15650, loss[loss=0.2226, simple_loss=0.2889, pruned_loss=0.07821, over 21766.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3072, pruned_loss=0.07713, over 4264150.60 frames. ], batch size: 124, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 00:49:32,137 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2106546.0, ans=0.2 2023-06-26 00:50:07,696 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.319e+02 9.009e+02 1.257e+03 1.963e+03 4.655e+03, threshold=2.515e+03, percent-clipped=11.0 2023-06-26 00:50:35,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2106726.0, ans=0.025 2023-06-26 00:51:12,648 INFO [train.py:996] (2/4) Epoch 12, batch 15700, loss[loss=0.2385, simple_loss=0.3177, pruned_loss=0.0796, over 21764.00 frames. ], tot_loss[loss=0.2288, simple_loss=0.3049, pruned_loss=0.07638, over 4262721.59 frames. 
], batch size: 351, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:51:14,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2106846.0, ans=0.125 2023-06-26 00:52:03,476 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2106966.0, ans=0.5 2023-06-26 00:52:08,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2106966.0, ans=0.125 2023-06-26 00:52:10,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2106966.0, ans=0.125 2023-06-26 00:52:55,628 INFO [train.py:996] (2/4) Epoch 12, batch 15750, loss[loss=0.2076, simple_loss=0.2827, pruned_loss=0.06625, over 21637.00 frames. ], tot_loss[loss=0.2258, simple_loss=0.3004, pruned_loss=0.07564, over 4269051.70 frames. ], batch size: 298, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:53:02,678 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2107146.0, ans=0.025 2023-06-26 00:53:38,552 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.389e+02 8.722e+02 1.383e+03 2.030e+03 4.451e+03, threshold=2.767e+03, percent-clipped=16.0 2023-06-26 00:54:41,607 INFO [train.py:996] (2/4) Epoch 12, batch 15800, loss[loss=0.2359, simple_loss=0.3007, pruned_loss=0.08558, over 19989.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2949, pruned_loss=0.07519, over 4267087.93 frames. ], batch size: 702, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:54:42,016 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2107446.0, ans=0.125 2023-06-26 00:55:12,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2107506.0, ans=0.2 2023-06-26 00:55:14,468 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.46 vs. limit=12.0 2023-06-26 00:55:27,344 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2107566.0, ans=0.125 2023-06-26 00:55:27,385 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2107566.0, ans=0.125 2023-06-26 00:55:43,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2107626.0, ans=0.1 2023-06-26 00:56:01,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2107686.0, ans=0.07 2023-06-26 00:56:30,891 INFO [train.py:996] (2/4) Epoch 12, batch 15850, loss[loss=0.2291, simple_loss=0.2919, pruned_loss=0.08309, over 21226.00 frames. ], tot_loss[loss=0.2253, simple_loss=0.2963, pruned_loss=0.07718, over 4261771.83 frames. 
], batch size: 176, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:57:00,942 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2107806.0, ans=0.125 2023-06-26 00:57:07,555 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2107806.0, ans=0.0 2023-06-26 00:57:10,177 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.897e+02 9.764e+02 1.470e+03 2.269e+03 4.632e+03, threshold=2.939e+03, percent-clipped=9.0 2023-06-26 00:57:31,301 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2107926.0, ans=0.1 2023-06-26 00:57:56,565 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2107986.0, ans=0.2 2023-06-26 00:58:14,250 INFO [train.py:996] (2/4) Epoch 12, batch 15900, loss[loss=0.2522, simple_loss=0.3169, pruned_loss=0.09382, over 21323.00 frames. ], tot_loss[loss=0.2241, simple_loss=0.2934, pruned_loss=0.07743, over 4264996.38 frames. ], batch size: 507, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 00:58:39,229 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2108106.0, ans=0.025 2023-06-26 00:59:16,288 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.76 vs. limit=22.5 2023-06-26 00:59:18,155 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.28 vs. limit=6.0 2023-06-26 00:59:56,243 INFO [train.py:996] (2/4) Epoch 12, batch 15950, loss[loss=0.1772, simple_loss=0.2749, pruned_loss=0.03973, over 21860.00 frames. ], tot_loss[loss=0.2232, simple_loss=0.2956, pruned_loss=0.07538, over 4250479.80 frames. ], batch size: 316, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 01:00:16,507 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2108346.0, ans=0.125 2023-06-26 01:00:16,519 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2108346.0, ans=0.125 2023-06-26 01:00:19,081 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.70 vs. limit=15.0 2023-06-26 01:00:33,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2108406.0, ans=0.0 2023-06-26 01:00:41,310 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.627e+02 7.587e+02 1.024e+03 1.341e+03 2.731e+03, threshold=2.049e+03, percent-clipped=0.0 2023-06-26 01:00:41,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2108466.0, ans=0.125 2023-06-26 01:01:32,787 INFO [train.py:996] (2/4) Epoch 12, batch 16000, loss[loss=0.217, simple_loss=0.3026, pruned_loss=0.06571, over 21782.00 frames. ], tot_loss[loss=0.2209, simple_loss=0.2969, pruned_loss=0.07245, over 4251003.65 frames. 
], batch size: 247, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:02:18,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2108706.0, ans=0.0 2023-06-26 01:02:45,800 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2108826.0, ans=0.1 2023-06-26 01:02:51,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2108826.0, ans=0.0 2023-06-26 01:03:05,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2108886.0, ans=0.035 2023-06-26 01:03:12,704 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2108886.0, ans=0.125 2023-06-26 01:03:26,078 INFO [train.py:996] (2/4) Epoch 12, batch 16050, loss[loss=0.2512, simple_loss=0.3578, pruned_loss=0.07225, over 21639.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2993, pruned_loss=0.07064, over 4262554.03 frames. ], batch size: 414, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:03:56,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2109006.0, ans=0.125 2023-06-26 01:04:15,700 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2109006.0, ans=0.0 2023-06-26 01:04:18,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.830e+02 8.237e+02 1.456e+03 2.962e+03 6.704e+03, threshold=2.913e+03, percent-clipped=34.0 2023-06-26 01:04:30,601 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2109066.0, ans=0.125 2023-06-26 01:04:47,736 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2109126.0, ans=0.05 2023-06-26 01:05:15,998 INFO [train.py:996] (2/4) Epoch 12, batch 16100, loss[loss=0.2164, simple_loss=0.282, pruned_loss=0.07544, over 21236.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.3035, pruned_loss=0.07297, over 4263570.55 frames. ], batch size: 608, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:05:25,521 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.67 vs. limit=15.0 2023-06-26 01:05:41,386 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2109306.0, ans=0.0 2023-06-26 01:06:16,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2109366.0, ans=0.125 2023-06-26 01:06:47,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2109486.0, ans=0.125 2023-06-26 01:07:03,540 INFO [train.py:996] (2/4) Epoch 12, batch 16150, loss[loss=0.253, simple_loss=0.3419, pruned_loss=0.08208, over 17617.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3031, pruned_loss=0.07546, over 4274546.15 frames. 
], batch size: 60, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:07:55,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2109606.0, ans=0.0 2023-06-26 01:07:58,374 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.216e+02 8.957e+02 1.137e+03 1.849e+03 5.347e+03, threshold=2.275e+03, percent-clipped=8.0 2023-06-26 01:08:04,133 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2109666.0, ans=0.0 2023-06-26 01:08:38,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2109786.0, ans=0.125 2023-06-26 01:08:56,278 INFO [train.py:996] (2/4) Epoch 12, batch 16200, loss[loss=0.3105, simple_loss=0.375, pruned_loss=0.123, over 21454.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3108, pruned_loss=0.07771, over 4277681.63 frames. ], batch size: 131, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:09:38,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2109906.0, ans=0.0 2023-06-26 01:09:42,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2109906.0, ans=0.2 2023-06-26 01:10:04,050 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2109966.0, ans=0.125 2023-06-26 01:10:47,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2110086.0, ans=0.95 2023-06-26 01:10:52,256 INFO [train.py:996] (2/4) Epoch 12, batch 16250, loss[loss=0.1882, simple_loss=0.2667, pruned_loss=0.05487, over 21381.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3123, pruned_loss=0.07812, over 4272043.26 frames. ], batch size: 211, lr: 2.42e-03, grad_scale: 16.0 2023-06-26 01:10:56,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2110146.0, ans=0.125 2023-06-26 01:11:31,329 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.426e+02 1.023e+03 1.432e+03 2.136e+03 5.202e+03, threshold=2.864e+03, percent-clipped=19.0 2023-06-26 01:11:42,170 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2110266.0, ans=0.0 2023-06-26 01:11:54,183 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.07 vs. limit=22.5 2023-06-26 01:11:57,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2110326.0, ans=0.125 2023-06-26 01:12:34,519 INFO [train.py:996] (2/4) Epoch 12, batch 16300, loss[loss=0.1886, simple_loss=0.2716, pruned_loss=0.05285, over 21705.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.3046, pruned_loss=0.07388, over 4264527.55 frames. 
], batch size: 351, lr: 2.42e-03, grad_scale: 8.0 2023-06-26 01:13:18,663 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2110566.0, ans=0.95 2023-06-26 01:13:29,158 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2110566.0, ans=0.1 2023-06-26 01:13:52,029 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.65 vs. limit=12.0 2023-06-26 01:13:54,294 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2110626.0, ans=0.1 2023-06-26 01:14:21,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2110686.0, ans=0.125 2023-06-26 01:14:22,129 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.34 vs. limit=15.0 2023-06-26 01:14:25,774 INFO [train.py:996] (2/4) Epoch 12, batch 16350, loss[loss=0.2439, simple_loss=0.3183, pruned_loss=0.08471, over 21701.00 frames. ], tot_loss[loss=0.2245, simple_loss=0.302, pruned_loss=0.07353, over 4264939.93 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:14:37,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2110746.0, ans=0.1 2023-06-26 01:15:07,545 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.769e+02 8.768e+02 1.143e+03 1.640e+03 3.455e+03, threshold=2.286e+03, percent-clipped=4.0 2023-06-26 01:15:25,771 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-26 01:15:38,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2110926.0, ans=0.1 2023-06-26 01:15:56,386 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:15:59,677 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2110986.0, ans=0.125 2023-06-26 01:16:21,199 INFO [train.py:996] (2/4) Epoch 12, batch 16400, loss[loss=0.2281, simple_loss=0.2965, pruned_loss=0.07988, over 21558.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3063, pruned_loss=0.07605, over 4270533.62 frames. ], batch size: 548, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:17:30,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2111226.0, ans=0.2 2023-06-26 01:18:04,804 INFO [train.py:996] (2/4) Epoch 12, batch 16450, loss[loss=0.208, simple_loss=0.2899, pruned_loss=0.06308, over 21509.00 frames. ], tot_loss[loss=0.231, simple_loss=0.307, pruned_loss=0.0775, over 4278357.62 frames. ], batch size: 211, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:18:23,977 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.50 vs. 
limit=15.0 2023-06-26 01:18:28,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2111406.0, ans=0.125 2023-06-26 01:18:39,419 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.567e+02 7.458e+02 1.043e+03 1.518e+03 3.613e+03, threshold=2.086e+03, percent-clipped=5.0 2023-06-26 01:19:28,520 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2111586.0, ans=0.0 2023-06-26 01:19:52,481 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.49 vs. limit=22.5 2023-06-26 01:19:54,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2111646.0, ans=0.125 2023-06-26 01:19:54,986 INFO [train.py:996] (2/4) Epoch 12, batch 16500, loss[loss=0.2192, simple_loss=0.3278, pruned_loss=0.05525, over 20809.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3055, pruned_loss=0.07716, over 4278190.93 frames. ], batch size: 608, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:20:27,613 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2111706.0, ans=0.125 2023-06-26 01:21:18,901 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.41 vs. limit=6.0 2023-06-26 01:21:18,972 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.18 vs. limit=22.5 2023-06-26 01:21:35,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2111886.0, ans=0.2 2023-06-26 01:21:37,316 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.74 vs. limit=22.5 2023-06-26 01:21:46,616 INFO [train.py:996] (2/4) Epoch 12, batch 16550, loss[loss=0.2452, simple_loss=0.3311, pruned_loss=0.07963, over 21652.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.303, pruned_loss=0.07415, over 4271866.41 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:21:48,310 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.34 vs. limit=22.5 2023-06-26 01:22:28,370 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.15 vs. limit=22.5 2023-06-26 01:22:28,604 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.316e+02 1.128e+03 1.805e+03 2.872e+03 7.168e+03, threshold=3.610e+03, percent-clipped=40.0 2023-06-26 01:22:31,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2112066.0, ans=10.0 2023-06-26 01:23:29,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2112186.0, ans=0.125 2023-06-26 01:23:39,447 INFO [train.py:996] (2/4) Epoch 12, batch 16600, loss[loss=0.2807, simple_loss=0.3776, pruned_loss=0.09193, over 21923.00 frames. ], tot_loss[loss=0.2323, simple_loss=0.3106, pruned_loss=0.07705, over 4265509.31 frames. 
], batch size: 317, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:24:09,769 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2112306.0, ans=10.0 2023-06-26 01:24:11,468 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2112306.0, ans=0.1 2023-06-26 01:24:49,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2112366.0, ans=0.2 2023-06-26 01:24:57,566 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2112426.0, ans=0.125 2023-06-26 01:24:59,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2112426.0, ans=0.0 2023-06-26 01:25:29,299 INFO [train.py:996] (2/4) Epoch 12, batch 16650, loss[loss=0.2686, simple_loss=0.3454, pruned_loss=0.09596, over 21477.00 frames. ], tot_loss[loss=0.2408, simple_loss=0.321, pruned_loss=0.08033, over 4260906.67 frames. ], batch size: 211, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:26:25,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2112606.0, ans=0.125 2023-06-26 01:26:29,259 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.918e+02 9.386e+02 1.441e+03 2.110e+03 3.541e+03, threshold=2.881e+03, percent-clipped=0.0 2023-06-26 01:26:35,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2112666.0, ans=0.2 2023-06-26 01:26:35,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2112666.0, ans=0.0 2023-06-26 01:27:17,463 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2112786.0, ans=0.04949747468305833 2023-06-26 01:27:27,179 INFO [train.py:996] (2/4) Epoch 12, batch 16700, loss[loss=0.2914, simple_loss=0.3773, pruned_loss=0.1028, over 21514.00 frames. ], tot_loss[loss=0.2428, simple_loss=0.3232, pruned_loss=0.08123, over 4256404.86 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:27:31,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2112846.0, ans=0.0 2023-06-26 01:27:40,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2112846.0, ans=0.0 2023-06-26 01:28:04,305 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.46 vs. limit=22.5 2023-06-26 01:29:38,435 INFO [train.py:996] (2/4) Epoch 12, batch 16750, loss[loss=0.237, simple_loss=0.3306, pruned_loss=0.07167, over 19987.00 frames. ], tot_loss[loss=0.2469, simple_loss=0.3258, pruned_loss=0.08403, over 4254460.70 frames. 
], batch size: 704, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:29:50,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2113146.0, ans=0.2 2023-06-26 01:29:51,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten.whitening_limit, batch_count=2113146.0, ans=15.0 2023-06-26 01:29:56,408 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2113206.0, ans=0.1 2023-06-26 01:30:21,351 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.824e+02 1.154e+03 1.795e+03 2.444e+03 4.443e+03, threshold=3.590e+03, percent-clipped=18.0 2023-06-26 01:30:45,078 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2113326.0, ans=0.2 2023-06-26 01:31:30,353 INFO [train.py:996] (2/4) Epoch 12, batch 16800, loss[loss=0.2912, simple_loss=0.3615, pruned_loss=0.1104, over 21619.00 frames. ], tot_loss[loss=0.2485, simple_loss=0.3289, pruned_loss=0.08403, over 4254198.00 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:31:39,749 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2113446.0, ans=0.125 2023-06-26 01:32:52,136 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.63 vs. limit=22.5 2023-06-26 01:33:07,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2113686.0, ans=0.125 2023-06-26 01:33:18,837 INFO [train.py:996] (2/4) Epoch 12, batch 16850, loss[loss=0.2354, simple_loss=0.3079, pruned_loss=0.08149, over 21485.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3253, pruned_loss=0.08389, over 4261117.17 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:33:45,858 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=17.65 vs. limit=15.0 2023-06-26 01:34:00,635 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.998e+02 1.125e+03 1.896e+03 2.663e+03 4.313e+03, threshold=3.792e+03, percent-clipped=10.0 2023-06-26 01:34:04,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2113866.0, ans=0.05 2023-06-26 01:34:08,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2113866.0, ans=0.1 2023-06-26 01:34:21,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2113926.0, ans=0.0 2023-06-26 01:34:22,962 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2113926.0, ans=0.125 2023-06-26 01:34:56,035 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2113986.0, ans=0.125 2023-06-26 01:34:56,564 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.38 vs. 
limit=15.0 2023-06-26 01:35:07,207 INFO [train.py:996] (2/4) Epoch 12, batch 16900, loss[loss=0.2128, simple_loss=0.2784, pruned_loss=0.07359, over 21840.00 frames. ], tot_loss[loss=0.2418, simple_loss=0.3194, pruned_loss=0.08214, over 4262833.71 frames. ], batch size: 102, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:35:14,754 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=8.68 vs. limit=15.0 2023-06-26 01:35:17,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2114046.0, ans=0.2 2023-06-26 01:35:27,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2114106.0, ans=0.0 2023-06-26 01:35:43,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-26 01:35:51,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2114166.0, ans=0.125 2023-06-26 01:36:20,953 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.02 vs. limit=15.0 2023-06-26 01:36:51,873 INFO [train.py:996] (2/4) Epoch 12, batch 16950, loss[loss=0.2011, simple_loss=0.271, pruned_loss=0.06559, over 21860.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3115, pruned_loss=0.08005, over 4267780.25 frames. ], batch size: 98, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:37:20,484 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2114406.0, ans=0.2 2023-06-26 01:37:25,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.attention_skip_rate, batch_count=2114406.0, ans=0.0 2023-06-26 01:37:33,324 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.589e+02 8.373e+02 1.013e+03 1.291e+03 3.071e+03, threshold=2.026e+03, percent-clipped=0.0 2023-06-26 01:37:44,372 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:37:56,537 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2114526.0, ans=0.125 2023-06-26 01:38:09,839 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2114526.0, ans=0.125 2023-06-26 01:38:41,255 INFO [train.py:996] (2/4) Epoch 12, batch 17000, loss[loss=0.196, simple_loss=0.2605, pruned_loss=0.06575, over 21243.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3078, pruned_loss=0.07991, over 4273389.66 frames. ], batch size: 608, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:39:21,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2114766.0, ans=0.1 2023-06-26 01:39:39,080 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.04 vs. 
limit=10.0 2023-06-26 01:39:54,963 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2114826.0, ans=0.125 2023-06-26 01:40:14,680 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2114886.0, ans=0.125 2023-06-26 01:40:24,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2114946.0, ans=0.2 2023-06-26 01:40:25,799 INFO [train.py:996] (2/4) Epoch 12, batch 17050, loss[loss=0.2865, simple_loss=0.3613, pruned_loss=0.1058, over 21746.00 frames. ], tot_loss[loss=0.2409, simple_loss=0.3156, pruned_loss=0.08314, over 4281060.53 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:40:57,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=10.70 vs. limit=15.0 2023-06-26 01:41:00,559 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.31 vs. limit=22.5 2023-06-26 01:41:05,979 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.789e+02 8.713e+02 1.354e+03 1.958e+03 3.911e+03, threshold=2.708e+03, percent-clipped=23.0 2023-06-26 01:41:35,577 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 01:41:35,591 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2115126.0, ans=0.125 2023-06-26 01:42:13,581 INFO [train.py:996] (2/4) Epoch 12, batch 17100, loss[loss=0.2154, simple_loss=0.2866, pruned_loss=0.07213, over 21830.00 frames. ], tot_loss[loss=0.241, simple_loss=0.3147, pruned_loss=0.08367, over 4282982.99 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:43:01,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2115366.0, ans=0.0 2023-06-26 01:43:24,914 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2115426.0, ans=0.07 2023-06-26 01:44:02,123 INFO [train.py:996] (2/4) Epoch 12, batch 17150, loss[loss=0.2072, simple_loss=0.2963, pruned_loss=0.05901, over 21701.00 frames. ], tot_loss[loss=0.2391, simple_loss=0.3113, pruned_loss=0.08344, over 4287408.03 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:44:16,275 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2115546.0, ans=0.125 2023-06-26 01:44:19,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2115546.0, ans=0.1 2023-06-26 01:44:29,122 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.22 vs. 
limit=10.0 2023-06-26 01:44:43,619 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.448e+02 7.663e+02 1.094e+03 1.326e+03 2.492e+03, threshold=2.188e+03, percent-clipped=0.0 2023-06-26 01:45:07,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer_ff2.min_abs, batch_count=2115666.0, ans=0.1 2023-06-26 01:45:37,454 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.08 vs. limit=15.0 2023-06-26 01:45:52,536 INFO [train.py:996] (2/4) Epoch 12, batch 17200, loss[loss=0.2619, simple_loss=0.3257, pruned_loss=0.09904, over 21329.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3114, pruned_loss=0.08354, over 4287022.92 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 01:45:54,979 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2115846.0, ans=0.035 2023-06-26 01:46:40,088 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2115966.0, ans=0.125 2023-06-26 01:46:48,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2115966.0, ans=0.035 2023-06-26 01:47:09,315 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2116026.0, ans=0.1 2023-06-26 01:47:12,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2116026.0, ans=0.0 2023-06-26 01:47:45,190 INFO [train.py:996] (2/4) Epoch 12, batch 17250, loss[loss=0.2137, simple_loss=0.2981, pruned_loss=0.06463, over 21856.00 frames. ], tot_loss[loss=0.2424, simple_loss=0.3151, pruned_loss=0.08489, over 4287125.89 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:47:51,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2116146.0, ans=0.125 2023-06-26 01:48:42,520 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.045e+02 8.687e+02 1.216e+03 1.753e+03 5.268e+03, threshold=2.433e+03, percent-clipped=15.0 2023-06-26 01:49:35,573 INFO [train.py:996] (2/4) Epoch 12, batch 17300, loss[loss=0.2928, simple_loss=0.3626, pruned_loss=0.1115, over 21308.00 frames. ], tot_loss[loss=0.2494, simple_loss=0.3232, pruned_loss=0.08774, over 4281368.50 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:49:49,227 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.08 vs. limit=10.0 2023-06-26 01:50:20,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2116506.0, ans=0.05 2023-06-26 01:50:42,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2116566.0, ans=0.125 2023-06-26 01:50:44,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.73 vs. 
limit=15.0 2023-06-26 01:51:12,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2116686.0, ans=0.0 2023-06-26 01:51:18,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2116686.0, ans=0.1 2023-06-26 01:51:43,236 INFO [train.py:996] (2/4) Epoch 12, batch 17350, loss[loss=0.2454, simple_loss=0.3442, pruned_loss=0.0733, over 21510.00 frames. ], tot_loss[loss=0.2491, simple_loss=0.3251, pruned_loss=0.08651, over 4274532.21 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:51:55,525 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2116746.0, ans=0.025 2023-06-26 01:51:55,675 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2116746.0, ans=0.07 2023-06-26 01:52:02,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2116746.0, ans=0.0 2023-06-26 01:52:20,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2116806.0, ans=0.1 2023-06-26 01:52:33,990 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.805e+02 1.057e+03 1.442e+03 1.846e+03 4.357e+03, threshold=2.883e+03, percent-clipped=11.0 2023-06-26 01:52:40,564 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2116866.0, ans=0.125 2023-06-26 01:52:43,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2116866.0, ans=0.2 2023-06-26 01:52:45,079 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2116866.0, ans=0.125 2023-06-26 01:53:34,269 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2116986.0, ans=0.1 2023-06-26 01:53:38,859 INFO [train.py:996] (2/4) Epoch 12, batch 17400, loss[loss=0.2182, simple_loss=0.3035, pruned_loss=0.06648, over 21644.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.3208, pruned_loss=0.08329, over 4271147.71 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 01:54:28,377 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2117166.0, ans=0.0 2023-06-26 01:55:26,403 INFO [train.py:996] (2/4) Epoch 12, batch 17450, loss[loss=0.268, simple_loss=0.3435, pruned_loss=0.09628, over 21536.00 frames. ], tot_loss[loss=0.2379, simple_loss=0.3156, pruned_loss=0.08008, over 4273246.42 frames. 
], batch size: 508, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:55:38,747 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2117346.0, ans=0.1 2023-06-26 01:55:48,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2117406.0, ans=0.125 2023-06-26 01:55:51,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2117406.0, ans=0.125 2023-06-26 01:56:11,461 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.427e+02 9.633e+02 1.716e+03 2.627e+03 5.192e+03, threshold=3.432e+03, percent-clipped=16.0 2023-06-26 01:56:42,170 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=8.15 vs. limit=15.0 2023-06-26 01:56:44,695 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2117526.0, ans=0.125 2023-06-26 01:57:06,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2117586.0, ans=0.2 2023-06-26 01:57:16,269 INFO [train.py:996] (2/4) Epoch 12, batch 17500, loss[loss=0.2249, simple_loss=0.2948, pruned_loss=0.07746, over 21808.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3108, pruned_loss=0.07795, over 4280797.44 frames. ], batch size: 282, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:57:39,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2117706.0, ans=0.125 2023-06-26 01:57:45,510 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2117706.0, ans=0.125 2023-06-26 01:58:05,885 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2117766.0, ans=0.2 2023-06-26 01:58:52,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2117886.0, ans=0.125 2023-06-26 01:59:02,059 INFO [train.py:996] (2/4) Epoch 12, batch 17550, loss[loss=0.2097, simple_loss=0.3039, pruned_loss=0.05776, over 21795.00 frames. ], tot_loss[loss=0.2316, simple_loss=0.3106, pruned_loss=0.07634, over 4270786.77 frames. ], batch size: 316, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 01:59:16,024 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2117946.0, ans=0.0 2023-06-26 01:59:44,841 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.370e+02 8.396e+02 1.240e+03 1.763e+03 3.484e+03, threshold=2.480e+03, percent-clipped=3.0 2023-06-26 01:59:47,053 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2118066.0, ans=0.2 2023-06-26 02:00:02,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2118126.0, ans=0.125 2023-06-26 02:00:09,857 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.13 vs. 
limit=12.0 2023-06-26 02:00:12,536 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2118126.0, ans=0.125 2023-06-26 02:00:47,352 INFO [train.py:996] (2/4) Epoch 12, batch 17600, loss[loss=0.2665, simple_loss=0.3358, pruned_loss=0.09856, over 21272.00 frames. ], tot_loss[loss=0.2341, simple_loss=0.3136, pruned_loss=0.07732, over 4264762.86 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:00:59,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2118246.0, ans=0.125 2023-06-26 02:01:09,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2118306.0, ans=0.07 2023-06-26 02:02:30,421 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2118486.0, ans=0.125 2023-06-26 02:02:36,290 INFO [train.py:996] (2/4) Epoch 12, batch 17650, loss[loss=0.2569, simple_loss=0.3344, pruned_loss=0.08971, over 21541.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3131, pruned_loss=0.07813, over 4266692.73 frames. ], batch size: 473, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:02:54,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2118546.0, ans=0.125 2023-06-26 02:02:55,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2118546.0, ans=0.125 2023-06-26 02:03:27,853 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.781e+02 8.671e+02 1.388e+03 2.220e+03 4.878e+03, threshold=2.775e+03, percent-clipped=22.0 2023-06-26 02:03:28,289 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2118666.0, ans=0.1 2023-06-26 02:03:41,508 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2118666.0, ans=0.0 2023-06-26 02:03:41,581 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2118666.0, ans=0.125 2023-06-26 02:03:43,447 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.min_positive, batch_count=2118666.0, ans=0.05 2023-06-26 02:03:59,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2118726.0, ans=0.0 2023-06-26 02:04:11,323 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2118786.0, ans=0.125 2023-06-26 02:04:31,214 INFO [train.py:996] (2/4) Epoch 12, batch 17700, loss[loss=0.2192, simple_loss=0.2985, pruned_loss=0.06991, over 20150.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3072, pruned_loss=0.07571, over 4258627.84 frames. 
], batch size: 703, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:05:09,006 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2118906.0, ans=0.0 2023-06-26 02:05:58,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2119086.0, ans=0.0 2023-06-26 02:06:20,346 INFO [train.py:996] (2/4) Epoch 12, batch 17750, loss[loss=0.2455, simple_loss=0.332, pruned_loss=0.0795, over 21380.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3149, pruned_loss=0.07857, over 4260923.40 frames. ], batch size: 549, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:06:29,576 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2119146.0, ans=0.0 2023-06-26 02:06:42,527 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2119206.0, ans=0.0 2023-06-26 02:07:11,863 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer1.prob, batch_count=2119266.0, ans=0.125 2023-06-26 02:07:18,721 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.321e+02 9.910e+02 1.512e+03 2.052e+03 5.083e+03, threshold=3.025e+03, percent-clipped=13.0 2023-06-26 02:07:19,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2119266.0, ans=0.1 2023-06-26 02:07:54,806 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2119386.0, ans=0.07 2023-06-26 02:07:56,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2119386.0, ans=0.1 2023-06-26 02:08:11,448 INFO [train.py:996] (2/4) Epoch 12, batch 17800, loss[loss=0.2015, simple_loss=0.2709, pruned_loss=0.06604, over 21173.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3141, pruned_loss=0.07787, over 4259872.00 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:08:58,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2119506.0, ans=0.1 2023-06-26 02:09:18,047 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2119566.0, ans=0.0 2023-06-26 02:10:07,332 INFO [train.py:996] (2/4) Epoch 12, batch 17850, loss[loss=0.2826, simple_loss=0.3503, pruned_loss=0.1075, over 21732.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3153, pruned_loss=0.07869, over 4266940.72 frames. ], batch size: 441, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:10:08,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.10 vs. limit=15.0 2023-06-26 02:10:12,242 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.30 vs. 
limit=15.0 2023-06-26 02:10:40,635 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2119806.0, ans=10.0 2023-06-26 02:10:58,031 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.213e+02 1.072e+03 1.677e+03 2.667e+03 5.853e+03, threshold=3.353e+03, percent-clipped=21.0 2023-06-26 02:11:02,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2119866.0, ans=0.125 2023-06-26 02:11:47,466 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2119986.0, ans=0.125 2023-06-26 02:12:03,564 INFO [train.py:996] (2/4) Epoch 12, batch 17900, loss[loss=0.2139, simple_loss=0.3024, pruned_loss=0.06266, over 21248.00 frames. ], tot_loss[loss=0.2399, simple_loss=0.3194, pruned_loss=0.08023, over 4269968.05 frames. ], batch size: 159, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:12:05,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2120046.0, ans=0.125 2023-06-26 02:12:55,730 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer2.min_abs, batch_count=2120166.0, ans=0.5 2023-06-26 02:13:02,646 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2120166.0, ans=0.0 2023-06-26 02:13:04,490 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2120166.0, ans=0.125 2023-06-26 02:13:30,412 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-26 02:13:31,982 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.72 vs. limit=22.5 2023-06-26 02:13:59,308 INFO [train.py:996] (2/4) Epoch 12, batch 17950, loss[loss=0.1734, simple_loss=0.2499, pruned_loss=0.04841, over 21124.00 frames. ], tot_loss[loss=0.2361, simple_loss=0.3182, pruned_loss=0.07697, over 4269469.58 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:14:44,279 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.194e+02 1.009e+03 1.518e+03 2.027e+03 4.283e+03, threshold=3.036e+03, percent-clipped=1.0 2023-06-26 02:15:49,544 INFO [train.py:996] (2/4) Epoch 12, batch 18000, loss[loss=0.2182, simple_loss=0.2826, pruned_loss=0.07687, over 21785.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3129, pruned_loss=0.07545, over 4257894.22 frames. ], batch size: 317, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 02:15:49,545 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 02:16:07,689 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.258, simple_loss=0.3529, pruned_loss=0.08158, over 1796401.00 frames. 2023-06-26 02:16:07,690 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-26 02:16:16,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2120646.0, ans=0.1 2023-06-26 02:16:27,497 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.53 vs. 
limit=15.0 2023-06-26 02:17:01,005 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2120766.0, ans=0.0 2023-06-26 02:17:48,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2120886.0, ans=0.0 2023-06-26 02:17:55,808 INFO [train.py:996] (2/4) Epoch 12, batch 18050, loss[loss=0.221, simple_loss=0.2911, pruned_loss=0.07549, over 21622.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3074, pruned_loss=0.07498, over 4257798.96 frames. ], batch size: 247, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:18:55,880 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.179e+02 8.208e+02 1.156e+03 1.748e+03 3.501e+03, threshold=2.312e+03, percent-clipped=3.0 2023-06-26 02:19:12,180 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2121126.0, ans=0.125 2023-06-26 02:19:24,384 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2121126.0, ans=0.0 2023-06-26 02:19:36,596 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2121186.0, ans=0.125 2023-06-26 02:19:50,059 INFO [train.py:996] (2/4) Epoch 12, batch 18100, loss[loss=0.2363, simple_loss=0.3196, pruned_loss=0.07652, over 21591.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3111, pruned_loss=0.07778, over 4256643.72 frames. ], batch size: 112, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:20:10,801 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2121246.0, ans=0.0 2023-06-26 02:20:29,369 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2121366.0, ans=0.125 2023-06-26 02:20:54,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2121366.0, ans=0.0 2023-06-26 02:21:10,487 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.66 vs. limit=15.0 2023-06-26 02:21:13,922 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 02:21:21,775 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2121486.0, ans=0.2 2023-06-26 02:21:36,986 INFO [train.py:996] (2/4) Epoch 12, batch 18150, loss[loss=0.2175, simple_loss=0.2737, pruned_loss=0.08071, over 21813.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3144, pruned_loss=0.07758, over 4255937.12 frames. 
], batch size: 112, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:22:34,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2121666.0, ans=0.0 2023-06-26 02:22:36,068 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.496e+02 9.455e+02 1.549e+03 2.118e+03 3.915e+03, threshold=3.098e+03, percent-clipped=17.0 2023-06-26 02:22:41,928 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2121666.0, ans=0.2 2023-06-26 02:22:59,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2121726.0, ans=0.0 2023-06-26 02:23:00,873 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2121786.0, ans=0.1 2023-06-26 02:23:24,022 INFO [train.py:996] (2/4) Epoch 12, batch 18200, loss[loss=0.2024, simple_loss=0.2749, pruned_loss=0.06495, over 21807.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3082, pruned_loss=0.07715, over 4255813.74 frames. ], batch size: 102, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:23:49,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2121906.0, ans=0.125 2023-06-26 02:23:51,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2121906.0, ans=0.125 2023-06-26 02:24:33,987 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.whiten, num_groups=1, num_channels=192, metric=4.34 vs. limit=12.0 2023-06-26 02:25:00,606 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2122146.0, ans=0.125 2023-06-26 02:25:01,798 INFO [train.py:996] (2/4) Epoch 12, batch 18250, loss[loss=0.18, simple_loss=0.2603, pruned_loss=0.04988, over 21859.00 frames. ], tot_loss[loss=0.2244, simple_loss=0.2999, pruned_loss=0.07447, over 4261593.82 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:25:02,244 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2122146.0, ans=0.0 2023-06-26 02:25:28,682 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2122206.0, ans=0.0 2023-06-26 02:25:36,179 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2122206.0, ans=0.1 2023-06-26 02:25:54,176 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.441e+02 8.526e+02 1.132e+03 1.532e+03 3.016e+03, threshold=2.265e+03, percent-clipped=0.0 2023-06-26 02:26:04,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.76 vs. limit=15.0 2023-06-26 02:26:24,621 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2122326.0, ans=0.2 2023-06-26 02:26:48,436 INFO [train.py:996] (2/4) Epoch 12, batch 18300, loss[loss=0.1991, simple_loss=0.2653, pruned_loss=0.06649, over 21300.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2993, pruned_loss=0.07387, over 4268661.70 frames. 
], batch size: 176, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:27:47,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2122566.0, ans=0.025 2023-06-26 02:28:09,061 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.84 vs. limit=8.0 2023-06-26 02:28:34,404 INFO [train.py:996] (2/4) Epoch 12, batch 18350, loss[loss=0.208, simple_loss=0.2799, pruned_loss=0.06805, over 21839.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3031, pruned_loss=0.07327, over 4268956.39 frames. ], batch size: 118, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:29:22,728 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2122866.0, ans=10.0 2023-06-26 02:29:31,223 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.251e+02 1.279e+03 1.909e+03 2.961e+03 4.815e+03, threshold=3.819e+03, percent-clipped=39.0 2023-06-26 02:30:26,202 INFO [train.py:996] (2/4) Epoch 12, batch 18400, loss[loss=0.1965, simple_loss=0.2716, pruned_loss=0.06067, over 21212.00 frames. ], tot_loss[loss=0.2218, simple_loss=0.2995, pruned_loss=0.07202, over 4272535.45 frames. ], batch size: 176, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:31:19,421 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=14.44 vs. limit=22.5 2023-06-26 02:32:13,298 INFO [train.py:996] (2/4) Epoch 12, batch 18450, loss[loss=0.1846, simple_loss=0.2582, pruned_loss=0.05548, over 21690.00 frames. ], tot_loss[loss=0.2184, simple_loss=0.2975, pruned_loss=0.06959, over 4267867.67 frames. ], batch size: 298, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:33:07,634 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.751e+02 8.419e+02 1.219e+03 1.812e+03 4.554e+03, threshold=2.437e+03, percent-clipped=1.0 2023-06-26 02:33:09,018 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.whiten, num_groups=1, num_channels=192, metric=3.92 vs. limit=12.0 2023-06-26 02:33:38,023 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2123526.0, ans=0.035 2023-06-26 02:34:00,020 INFO [train.py:996] (2/4) Epoch 12, batch 18500, loss[loss=0.2053, simple_loss=0.271, pruned_loss=0.06973, over 21811.00 frames. ], tot_loss[loss=0.2154, simple_loss=0.2938, pruned_loss=0.06848, over 4260491.77 frames. ], batch size: 124, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:35:42,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2123886.0, ans=0.0 2023-06-26 02:35:50,248 INFO [train.py:996] (2/4) Epoch 12, batch 18550, loss[loss=0.2402, simple_loss=0.2958, pruned_loss=0.09228, over 21361.00 frames. ], tot_loss[loss=0.2145, simple_loss=0.2927, pruned_loss=0.06811, over 4254490.44 frames. 
], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:36:01,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2123946.0, ans=0.125 2023-06-26 02:36:04,883 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2123946.0, ans=10.0 2023-06-26 02:36:07,973 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2123946.0, ans=0.125 2023-06-26 02:36:35,248 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2124066.0, ans=0.125 2023-06-26 02:36:59,193 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.320e+02 1.196e+03 2.047e+03 2.769e+03 5.158e+03, threshold=4.094e+03, percent-clipped=37.0 2023-06-26 02:36:59,607 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2124066.0, ans=0.0 2023-06-26 02:37:47,742 INFO [train.py:996] (2/4) Epoch 12, batch 18600, loss[loss=0.2734, simple_loss=0.3566, pruned_loss=0.09516, over 21596.00 frames. ], tot_loss[loss=0.2144, simple_loss=0.2911, pruned_loss=0.06888, over 4256679.34 frames. ], batch size: 442, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:38:13,714 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.41 vs. limit=15.0 2023-06-26 02:38:18,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2124306.0, ans=0.0 2023-06-26 02:38:53,645 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2124366.0, ans=0.07 2023-06-26 02:39:00,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2124426.0, ans=0.1 2023-06-26 02:39:18,875 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=15.0 2023-06-26 02:39:30,991 INFO [train.py:996] (2/4) Epoch 12, batch 18650, loss[loss=0.2213, simple_loss=0.2924, pruned_loss=0.07514, over 21575.00 frames. ], tot_loss[loss=0.2135, simple_loss=0.2902, pruned_loss=0.06838, over 4264298.37 frames. ], batch size: 391, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:39:43,922 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2124546.0, ans=0.125 2023-06-26 02:40:25,766 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.40 vs. limit=6.0 2023-06-26 02:40:30,396 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.144e+02 8.452e+02 1.273e+03 1.829e+03 4.021e+03, threshold=2.546e+03, percent-clipped=0.0 2023-06-26 02:40:45,285 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.layerdrop_rate, batch_count=2124726.0, ans=0.015 2023-06-26 02:41:20,345 INFO [train.py:996] (2/4) Epoch 12, batch 18700, loss[loss=0.2149, simple_loss=0.2867, pruned_loss=0.07152, over 21873.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2873, pruned_loss=0.06973, over 4264111.34 frames. 
], batch size: 371, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:42:20,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2124966.0, ans=0.2 2023-06-26 02:42:33,777 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2125026.0, ans=0.1 2023-06-26 02:43:09,819 INFO [train.py:996] (2/4) Epoch 12, batch 18750, loss[loss=0.2197, simple_loss=0.2866, pruned_loss=0.07638, over 21458.00 frames. ], tot_loss[loss=0.2162, simple_loss=0.288, pruned_loss=0.07216, over 4266932.48 frames. ], batch size: 131, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:43:35,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2125206.0, ans=0.125 2023-06-26 02:44:08,702 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.434e+02 9.819e+02 1.428e+03 2.632e+03 5.661e+03, threshold=2.856e+03, percent-clipped=25.0 2023-06-26 02:44:38,485 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.23 vs. limit=15.0 2023-06-26 02:44:57,205 INFO [train.py:996] (2/4) Epoch 12, batch 18800, loss[loss=0.2397, simple_loss=0.3017, pruned_loss=0.08882, over 21215.00 frames. ], tot_loss[loss=0.2225, simple_loss=0.2961, pruned_loss=0.07446, over 4251674.73 frames. ], batch size: 143, lr: 2.41e-03, grad_scale: 32.0 2023-06-26 02:45:39,069 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2125506.0, ans=0.0 2023-06-26 02:46:44,220 INFO [train.py:996] (2/4) Epoch 12, batch 18850, loss[loss=0.2136, simple_loss=0.2878, pruned_loss=0.06971, over 21803.00 frames. ], tot_loss[loss=0.2175, simple_loss=0.294, pruned_loss=0.07049, over 4257369.08 frames. ], batch size: 118, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:47:10,659 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.64 vs. limit=15.0 2023-06-26 02:47:45,912 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.700e+02 8.171e+02 1.119e+03 1.966e+03 5.674e+03, threshold=2.238e+03, percent-clipped=7.0 2023-06-26 02:48:24,289 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.54 vs. limit=15.0 2023-06-26 02:48:31,256 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.24 vs. limit=15.0 2023-06-26 02:48:31,738 INFO [train.py:996] (2/4) Epoch 12, batch 18900, loss[loss=0.1655, simple_loss=0.2266, pruned_loss=0.05217, over 20777.00 frames. ], tot_loss[loss=0.2134, simple_loss=0.2886, pruned_loss=0.06911, over 4264848.64 frames. ], batch size: 609, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:49:08,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2126166.0, ans=0.125 2023-06-26 02:50:19,385 INFO [train.py:996] (2/4) Epoch 12, batch 18950, loss[loss=0.2321, simple_loss=0.2927, pruned_loss=0.08575, over 21445.00 frames. ], tot_loss[loss=0.2167, simple_loss=0.2903, pruned_loss=0.07149, over 4279548.17 frames. 
], batch size: 194, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:50:24,100 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.72 vs. limit=10.0 2023-06-26 02:51:21,183 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.969e+02 7.601e+02 1.025e+03 1.566e+03 4.291e+03, threshold=2.050e+03, percent-clipped=8.0 2023-06-26 02:51:24,890 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer_na.min_abs, batch_count=2126466.0, ans=0.02 2023-06-26 02:51:25,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.86 vs. limit=10.0 2023-06-26 02:51:50,370 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2126586.0, ans=0.125 2023-06-26 02:51:54,631 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=4.61 vs. limit=15.0 2023-06-26 02:52:07,277 INFO [train.py:996] (2/4) Epoch 12, batch 19000, loss[loss=0.3088, simple_loss=0.3711, pruned_loss=0.1232, over 21414.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3004, pruned_loss=0.07399, over 4283910.51 frames. ], batch size: 471, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:53:55,646 INFO [train.py:996] (2/4) Epoch 12, batch 19050, loss[loss=0.2321, simple_loss=0.2937, pruned_loss=0.08523, over 20076.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3062, pruned_loss=0.07863, over 4280725.32 frames. ], batch size: 703, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:54:00,334 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2126946.0, ans=0.0 2023-06-26 02:55:00,807 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.866e+02 8.291e+02 1.142e+03 1.622e+03 3.399e+03, threshold=2.283e+03, percent-clipped=17.0 2023-06-26 02:55:46,488 INFO [train.py:996] (2/4) Epoch 12, batch 19100, loss[loss=0.2052, simple_loss=0.2666, pruned_loss=0.0719, over 20781.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3038, pruned_loss=0.0795, over 4279895.67 frames. ], batch size: 608, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:56:31,125 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2127306.0, ans=0.0 2023-06-26 02:56:38,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2127366.0, ans=0.1 2023-06-26 02:56:41,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2127366.0, ans=0.025 2023-06-26 02:57:49,152 INFO [train.py:996] (2/4) Epoch 12, batch 19150, loss[loss=0.2422, simple_loss=0.3352, pruned_loss=0.07456, over 21581.00 frames. ], tot_loss[loss=0.2335, simple_loss=0.3064, pruned_loss=0.08026, over 4280170.57 frames. 
], batch size: 230, lr: 2.41e-03, grad_scale: 8.0 2023-06-26 02:58:41,548 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2127666.0, ans=0.0 2023-06-26 02:58:50,856 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.298e+02 9.303e+02 1.285e+03 2.071e+03 6.086e+03, threshold=2.570e+03, percent-clipped=20.0 2023-06-26 02:58:55,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2127726.0, ans=0.125 2023-06-26 02:58:58,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2127726.0, ans=0.125 2023-06-26 02:58:58,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.07 vs. limit=10.0 2023-06-26 02:59:07,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2127726.0, ans=0.125 2023-06-26 02:59:33,308 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.46 vs. limit=15.0 2023-06-26 02:59:48,653 INFO [train.py:996] (2/4) Epoch 12, batch 19200, loss[loss=0.2917, simple_loss=0.3881, pruned_loss=0.09767, over 21682.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3175, pruned_loss=0.08185, over 4280084.69 frames. ], batch size: 389, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 02:59:56,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2127846.0, ans=0.125 2023-06-26 03:00:03,178 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:01:12,064 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2128086.0, ans=0.125 2023-06-26 03:01:36,764 INFO [train.py:996] (2/4) Epoch 12, batch 19250, loss[loss=0.2012, simple_loss=0.2858, pruned_loss=0.05831, over 21875.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3169, pruned_loss=0.0765, over 4264492.33 frames. ], batch size: 371, lr: 2.41e-03, grad_scale: 16.0 2023-06-26 03:01:51,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2128146.0, ans=0.2 2023-06-26 03:02:28,901 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.722e+02 8.401e+02 1.172e+03 1.980e+03 3.719e+03, threshold=2.345e+03, percent-clipped=11.0 2023-06-26 03:02:54,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2128386.0, ans=0.125 2023-06-26 03:03:18,857 INFO [train.py:996] (2/4) Epoch 12, batch 19300, loss[loss=0.2032, simple_loss=0.2764, pruned_loss=0.06502, over 21255.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.313, pruned_loss=0.07416, over 4266119.54 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:04:57,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2128686.0, ans=0.1 2023-06-26 03:05:09,199 INFO [train.py:996] (2/4) Epoch 12, batch 19350, loss[loss=0.1758, simple_loss=0.2598, pruned_loss=0.04587, over 21528.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.3073, pruned_loss=0.07094, over 4263811.05 frames. 
], batch size: 212, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:06:02,469 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.957e+02 9.080e+02 1.347e+03 2.318e+03 4.849e+03, threshold=2.694e+03, percent-clipped=24.0 2023-06-26 03:06:03,101 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2128866.0, ans=0.125 2023-06-26 03:06:30,967 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2128926.0, ans=0.1 2023-06-26 03:06:30,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2128926.0, ans=0.2 2023-06-26 03:06:57,169 INFO [train.py:996] (2/4) Epoch 12, batch 19400, loss[loss=0.2495, simple_loss=0.3219, pruned_loss=0.08855, over 21758.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3046, pruned_loss=0.07046, over 4268944.04 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:07:12,247 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=7.83 vs. limit=15.0 2023-06-26 03:08:02,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2129226.0, ans=0.125 2023-06-26 03:08:20,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2129286.0, ans=0.2 2023-06-26 03:08:45,421 INFO [train.py:996] (2/4) Epoch 12, batch 19450, loss[loss=0.2034, simple_loss=0.262, pruned_loss=0.07234, over 21096.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.3028, pruned_loss=0.07284, over 4279622.09 frames. ], batch size: 608, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:08:46,462 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=11.11 vs. limit=15.0 2023-06-26 03:08:49,930 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2129346.0, ans=0.125 2023-06-26 03:08:54,558 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2129346.0, ans=0.2 2023-06-26 03:09:23,904 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2129466.0, ans=0.125 2023-06-26 03:09:36,555 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.95 vs. limit=15.0 2023-06-26 03:09:38,487 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.528e+02 8.932e+02 1.244e+03 1.603e+03 3.427e+03, threshold=2.488e+03, percent-clipped=5.0 2023-06-26 03:10:32,570 INFO [train.py:996] (2/4) Epoch 12, batch 19500, loss[loss=0.1995, simple_loss=0.2608, pruned_loss=0.06907, over 21462.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.2979, pruned_loss=0.0734, over 4284753.21 frames. 
], batch size: 195, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:11:25,366 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2129766.0, ans=0.125 2023-06-26 03:12:18,573 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2129886.0, ans=0.125 2023-06-26 03:12:21,319 INFO [train.py:996] (2/4) Epoch 12, batch 19550, loss[loss=0.2241, simple_loss=0.3026, pruned_loss=0.07278, over 21214.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2939, pruned_loss=0.07249, over 4284716.24 frames. ], batch size: 159, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:13:05,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2130066.0, ans=0.125 2023-06-26 03:13:15,142 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.254e+02 9.637e+02 1.286e+03 1.805e+03 3.756e+03, threshold=2.572e+03, percent-clipped=14.0 2023-06-26 03:13:49,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2130186.0, ans=0.1 2023-06-26 03:13:57,203 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2130186.0, ans=0.125 2023-06-26 03:14:09,966 INFO [train.py:996] (2/4) Epoch 12, batch 19600, loss[loss=0.248, simple_loss=0.3083, pruned_loss=0.09387, over 21764.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2955, pruned_loss=0.07283, over 4288307.08 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:14:18,504 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.44 vs. limit=6.0 2023-06-26 03:14:31,277 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2130306.0, ans=0.2 2023-06-26 03:14:56,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.46 vs. limit=8.0 2023-06-26 03:15:41,671 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:16:00,207 INFO [train.py:996] (2/4) Epoch 12, batch 19650, loss[loss=0.2614, simple_loss=0.328, pruned_loss=0.09746, over 21857.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3004, pruned_loss=0.07591, over 4287060.44 frames. ], batch size: 351, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:17:06,810 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.323e+02 8.252e+02 1.350e+03 1.732e+03 4.354e+03, threshold=2.700e+03, percent-clipped=5.0 2023-06-26 03:17:39,073 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-26 03:18:00,761 INFO [train.py:996] (2/4) Epoch 12, batch 19700, loss[loss=0.2223, simple_loss=0.3125, pruned_loss=0.06603, over 21726.00 frames. ], tot_loss[loss=0.2304, simple_loss=0.306, pruned_loss=0.07743, over 4288149.43 frames. 
], batch size: 298, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:18:09,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2130846.0, ans=0.2 2023-06-26 03:18:11,342 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.37 vs. limit=15.0 2023-06-26 03:18:16,337 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:18:47,474 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2130966.0, ans=0.125 2023-06-26 03:18:47,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2130966.0, ans=0.125 2023-06-26 03:18:57,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2130966.0, ans=0.0 2023-06-26 03:19:01,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2130966.0, ans=0.125 2023-06-26 03:19:09,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2131026.0, ans=0.1 2023-06-26 03:19:15,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2131026.0, ans=0.2 2023-06-26 03:19:39,743 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2131086.0, ans=0.04949747468305833 2023-06-26 03:19:51,711 INFO [train.py:996] (2/4) Epoch 12, batch 19750, loss[loss=0.4204, simple_loss=0.4833, pruned_loss=0.1787, over 21440.00 frames. ], tot_loss[loss=0.2363, simple_loss=0.3158, pruned_loss=0.07839, over 4278037.25 frames. ], batch size: 507, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:20:46,329 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=2131266.0, ans=0.04949747468305833 2023-06-26 03:20:59,644 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.834e+02 9.206e+02 1.478e+03 2.437e+03 4.883e+03, threshold=2.956e+03, percent-clipped=21.0 2023-06-26 03:21:00,850 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.65 vs. limit=15.0 2023-06-26 03:21:41,939 INFO [train.py:996] (2/4) Epoch 12, batch 19800, loss[loss=0.1951, simple_loss=0.2744, pruned_loss=0.05792, over 21771.00 frames. ], tot_loss[loss=0.237, simple_loss=0.316, pruned_loss=0.07907, over 4285323.45 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:22:02,873 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.73 vs. limit=15.0 2023-06-26 03:22:35,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2131566.0, ans=0.125 2023-06-26 03:23:33,444 INFO [train.py:996] (2/4) Epoch 12, batch 19850, loss[loss=0.2011, simple_loss=0.288, pruned_loss=0.05715, over 21689.00 frames. ], tot_loss[loss=0.228, simple_loss=0.3069, pruned_loss=0.07457, over 4273088.17 frames. 
], batch size: 332, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:24:31,191 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2131866.0, ans=0.125 2023-06-26 03:24:41,352 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.737e+02 9.499e+02 1.416e+03 2.010e+03 4.711e+03, threshold=2.833e+03, percent-clipped=4.0 2023-06-26 03:25:05,661 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.03 vs. limit=22.5 2023-06-26 03:25:12,126 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2131986.0, ans=0.0 2023-06-26 03:25:29,026 INFO [train.py:996] (2/4) Epoch 12, batch 19900, loss[loss=0.1866, simple_loss=0.2729, pruned_loss=0.05015, over 21360.00 frames. ], tot_loss[loss=0.2255, simple_loss=0.3073, pruned_loss=0.07187, over 4277776.97 frames. ], batch size: 211, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:26:24,190 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2132166.0, ans=0.1 2023-06-26 03:26:33,422 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2132166.0, ans=0.2 2023-06-26 03:27:24,340 INFO [train.py:996] (2/4) Epoch 12, batch 19950, loss[loss=0.2571, simple_loss=0.326, pruned_loss=0.09413, over 21402.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.3018, pruned_loss=0.07189, over 4267300.00 frames. ], batch size: 507, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:27:30,993 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2132346.0, ans=0.125 2023-06-26 03:28:29,305 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.346e+02 8.702e+02 1.204e+03 1.763e+03 4.092e+03, threshold=2.408e+03, percent-clipped=5.0 2023-06-26 03:28:34,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.balancer2.prob, batch_count=2132526.0, ans=0.125 2023-06-26 03:29:17,278 INFO [train.py:996] (2/4) Epoch 12, batch 20000, loss[loss=0.2232, simple_loss=0.297, pruned_loss=0.07472, over 21884.00 frames. ], tot_loss[loss=0.2224, simple_loss=0.3008, pruned_loss=0.072, over 4265449.40 frames. ], batch size: 118, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:29:42,929 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:30:29,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.25 vs. limit=22.5 2023-06-26 03:31:06,186 INFO [train.py:996] (2/4) Epoch 12, batch 20050, loss[loss=0.2326, simple_loss=0.2972, pruned_loss=0.08399, over 21246.00 frames. ], tot_loss[loss=0.2261, simple_loss=0.3032, pruned_loss=0.07451, over 4277275.43 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:31:18,717 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2132946.0, ans=0.125 2023-06-26 03:31:58,140 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.11 vs. 
limit=22.5 2023-06-26 03:32:07,815 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.733e+02 9.408e+02 1.346e+03 1.718e+03 3.911e+03, threshold=2.692e+03, percent-clipped=11.0 2023-06-26 03:32:11,781 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2133126.0, ans=0.2 2023-06-26 03:32:15,539 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.max_abs, batch_count=2133126.0, ans=10.0 2023-06-26 03:32:55,545 INFO [train.py:996] (2/4) Epoch 12, batch 20100, loss[loss=0.2354, simple_loss=0.3168, pruned_loss=0.07702, over 21745.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3048, pruned_loss=0.07652, over 4287191.62 frames. ], batch size: 389, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:33:21,642 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.10 vs. limit=10.0 2023-06-26 03:33:40,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.scale_min, batch_count=2133366.0, ans=0.2 2023-06-26 03:34:46,172 INFO [train.py:996] (2/4) Epoch 12, batch 20150, loss[loss=0.3252, simple_loss=0.3786, pruned_loss=0.1359, over 21307.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3118, pruned_loss=0.07925, over 4290703.53 frames. ], batch size: 507, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:35:06,851 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-26 03:35:29,948 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.61 vs. limit=22.5 2023-06-26 03:36:04,693 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.317e+02 8.895e+02 1.198e+03 1.726e+03 5.010e+03, threshold=2.397e+03, percent-clipped=8.0 2023-06-26 03:36:05,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2133726.0, ans=0.2 2023-06-26 03:36:53,585 INFO [train.py:996] (2/4) Epoch 12, batch 20200, loss[loss=0.2039, simple_loss=0.268, pruned_loss=0.06989, over 21837.00 frames. ], tot_loss[loss=0.242, simple_loss=0.3196, pruned_loss=0.08218, over 4285984.64 frames. ], batch size: 107, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:37:44,808 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.52 vs. 
limit=15.0 2023-06-26 03:37:46,195 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2133966.0, ans=0.2 2023-06-26 03:37:46,237 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2133966.0, ans=0.2 2023-06-26 03:37:58,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2134026.0, ans=0.2 2023-06-26 03:38:04,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2134026.0, ans=0.1 2023-06-26 03:38:09,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2134026.0, ans=0.1 2023-06-26 03:38:46,318 INFO [train.py:996] (2/4) Epoch 12, batch 20250, loss[loss=0.2142, simple_loss=0.3052, pruned_loss=0.06161, over 21822.00 frames. ], tot_loss[loss=0.2404, simple_loss=0.3196, pruned_loss=0.08065, over 4279166.66 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:38:50,528 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2134146.0, ans=0.1 2023-06-26 03:38:54,697 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.16 vs. limit=6.0 2023-06-26 03:38:57,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2134146.0, ans=0.125 2023-06-26 03:39:37,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2134266.0, ans=0.0 2023-06-26 03:39:49,450 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.636e+02 8.204e+02 1.224e+03 2.058e+03 5.091e+03, threshold=2.449e+03, percent-clipped=18.0 2023-06-26 03:40:32,180 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.99 vs. limit=12.0 2023-06-26 03:40:38,143 INFO [train.py:996] (2/4) Epoch 12, batch 20300, loss[loss=0.2251, simple_loss=0.318, pruned_loss=0.06609, over 21783.00 frames. ], tot_loss[loss=0.2377, simple_loss=0.3184, pruned_loss=0.07849, over 4283138.63 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:40:42,387 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.70 vs. limit=10.0 2023-06-26 03:40:53,508 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=9.59 vs. limit=22.5 2023-06-26 03:40:55,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=5.60 vs. limit=15.0 2023-06-26 03:41:18,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2134506.0, ans=0.125 2023-06-26 03:42:27,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2134746.0, ans=0.125 2023-06-26 03:42:28,312 INFO [train.py:996] (2/4) Epoch 12, batch 20350, loss[loss=0.2004, simple_loss=0.2855, pruned_loss=0.05766, over 21873.00 frames. 
], tot_loss[loss=0.2392, simple_loss=0.3193, pruned_loss=0.07954, over 4287334.69 frames. ], batch size: 98, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:43:10,449 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2134806.0, ans=0.125 2023-06-26 03:43:31,515 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.507e+02 1.038e+03 1.419e+03 2.103e+03 3.160e+03, threshold=2.839e+03, percent-clipped=11.0 2023-06-26 03:43:39,996 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2134926.0, ans=0.1 2023-06-26 03:43:52,609 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2134986.0, ans=0.125 2023-06-26 03:44:09,546 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer2.min_positive, batch_count=2134986.0, ans=0.05 2023-06-26 03:44:11,120 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2134986.0, ans=0.07 2023-06-26 03:44:16,418 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=2.684e-03 2023-06-26 03:44:17,368 INFO [train.py:996] (2/4) Epoch 12, batch 20400, loss[loss=0.2385, simple_loss=0.3192, pruned_loss=0.0789, over 21908.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.3222, pruned_loss=0.08253, over 4283030.67 frames. ], batch size: 371, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:44:28,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2135046.0, ans=0.025 2023-06-26 03:44:50,325 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2135106.0, ans=0.125 2023-06-26 03:45:13,326 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135166.0, ans=0.1 2023-06-26 03:45:21,918 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2135226.0, ans=15.0 2023-06-26 03:45:53,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten.whitening_limit, batch_count=2135286.0, ans=15.0 2023-06-26 03:46:02,276 INFO [train.py:996] (2/4) Epoch 12, batch 20450, loss[loss=0.2126, simple_loss=0.262, pruned_loss=0.08162, over 20219.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3219, pruned_loss=0.0844, over 4259621.12 frames. 
], batch size: 703, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:46:49,290 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.max_positive, batch_count=2135466.0, ans=0.95 2023-06-26 03:47:03,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2135466.0, ans=0.0 2023-06-26 03:47:05,915 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.613e+02 8.799e+02 1.280e+03 2.012e+03 4.043e+03, threshold=2.560e+03, percent-clipped=9.0 2023-06-26 03:47:18,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2135526.0, ans=0.07 2023-06-26 03:47:47,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2135586.0, ans=0.125 2023-06-26 03:47:52,246 INFO [train.py:996] (2/4) Epoch 12, batch 20500, loss[loss=0.2203, simple_loss=0.285, pruned_loss=0.07786, over 21692.00 frames. ], tot_loss[loss=0.2437, simple_loss=0.318, pruned_loss=0.08466, over 4261445.24 frames. ], batch size: 247, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:48:06,744 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:48:21,378 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2135706.0, ans=0.2 2023-06-26 03:48:34,702 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2135706.0, ans=0.0 2023-06-26 03:48:38,916 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.29 vs. limit=22.5 2023-06-26 03:48:39,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2135766.0, ans=0.1 2023-06-26 03:49:41,367 INFO [train.py:996] (2/4) Epoch 12, batch 20550, loss[loss=0.2314, simple_loss=0.3283, pruned_loss=0.06721, over 21182.00 frames. ], tot_loss[loss=0.2384, simple_loss=0.3112, pruned_loss=0.0828, over 4256796.05 frames. ], batch size: 548, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:50:26,529 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.67 vs. limit=12.0 2023-06-26 03:50:33,320 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2136066.0, ans=0.125 2023-06-26 03:50:48,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.549e+02 9.305e+02 1.211e+03 1.898e+03 4.893e+03, threshold=2.421e+03, percent-clipped=7.0 2023-06-26 03:51:32,358 INFO [train.py:996] (2/4) Epoch 12, batch 20600, loss[loss=0.2555, simple_loss=0.3229, pruned_loss=0.09403, over 21757.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3114, pruned_loss=0.07983, over 4256156.73 frames. 
], batch size: 112, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:51:48,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2136246.0, ans=0.125 2023-06-26 03:52:09,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2136306.0, ans=0.0 2023-06-26 03:52:20,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2136366.0, ans=0.125 2023-06-26 03:52:25,316 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2136366.0, ans=0.125 2023-06-26 03:53:19,375 INFO [train.py:996] (2/4) Epoch 12, batch 20650, loss[loss=0.1837, simple_loss=0.2487, pruned_loss=0.05932, over 21278.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3077, pruned_loss=0.08, over 4251648.49 frames. ], batch size: 160, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:54:23,269 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.638e+02 7.911e+02 1.093e+03 1.392e+03 2.795e+03, threshold=2.187e+03, percent-clipped=3.0 2023-06-26 03:54:37,944 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2136726.0, ans=0.0 2023-06-26 03:54:39,441 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 03:55:06,906 INFO [train.py:996] (2/4) Epoch 12, batch 20700, loss[loss=0.2724, simple_loss=0.3555, pruned_loss=0.09459, over 21624.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3015, pruned_loss=0.07713, over 4242254.96 frames. ], batch size: 414, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:55:31,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2136906.0, ans=0.125 2023-06-26 03:56:00,219 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2136966.0, ans=0.0 2023-06-26 03:56:32,305 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff2_skip_rate, batch_count=2137026.0, ans=0.0 2023-06-26 03:56:57,175 INFO [train.py:996] (2/4) Epoch 12, batch 20750, loss[loss=0.2567, simple_loss=0.383, pruned_loss=0.06522, over 20794.00 frames. ], tot_loss[loss=0.2281, simple_loss=0.3037, pruned_loss=0.07629, over 4251583.69 frames. ], batch size: 607, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 03:57:52,701 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2137266.0, ans=0.0 2023-06-26 03:57:55,670 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2137266.0, ans=0.2 2023-06-26 03:57:55,753 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff3_skip_rate, batch_count=2137266.0, ans=0.0 2023-06-26 03:58:14,464 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.764e+02 1.037e+03 1.665e+03 2.328e+03 7.151e+03, threshold=3.329e+03, percent-clipped=27.0 2023-06-26 03:58:51,627 INFO [train.py:996] (2/4) Epoch 12, batch 20800, loss[loss=0.221, simple_loss=0.285, pruned_loss=0.07851, over 21571.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3074, pruned_loss=0.07761, over 4253816.52 frames. 
], batch size: 414, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 03:59:46,905 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2137566.0, ans=0.2 2023-06-26 04:00:08,842 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:00:36,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2137686.0, ans=0.125 2023-06-26 04:00:39,503 INFO [train.py:996] (2/4) Epoch 12, batch 20850, loss[loss=0.1805, simple_loss=0.2513, pruned_loss=0.05487, over 21286.00 frames. ], tot_loss[loss=0.2256, simple_loss=0.2998, pruned_loss=0.07574, over 4255958.98 frames. ], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:01:25,837 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2137866.0, ans=0.07 2023-06-26 04:01:49,783 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.795e+02 7.660e+02 1.067e+03 1.526e+03 3.659e+03, threshold=2.133e+03, percent-clipped=1.0 2023-06-26 04:01:55,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2137926.0, ans=0.125 2023-06-26 04:02:07,019 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.min_positive, batch_count=2137986.0, ans=0.05 2023-06-26 04:02:10,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.72 vs. limit=15.0 2023-06-26 04:02:23,706 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2137986.0, ans=0.125 2023-06-26 04:02:27,170 INFO [train.py:996] (2/4) Epoch 12, batch 20900, loss[loss=0.3152, simple_loss=0.3752, pruned_loss=0.1277, over 21654.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.3007, pruned_loss=0.07681, over 4264018.58 frames. ], batch size: 509, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:02:34,757 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=4.68 vs. limit=6.0 2023-06-26 04:02:41,766 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer1.prob, batch_count=2138046.0, ans=0.125 2023-06-26 04:02:43,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2138046.0, ans=0.1 2023-06-26 04:04:06,626 INFO [train.py:996] (2/4) Epoch 12, batch 20950, loss[loss=0.1778, simple_loss=0.2397, pruned_loss=0.05794, over 16512.00 frames. ], tot_loss[loss=0.2211, simple_loss=0.296, pruned_loss=0.07314, over 4246464.21 frames. 
], batch size: 63, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:04:22,611 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer_ff3.min_abs, batch_count=2138346.0, ans=0.2 2023-06-26 04:05:18,159 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.308e+02 8.469e+02 1.535e+03 2.193e+03 7.053e+03, threshold=3.069e+03, percent-clipped=28.0 2023-06-26 04:05:46,509 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2138586.0, ans=0.0 2023-06-26 04:05:53,932 INFO [train.py:996] (2/4) Epoch 12, batch 21000, loss[loss=0.1775, simple_loss=0.2448, pruned_loss=0.0551, over 18234.00 frames. ], tot_loss[loss=0.2203, simple_loss=0.2941, pruned_loss=0.07323, over 4249367.22 frames. ], batch size: 70, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:05:53,932 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 04:06:16,510 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2617, simple_loss=0.359, pruned_loss=0.08218, over 1796401.00 frames. 2023-06-26 04:06:16,511 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-26 04:06:31,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2138646.0, ans=0.0 2023-06-26 04:06:44,429 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2138706.0, ans=0.0 2023-06-26 04:06:47,939 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2138706.0, ans=0.125 2023-06-26 04:07:27,879 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2138826.0, ans=0.0 2023-06-26 04:07:28,003 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2138826.0, ans=0.125 2023-06-26 04:07:39,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.52 vs. limit=10.0 2023-06-26 04:07:48,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2138886.0, ans=0.1 2023-06-26 04:07:52,145 INFO [train.py:996] (2/4) Epoch 12, batch 21050, loss[loss=0.2597, simple_loss=0.304, pruned_loss=0.1077, over 21406.00 frames. ], tot_loss[loss=0.2206, simple_loss=0.2934, pruned_loss=0.07393, over 4247459.80 frames. ], batch size: 509, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:08:19,256 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=2139006.0, ans=0.0 2023-06-26 04:08:22,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2139006.0, ans=0.0 2023-06-26 04:08:31,725 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.65 vs. 
limit=15.0 2023-06-26 04:08:46,811 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2139066.0, ans=0.125 2023-06-26 04:08:56,407 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.279e+02 7.328e+02 1.060e+03 1.380e+03 3.297e+03, threshold=2.119e+03, percent-clipped=1.0 2023-06-26 04:09:25,810 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=8.63 vs. limit=15.0 2023-06-26 04:09:37,024 INFO [train.py:996] (2/4) Epoch 12, batch 21100, loss[loss=0.1943, simple_loss=0.2617, pruned_loss=0.06351, over 21427.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2902, pruned_loss=0.07397, over 4257617.22 frames. ], batch size: 131, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:09:37,478 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2139246.0, ans=0.125 2023-06-26 04:09:37,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2139246.0, ans=0.125 2023-06-26 04:09:51,355 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.78 vs. limit=6.0 2023-06-26 04:10:28,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=3.69 vs. limit=10.0 2023-06-26 04:10:31,901 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2139366.0, ans=0.0 2023-06-26 04:10:38,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2139426.0, ans=0.0 2023-06-26 04:11:22,949 INFO [train.py:996] (2/4) Epoch 12, batch 21150, loss[loss=0.1822, simple_loss=0.2482, pruned_loss=0.05809, over 21532.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2863, pruned_loss=0.07412, over 4261342.58 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 8.0 2023-06-26 04:11:38,414 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.98 vs. limit=22.5 2023-06-26 04:11:39,625 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.30 vs. 
limit=15.0 2023-06-26 04:11:59,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2139606.0, ans=0.1 2023-06-26 04:12:27,399 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2139726.0, ans=0.125 2023-06-26 04:12:28,519 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.534e+02 8.733e+02 1.101e+03 1.484e+03 2.918e+03, threshold=2.203e+03, percent-clipped=8.0 2023-06-26 04:12:41,152 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2139726.0, ans=0.2 2023-06-26 04:12:50,969 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2139786.0, ans=0.2 2023-06-26 04:12:52,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2139786.0, ans=0.1 2023-06-26 04:13:08,700 INFO [train.py:996] (2/4) Epoch 12, batch 21200, loss[loss=0.2131, simple_loss=0.2796, pruned_loss=0.07329, over 21555.00 frames. ], tot_loss[loss=0.2157, simple_loss=0.2854, pruned_loss=0.073, over 4250081.78 frames. ], batch size: 414, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:13:35,690 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2139906.0, ans=0.1 2023-06-26 04:14:13,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2140026.0, ans=0.125 2023-06-26 04:14:23,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2140026.0, ans=0.0 2023-06-26 04:14:51,533 INFO [train.py:996] (2/4) Epoch 12, batch 21250, loss[loss=0.2067, simple_loss=0.2764, pruned_loss=0.06846, over 21169.00 frames. ], tot_loss[loss=0.2137, simple_loss=0.2825, pruned_loss=0.07248, over 4249761.26 frames. ], batch size: 143, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:15:02,626 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2140146.0, ans=0.125 2023-06-26 04:15:30,236 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2140206.0, ans=0.125 2023-06-26 04:15:32,630 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2140206.0, ans=0.125 2023-06-26 04:15:56,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2140266.0, ans=0.0 2023-06-26 04:16:01,724 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:16:07,977 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.686e+02 9.766e+02 1.414e+03 1.888e+03 3.901e+03, threshold=2.827e+03, percent-clipped=19.0 2023-06-26 04:16:18,050 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.93 vs. limit=15.0 2023-06-26 04:16:39,736 INFO [train.py:996] (2/4) Epoch 12, batch 21300, loss[loss=0.2308, simple_loss=0.2956, pruned_loss=0.08301, over 21318.00 frames. ], tot_loss[loss=0.2226, simple_loss=0.2924, pruned_loss=0.07639, over 4258957.63 frames. 
], batch size: 176, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:16:43,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2140446.0, ans=0.125 2023-06-26 04:17:59,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2140626.0, ans=0.07 2023-06-26 04:18:37,773 INFO [train.py:996] (2/4) Epoch 12, batch 21350, loss[loss=0.2027, simple_loss=0.2764, pruned_loss=0.06449, over 21722.00 frames. ], tot_loss[loss=0.2236, simple_loss=0.295, pruned_loss=0.07613, over 4262067.36 frames. ], batch size: 112, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:19:09,113 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.98 vs. limit=15.0 2023-06-26 04:19:52,126 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.340e+02 8.083e+02 1.206e+03 1.998e+03 5.884e+03, threshold=2.412e+03, percent-clipped=11.0 2023-06-26 04:20:34,829 INFO [train.py:996] (2/4) Epoch 12, batch 21400, loss[loss=0.2447, simple_loss=0.3271, pruned_loss=0.08118, over 20674.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2976, pruned_loss=0.07507, over 4261385.12 frames. ], batch size: 607, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:21:12,607 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn1.whiten, num_groups=1, num_channels=256, metric=10.50 vs. limit=22.5 2023-06-26 04:21:51,574 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.96 vs. limit=6.0 2023-06-26 04:22:23,067 INFO [train.py:996] (2/4) Epoch 12, batch 21450, loss[loss=0.2292, simple_loss=0.293, pruned_loss=0.0827, over 21292.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3002, pruned_loss=0.07645, over 4264951.72 frames. ], batch size: 159, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:22:52,475 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2141406.0, ans=0.2 2023-06-26 04:23:04,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2141466.0, ans=0.125 2023-06-26 04:23:05,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2141466.0, ans=0.125 2023-06-26 04:23:28,959 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.921e+02 8.614e+02 1.088e+03 1.616e+03 2.799e+03, threshold=2.175e+03, percent-clipped=3.0 2023-06-26 04:24:01,745 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2141586.0, ans=0.0 2023-06-26 04:24:06,903 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2141586.0, ans=0.125 2023-06-26 04:24:08,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.skip_rate, batch_count=2141586.0, ans=0.07 2023-06-26 04:24:11,501 INFO [train.py:996] (2/4) Epoch 12, batch 21500, loss[loss=0.2331, simple_loss=0.2856, pruned_loss=0.09034, over 21314.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.2977, pruned_loss=0.07756, over 4262117.66 frames. 
], batch size: 473, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:24:11,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2141646.0, ans=0.0 2023-06-26 04:25:14,060 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer2.prob, batch_count=2141826.0, ans=0.125 2023-06-26 04:25:23,696 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:25:51,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2141886.0, ans=0.125 2023-06-26 04:25:58,242 INFO [train.py:996] (2/4) Epoch 12, batch 21550, loss[loss=0.2111, simple_loss=0.2756, pruned_loss=0.07324, over 21793.00 frames. ], tot_loss[loss=0.2213, simple_loss=0.2914, pruned_loss=0.07555, over 4252775.73 frames. ], batch size: 317, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:26:31,477 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 04:26:59,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2142066.0, ans=0.0 2023-06-26 04:27:09,921 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.662e+02 8.216e+02 1.112e+03 1.423e+03 3.148e+03, threshold=2.223e+03, percent-clipped=7.0 2023-06-26 04:27:36,406 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn1.whiten, num_groups=1, num_channels=192, metric=11.52 vs. limit=22.5 2023-06-26 04:27:46,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2142186.0, ans=0.1 2023-06-26 04:27:50,971 INFO [train.py:996] (2/4) Epoch 12, batch 21600, loss[loss=0.2448, simple_loss=0.2938, pruned_loss=0.09787, over 21342.00 frames. ], tot_loss[loss=0.2173, simple_loss=0.2868, pruned_loss=0.07393, over 4260781.74 frames. ], batch size: 473, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:28:39,487 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff2_skip_rate, batch_count=2142366.0, ans=0.0 2023-06-26 04:29:17,232 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=7.71 vs. limit=15.0 2023-06-26 04:29:19,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.balancer2.prob, batch_count=2142426.0, ans=0.125 2023-06-26 04:29:43,392 INFO [train.py:996] (2/4) Epoch 12, batch 21650, loss[loss=0.2031, simple_loss=0.3017, pruned_loss=0.05226, over 21650.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.2917, pruned_loss=0.07172, over 4257175.27 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:30:41,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2142666.0, ans=0.0 2023-06-26 04:30:42,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2142666.0, ans=0.0 2023-06-26 04:30:56,293 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.299e+02 7.832e+02 1.347e+03 2.277e+03 5.515e+03, threshold=2.694e+03, percent-clipped=27.0 2023-06-26 04:31:30,512 INFO [train.py:996] (2/4) Epoch 12, batch 21700, loss[loss=0.237, simple_loss=0.2931, pruned_loss=0.09051, over 21251.00 frames. 
], tot_loss[loss=0.217, simple_loss=0.2931, pruned_loss=0.07044, over 4260370.75 frames. ], batch size: 471, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:33:08,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2143086.0, ans=0.125 2023-06-26 04:33:09,408 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.74 vs. limit=15.0 2023-06-26 04:33:12,688 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2143086.0, ans=0.2 2023-06-26 04:33:20,598 INFO [train.py:996] (2/4) Epoch 12, batch 21750, loss[loss=0.221, simple_loss=0.2768, pruned_loss=0.08258, over 21387.00 frames. ], tot_loss[loss=0.2152, simple_loss=0.2888, pruned_loss=0.07084, over 4252015.12 frames. ], batch size: 195, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:33:21,100 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2143146.0, ans=0.0 2023-06-26 04:33:22,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2143146.0, ans=0.125 2023-06-26 04:33:51,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2143206.0, ans=0.125 2023-06-26 04:33:54,983 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2143206.0, ans=0.1 2023-06-26 04:34:30,623 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.472e+02 7.650e+02 1.064e+03 1.573e+03 4.038e+03, threshold=2.129e+03, percent-clipped=2.0 2023-06-26 04:34:35,755 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.20 vs. limit=15.0 2023-06-26 04:34:36,015 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=10.95 vs. limit=15.0 2023-06-26 04:35:12,909 INFO [train.py:996] (2/4) Epoch 12, batch 21800, loss[loss=0.2396, simple_loss=0.3234, pruned_loss=0.07792, over 21658.00 frames. ], tot_loss[loss=0.2147, simple_loss=0.2864, pruned_loss=0.07152, over 4260433.18 frames. ], batch size: 391, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:35:38,721 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.60 vs. limit=22.5 2023-06-26 04:35:54,388 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=6.84 vs. limit=15.0 2023-06-26 04:36:00,693 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2143566.0, ans=0.0 2023-06-26 04:36:20,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2143626.0, ans=0.125 2023-06-26 04:36:27,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2143626.0, ans=0.0 2023-06-26 04:37:09,675 INFO [train.py:996] (2/4) Epoch 12, batch 21850, loss[loss=0.2057, simple_loss=0.2897, pruned_loss=0.06088, over 20996.00 frames. ], tot_loss[loss=0.2177, simple_loss=0.2922, pruned_loss=0.07154, over 4259695.50 frames. 
], batch size: 607, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:37:55,582 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2143866.0, ans=0.1 2023-06-26 04:38:16,239 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2143926.0, ans=0.125 2023-06-26 04:38:17,399 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.530e+02 9.705e+02 1.349e+03 2.023e+03 4.101e+03, threshold=2.697e+03, percent-clipped=20.0 2023-06-26 04:38:59,624 INFO [train.py:996] (2/4) Epoch 12, batch 21900, loss[loss=0.1868, simple_loss=0.2519, pruned_loss=0.06082, over 21700.00 frames. ], tot_loss[loss=0.2191, simple_loss=0.2926, pruned_loss=0.07277, over 4265384.57 frames. ], batch size: 264, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:38:59,992 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2144046.0, ans=0.125 2023-06-26 04:39:10,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2144046.0, ans=0.1 2023-06-26 04:39:12,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2144046.0, ans=0.025 2023-06-26 04:39:15,631 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2144106.0, ans=0.125 2023-06-26 04:39:18,959 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2144106.0, ans=0.2 2023-06-26 04:40:02,540 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=2144226.0, ans=0.0 2023-06-26 04:40:23,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2144286.0, ans=0.2 2023-06-26 04:40:28,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.03 vs. limit=15.0 2023-06-26 04:40:35,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2144286.0, ans=0.1 2023-06-26 04:40:41,114 INFO [train.py:996] (2/4) Epoch 12, batch 21950, loss[loss=0.2023, simple_loss=0.259, pruned_loss=0.0728, over 21840.00 frames. ], tot_loss[loss=0.216, simple_loss=0.2875, pruned_loss=0.07225, over 4270505.40 frames. 
], batch size: 107, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:40:50,603 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2144346.0, ans=0.2 2023-06-26 04:40:54,007 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2144346.0, ans=0.125 2023-06-26 04:41:28,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff3_skip_rate, batch_count=2144466.0, ans=0.0 2023-06-26 04:41:28,515 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2144466.0, ans=0.125 2023-06-26 04:41:55,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2144526.0, ans=10.0 2023-06-26 04:41:57,275 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.471e+02 7.504e+02 1.022e+03 1.599e+03 3.109e+03, threshold=2.043e+03, percent-clipped=2.0 2023-06-26 04:42:05,026 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2144526.0, ans=0.0 2023-06-26 04:42:18,881 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2144586.0, ans=0.2 2023-06-26 04:42:33,001 INFO [train.py:996] (2/4) Epoch 12, batch 22000, loss[loss=0.1895, simple_loss=0.2596, pruned_loss=0.0597, over 21622.00 frames. ], tot_loss[loss=0.2097, simple_loss=0.2819, pruned_loss=0.06877, over 4250824.47 frames. ], batch size: 298, lr: 2.40e-03, grad_scale: 32.0 2023-06-26 04:42:51,830 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2144646.0, ans=0.5 2023-06-26 04:43:49,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2144826.0, ans=0.1 2023-06-26 04:44:28,354 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=9.90 vs. limit=15.0 2023-06-26 04:44:30,617 INFO [train.py:996] (2/4) Epoch 12, batch 22050, loss[loss=0.277, simple_loss=0.3403, pruned_loss=0.1069, over 21256.00 frames. ], tot_loss[loss=0.2168, simple_loss=0.2896, pruned_loss=0.07197, over 4243193.60 frames. ], batch size: 159, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:44:32,033 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=11.22 vs. limit=15.0 2023-06-26 04:44:52,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2145006.0, ans=0.125 2023-06-26 04:45:47,961 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.424e+02 9.779e+02 1.604e+03 2.153e+03 5.995e+03, threshold=3.207e+03, percent-clipped=28.0 2023-06-26 04:45:55,393 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.29 vs. limit=22.5 2023-06-26 04:46:16,991 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2145186.0, ans=0.125 2023-06-26 04:46:21,811 INFO [train.py:996] (2/4) Epoch 12, batch 22100, loss[loss=0.2002, simple_loss=0.2804, pruned_loss=0.05993, over 21960.00 frames. 
], tot_loss[loss=0.2271, simple_loss=0.3005, pruned_loss=0.07682, over 4245226.46 frames. ], batch size: 316, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:47:18,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2145366.0, ans=0.2 2023-06-26 04:48:11,849 INFO [train.py:996] (2/4) Epoch 12, batch 22150, loss[loss=0.2533, simple_loss=0.3174, pruned_loss=0.09464, over 21766.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3016, pruned_loss=0.07793, over 4258041.03 frames. ], batch size: 441, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:48:21,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2145546.0, ans=0.1 2023-06-26 04:48:42,558 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.59 vs. limit=15.0 2023-06-26 04:49:27,202 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.601e+02 7.424e+02 1.002e+03 1.530e+03 2.924e+03, threshold=2.004e+03, percent-clipped=0.0 2023-06-26 04:50:01,396 INFO [train.py:996] (2/4) Epoch 12, batch 22200, loss[loss=0.2031, simple_loss=0.2736, pruned_loss=0.06635, over 21699.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.302, pruned_loss=0.07858, over 4274909.64 frames. ], batch size: 263, lr: 2.40e-03, grad_scale: 16.0 2023-06-26 04:50:09,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2145846.0, ans=0.035 2023-06-26 04:50:37,051 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2145906.0, ans=0.125 2023-06-26 04:51:25,252 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.21 vs. limit=6.0 2023-06-26 04:51:52,886 INFO [train.py:996] (2/4) Epoch 12, batch 22250, loss[loss=0.2387, simple_loss=0.3179, pruned_loss=0.07972, over 21668.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.3089, pruned_loss=0.08023, over 4273290.05 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:52:14,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2146146.0, ans=0.1 2023-06-26 04:52:27,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2146206.0, ans=0.2 2023-06-26 04:53:05,015 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2146326.0, ans=0.0 2023-06-26 04:53:07,424 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.98 vs. limit=15.0 2023-06-26 04:53:10,014 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.929e+02 1.009e+03 1.467e+03 2.182e+03 5.502e+03, threshold=2.934e+03, percent-clipped=31.0 2023-06-26 04:53:29,689 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.84 vs. 
limit=15.0 2023-06-26 04:53:41,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2146446.0, ans=0.125 2023-06-26 04:53:42,558 INFO [train.py:996] (2/4) Epoch 12, batch 22300, loss[loss=0.2828, simple_loss=0.3287, pruned_loss=0.1184, over 21763.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3126, pruned_loss=0.08225, over 4274377.78 frames. ], batch size: 508, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:53:43,519 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.27 vs. limit=12.0 2023-06-26 04:53:53,163 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2146446.0, ans=0.0 2023-06-26 04:54:16,097 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2146506.0, ans=0.125 2023-06-26 04:55:03,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.53 vs. limit=12.0 2023-06-26 04:55:28,428 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2146686.0, ans=0.2 2023-06-26 04:55:32,942 INFO [train.py:996] (2/4) Epoch 12, batch 22350, loss[loss=0.2383, simple_loss=0.3, pruned_loss=0.08836, over 21452.00 frames. ], tot_loss[loss=0.2383, simple_loss=0.3107, pruned_loss=0.08297, over 4282420.90 frames. ], batch size: 177, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:55:33,263 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2146746.0, ans=0.125 2023-06-26 04:56:25,083 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.15 vs. limit=12.0 2023-06-26 04:56:33,699 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2146866.0, ans=0.125 2023-06-26 04:56:57,061 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.431e+02 9.155e+02 1.132e+03 1.676e+03 3.302e+03, threshold=2.265e+03, percent-clipped=3.0 2023-06-26 04:57:06,281 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2146986.0, ans=0.125 2023-06-26 04:57:23,240 INFO [train.py:996] (2/4) Epoch 12, batch 22400, loss[loss=0.2088, simple_loss=0.2956, pruned_loss=0.061, over 21764.00 frames. ], tot_loss[loss=0.2327, simple_loss=0.3077, pruned_loss=0.07889, over 4263452.05 frames. 
], batch size: 351, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:57:58,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=2147106.0, ans=0.2 2023-06-26 04:58:09,132 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2147106.0, ans=0.07 2023-06-26 04:58:10,809 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.attention_skip_rate, batch_count=2147106.0, ans=0.0 2023-06-26 04:58:19,604 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2147166.0, ans=0.125 2023-06-26 04:58:31,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2147226.0, ans=0.2 2023-06-26 04:58:33,674 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.75 vs. limit=15.0 2023-06-26 04:58:34,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2147226.0, ans=0.0 2023-06-26 04:59:01,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2147286.0, ans=0.125 2023-06-26 04:59:03,452 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2147286.0, ans=0.1 2023-06-26 04:59:13,357 INFO [train.py:996] (2/4) Epoch 12, batch 22450, loss[loss=0.2589, simple_loss=0.2974, pruned_loss=0.1102, over 21398.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3012, pruned_loss=0.07811, over 4257049.66 frames. ], batch size: 509, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 04:59:32,730 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=12.41 vs. limit=22.5 2023-06-26 05:00:36,584 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.29 vs. limit=15.0 2023-06-26 05:00:37,265 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.261e+02 8.888e+02 1.140e+03 1.642e+03 4.602e+03, threshold=2.279e+03, percent-clipped=11.0 2023-06-26 05:01:13,412 INFO [train.py:996] (2/4) Epoch 12, batch 22500, loss[loss=0.1912, simple_loss=0.2401, pruned_loss=0.07115, over 20714.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2955, pruned_loss=0.07728, over 4268766.46 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:01:34,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2147706.0, ans=0.125 2023-06-26 05:01:40,042 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_positive, batch_count=2147706.0, ans=0.05 2023-06-26 05:02:28,005 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.08 vs. 
limit=15.0 2023-06-26 05:02:52,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer2.prob, batch_count=2147886.0, ans=0.125 2023-06-26 05:02:52,284 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2147886.0, ans=0.2 2023-06-26 05:03:04,502 INFO [train.py:996] (2/4) Epoch 12, batch 22550, loss[loss=0.2378, simple_loss=0.3065, pruned_loss=0.08458, over 21549.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.301, pruned_loss=0.07783, over 4273558.45 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:03:23,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2147946.0, ans=0.035 2023-06-26 05:03:39,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.max_positive, batch_count=2148006.0, ans=0.95 2023-06-26 05:03:57,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2148066.0, ans=0.0 2023-06-26 05:04:03,473 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.03 vs. limit=10.0 2023-06-26 05:04:13,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2148126.0, ans=0.025 2023-06-26 05:04:18,092 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.902e+02 9.532e+02 1.362e+03 1.911e+03 4.517e+03, threshold=2.723e+03, percent-clipped=17.0 2023-06-26 05:04:38,491 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2148186.0, ans=0.2 2023-06-26 05:04:43,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.prob, batch_count=2148186.0, ans=0.125 2023-06-26 05:05:00,743 INFO [train.py:996] (2/4) Epoch 12, batch 22600, loss[loss=0.2203, simple_loss=0.2858, pruned_loss=0.07735, over 21624.00 frames. ], tot_loss[loss=0.2296, simple_loss=0.3026, pruned_loss=0.07831, over 4274175.56 frames. ], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:05:15,382 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2148246.0, ans=0.1 2023-06-26 05:05:27,579 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2148306.0, ans=0.0 2023-06-26 05:05:53,894 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=12.0 2023-06-26 05:06:42,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2148486.0, ans=0.0 2023-06-26 05:06:45,103 INFO [train.py:996] (2/4) Epoch 12, batch 22650, loss[loss=0.2209, simple_loss=0.2776, pruned_loss=0.08209, over 21762.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3005, pruned_loss=0.0781, over 4273330.15 frames. 
], batch size: 112, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:07:02,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2148606.0, ans=0.125 2023-06-26 05:07:25,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2148666.0, ans=0.125 2023-06-26 05:07:52,291 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.795e+02 9.798e+02 1.388e+03 1.950e+03 5.687e+03, threshold=2.777e+03, percent-clipped=13.0 2023-06-26 05:08:27,949 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.73 vs. limit=15.0 2023-06-26 05:08:31,900 INFO [train.py:996] (2/4) Epoch 12, batch 22700, loss[loss=0.2497, simple_loss=0.292, pruned_loss=0.1037, over 21400.00 frames. ], tot_loss[loss=0.2242, simple_loss=0.294, pruned_loss=0.07719, over 4270162.12 frames. ], batch size: 509, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:08:58,062 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.29 vs. limit=15.0 2023-06-26 05:08:59,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2148906.0, ans=0.2 2023-06-26 05:09:14,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2148966.0, ans=0.0 2023-06-26 05:09:40,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2149026.0, ans=0.1 2023-06-26 05:10:24,741 INFO [train.py:996] (2/4) Epoch 12, batch 22750, loss[loss=0.2729, simple_loss=0.3363, pruned_loss=0.1048, over 21776.00 frames. ], tot_loss[loss=0.2262, simple_loss=0.2949, pruned_loss=0.07874, over 4265458.76 frames. ], batch size: 332, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:10:58,159 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2149206.0, ans=0.1 2023-06-26 05:11:44,541 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.150e+02 7.737e+02 1.012e+03 1.483e+03 2.915e+03, threshold=2.025e+03, percent-clipped=0.0 2023-06-26 05:12:14,844 INFO [train.py:996] (2/4) Epoch 12, batch 22800, loss[loss=0.22, simple_loss=0.2877, pruned_loss=0.0761, over 21421.00 frames. ], tot_loss[loss=0.23, simple_loss=0.2994, pruned_loss=0.08028, over 4268918.06 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:12:59,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.31 vs. limit=15.0 2023-06-26 05:13:44,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2149686.0, ans=0.2 2023-06-26 05:13:54,088 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:14:03,882 INFO [train.py:996] (2/4) Epoch 12, batch 22850, loss[loss=0.1854, simple_loss=0.2536, pruned_loss=0.05867, over 21511.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.2964, pruned_loss=0.07965, over 4277766.16 frames. 
], batch size: 230, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:14:40,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2149806.0, ans=0.125 2023-06-26 05:14:43,718 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2149866.0, ans=0.025 2023-06-26 05:15:04,346 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2149926.0, ans=0.125 2023-06-26 05:15:15,242 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2149926.0, ans=0.125 2023-06-26 05:15:22,793 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.75 vs. limit=15.0 2023-06-26 05:15:23,328 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.516e+02 9.685e+02 1.570e+03 2.544e+03 4.880e+03, threshold=3.139e+03, percent-clipped=35.0 2023-06-26 05:15:35,209 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2149986.0, ans=0.035 2023-06-26 05:15:53,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten.whitening_limit, batch_count=2150046.0, ans=22.5 2023-06-26 05:15:54,409 INFO [train.py:996] (2/4) Epoch 12, batch 22900, loss[loss=0.2304, simple_loss=0.3343, pruned_loss=0.06325, over 21820.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.2997, pruned_loss=0.07881, over 4274006.88 frames. ], batch size: 371, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:16:39,103 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2150166.0, ans=0.1 2023-06-26 05:17:23,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2150226.0, ans=0.0 2023-06-26 05:17:34,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2150286.0, ans=0.0 2023-06-26 05:17:53,769 INFO [train.py:996] (2/4) Epoch 12, batch 22950, loss[loss=0.2304, simple_loss=0.3325, pruned_loss=0.0641, over 21298.00 frames. ], tot_loss[loss=0.2336, simple_loss=0.3121, pruned_loss=0.07752, over 4273653.74 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:18:46,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2150466.0, ans=0.125 2023-06-26 05:18:47,279 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.75 vs. limit=15.0 2023-06-26 05:18:59,332 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.27 vs. limit=10.0 2023-06-26 05:19:08,727 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=3.55 vs. 
limit=15.0 2023-06-26 05:19:12,499 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.963e+02 8.459e+02 1.367e+03 2.050e+03 4.078e+03, threshold=2.734e+03, percent-clipped=4.0 2023-06-26 05:19:34,401 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2150586.0, ans=0.125 2023-06-26 05:19:42,498 INFO [train.py:996] (2/4) Epoch 12, batch 23000, loss[loss=0.1963, simple_loss=0.2751, pruned_loss=0.05879, over 21790.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3125, pruned_loss=0.07552, over 4269503.77 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:20:53,398 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2150826.0, ans=0.125 2023-06-26 05:21:06,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=6.73 vs. limit=15.0 2023-06-26 05:21:26,351 INFO [train.py:996] (2/4) Epoch 12, batch 23050, loss[loss=0.2263, simple_loss=0.3024, pruned_loss=0.07514, over 20669.00 frames. ], tot_loss[loss=0.2331, simple_loss=0.3122, pruned_loss=0.07704, over 4278279.41 frames. ], batch size: 608, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:22:08,197 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.09 vs. limit=22.5 2023-06-26 05:22:53,287 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2151126.0, ans=0.125 2023-06-26 05:22:54,533 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.118e+02 9.048e+02 1.306e+03 1.787e+03 2.952e+03, threshold=2.611e+03, percent-clipped=5.0 2023-06-26 05:23:19,191 INFO [train.py:996] (2/4) Epoch 12, batch 23100, loss[loss=0.2059, simple_loss=0.2671, pruned_loss=0.07233, over 21782.00 frames. ], tot_loss[loss=0.2319, simple_loss=0.3082, pruned_loss=0.07778, over 4278571.81 frames. ], batch size: 317, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:23:23,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2151246.0, ans=0.125 2023-06-26 05:24:22,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2151366.0, ans=0.125 2023-06-26 05:24:33,139 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2151426.0, ans=0.1 2023-06-26 05:25:08,310 INFO [train.py:996] (2/4) Epoch 12, batch 23150, loss[loss=0.2065, simple_loss=0.2739, pruned_loss=0.06959, over 21567.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3026, pruned_loss=0.07768, over 4273105.65 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:25:51,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2151666.0, ans=0.2 2023-06-26 05:25:55,572 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.09 vs. 
limit=15.0 2023-06-26 05:26:25,822 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.610e+02 9.726e+02 1.363e+03 3.124e+03, threshold=1.945e+03, percent-clipped=3.0 2023-06-26 05:26:26,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2151726.0, ans=0.04949747468305833 2023-06-26 05:26:55,438 INFO [train.py:996] (2/4) Epoch 12, batch 23200, loss[loss=0.2764, simple_loss=0.3284, pruned_loss=0.1122, over 21790.00 frames. ], tot_loss[loss=0.2298, simple_loss=0.3021, pruned_loss=0.07878, over 4283828.82 frames. ], batch size: 508, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 05:27:01,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2151846.0, ans=0.1 2023-06-26 05:27:10,111 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-26 05:27:12,170 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. limit=8.0 2023-06-26 05:27:38,872 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2151966.0, ans=0.125 2023-06-26 05:28:38,825 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2152086.0, ans=0.125 2023-06-26 05:28:41,961 INFO [train.py:996] (2/4) Epoch 12, batch 23250, loss[loss=0.1914, simple_loss=0.2529, pruned_loss=0.065, over 21238.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3007, pruned_loss=0.07884, over 4288048.37 frames. ], batch size: 608, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:29:38,799 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2152266.0, ans=0.125 2023-06-26 05:29:52,931 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2152326.0, ans=0.0 2023-06-26 05:30:12,612 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.710e+02 8.107e+02 1.074e+03 1.814e+03 3.794e+03, threshold=2.148e+03, percent-clipped=19.0 2023-06-26 05:30:13,347 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2152326.0, ans=0.125 2023-06-26 05:30:26,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.scale_min, batch_count=2152386.0, ans=0.2 2023-06-26 05:30:35,141 INFO [train.py:996] (2/4) Epoch 12, batch 23300, loss[loss=0.2963, simple_loss=0.3998, pruned_loss=0.09641, over 21847.00 frames. ], tot_loss[loss=0.2344, simple_loss=0.3078, pruned_loss=0.08052, over 4283588.93 frames. ], batch size: 371, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:30:39,420 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.18 vs. 
limit=15.0 2023-06-26 05:30:40,946 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2152446.0, ans=0.1 2023-06-26 05:31:05,648 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.5.prob, batch_count=2152506.0, ans=0.125 2023-06-26 05:31:48,647 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2152626.0, ans=0.0 2023-06-26 05:31:48,672 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2152626.0, ans=0.0 2023-06-26 05:31:54,360 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.attention_skip_rate, batch_count=2152626.0, ans=0.0 2023-06-26 05:31:57,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2152626.0, ans=0.1 2023-06-26 05:31:57,575 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2152626.0, ans=0.125 2023-06-26 05:32:31,087 INFO [train.py:996] (2/4) Epoch 12, batch 23350, loss[loss=0.2005, simple_loss=0.2912, pruned_loss=0.05488, over 21615.00 frames. ], tot_loss[loss=0.2347, simple_loss=0.311, pruned_loss=0.07918, over 4272296.73 frames. ], batch size: 389, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:33:02,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.ff2_skip_rate, batch_count=2152806.0, ans=0.0 2023-06-26 05:33:09,192 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.out_combiner.scale_min, batch_count=2152806.0, ans=0.2 2023-06-26 05:33:54,022 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.865e+02 9.116e+02 1.311e+03 1.870e+03 4.347e+03, threshold=2.623e+03, percent-clipped=16.0 2023-06-26 05:34:08,553 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2152986.0, ans=0.0 2023-06-26 05:34:11,960 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2152986.0, ans=0.125 2023-06-26 05:34:21,110 INFO [train.py:996] (2/4) Epoch 12, batch 23400, loss[loss=0.1852, simple_loss=0.2902, pruned_loss=0.04011, over 20747.00 frames. ], tot_loss[loss=0.2292, simple_loss=0.3061, pruned_loss=0.07616, over 4274208.41 frames. ], batch size: 607, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:34:59,934 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.82 vs. limit=22.5 2023-06-26 05:35:39,255 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2153226.0, ans=0.0 2023-06-26 05:35:45,579 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.32 vs. limit=12.0 2023-06-26 05:36:11,446 INFO [train.py:996] (2/4) Epoch 12, batch 23450, loss[loss=0.2597, simple_loss=0.3177, pruned_loss=0.1009, over 21337.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3065, pruned_loss=0.07801, over 4275780.73 frames. 
], batch size: 176, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:36:32,858 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:36:46,773 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2153406.0, ans=0.125 2023-06-26 05:37:31,632 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.599e+02 8.250e+02 1.125e+03 1.414e+03 2.942e+03, threshold=2.251e+03, percent-clipped=1.0 2023-06-26 05:37:47,775 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=13.59 vs. limit=15.0 2023-06-26 05:38:03,791 INFO [train.py:996] (2/4) Epoch 12, batch 23500, loss[loss=0.2119, simple_loss=0.2844, pruned_loss=0.06974, over 21937.00 frames. ], tot_loss[loss=0.2324, simple_loss=0.3064, pruned_loss=0.07922, over 4269482.31 frames. ], batch size: 316, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:38:24,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2153706.0, ans=0.1 2023-06-26 05:38:43,753 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=5.94 vs. limit=15.0 2023-06-26 05:38:43,759 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.26 vs. limit=12.0 2023-06-26 05:38:46,952 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2153706.0, ans=15.0 2023-06-26 05:39:52,418 INFO [train.py:996] (2/4) Epoch 12, batch 23550, loss[loss=0.193, simple_loss=0.2532, pruned_loss=0.06641, over 21643.00 frames. ], tot_loss[loss=0.2303, simple_loss=0.3028, pruned_loss=0.07893, over 4260968.60 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:40:23,218 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2154006.0, ans=0.125 2023-06-26 05:40:47,899 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=5.87 vs. limit=15.0 2023-06-26 05:41:04,580 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:41:09,209 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.684e+02 8.231e+02 1.204e+03 1.979e+03 6.408e+03, threshold=2.407e+03, percent-clipped=19.0 2023-06-26 05:41:26,893 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:41:48,755 INFO [train.py:996] (2/4) Epoch 12, batch 23600, loss[loss=0.2456, simple_loss=0.302, pruned_loss=0.09454, over 21823.00 frames. ], tot_loss[loss=0.2309, simple_loss=0.3034, pruned_loss=0.07923, over 4261516.64 frames. ], batch size: 98, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 05:41:53,297 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.54 vs. 
limit=12.0 2023-06-26 05:42:49,954 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2154426.0, ans=0.2 2023-06-26 05:43:33,950 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2154486.0, ans=0.125 2023-06-26 05:43:40,740 INFO [train.py:996] (2/4) Epoch 12, batch 23650, loss[loss=0.2618, simple_loss=0.3326, pruned_loss=0.09549, over 21297.00 frames. ], tot_loss[loss=0.2287, simple_loss=0.3032, pruned_loss=0.0771, over 4269644.17 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:44:05,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2154606.0, ans=0.035 2023-06-26 05:44:07,667 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2154606.0, ans=0.07 2023-06-26 05:44:29,427 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2154666.0, ans=0.125 2023-06-26 05:44:54,396 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.05 vs. limit=6.0 2023-06-26 05:45:11,537 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.083e+02 1.007e+03 1.364e+03 1.897e+03 4.250e+03, threshold=2.728e+03, percent-clipped=14.0 2023-06-26 05:45:27,577 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2154786.0, ans=0.0 2023-06-26 05:45:36,480 INFO [train.py:996] (2/4) Epoch 12, batch 23700, loss[loss=0.2664, simple_loss=0.3415, pruned_loss=0.09561, over 21425.00 frames. ], tot_loss[loss=0.2307, simple_loss=0.3061, pruned_loss=0.07763, over 4265392.64 frames. ], batch size: 471, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:45:45,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2154846.0, ans=0.0 2023-06-26 05:47:26,987 INFO [train.py:996] (2/4) Epoch 12, batch 23750, loss[loss=0.1757, simple_loss=0.2692, pruned_loss=0.04112, over 21420.00 frames. ], tot_loss[loss=0.2334, simple_loss=0.3094, pruned_loss=0.0787, over 4272469.38 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:47:29,774 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 05:47:38,895 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2155146.0, ans=0.125 2023-06-26 05:47:54,689 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2155206.0, ans=0.125 2023-06-26 05:48:55,966 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 9.334e+02 1.253e+03 1.718e+03 3.362e+03, threshold=2.506e+03, percent-clipped=5.0 2023-06-26 05:49:21,860 INFO [train.py:996] (2/4) Epoch 12, batch 23800, loss[loss=0.2976, simple_loss=0.3847, pruned_loss=0.1053, over 21585.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3084, pruned_loss=0.07715, over 4273120.47 frames. 
], batch size: 441, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:49:22,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2155446.0, ans=0.125 2023-06-26 05:49:48,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2155506.0, ans=0.0 2023-06-26 05:50:13,900 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.38 vs. limit=22.5 2023-06-26 05:50:29,599 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2155566.0, ans=0.125 2023-06-26 05:50:46,238 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.out_combiner.scale_min, batch_count=2155626.0, ans=0.2 2023-06-26 05:51:03,272 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.27 vs. limit=6.0 2023-06-26 05:51:16,631 INFO [train.py:996] (2/4) Epoch 12, batch 23850, loss[loss=0.2237, simple_loss=0.3074, pruned_loss=0.06997, over 21811.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3174, pruned_loss=0.07955, over 4275745.52 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:52:36,274 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2155926.0, ans=0.125 2023-06-26 05:52:49,998 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.758e+02 1.136e+03 1.834e+03 2.559e+03 6.160e+03, threshold=3.668e+03, percent-clipped=28.0 2023-06-26 05:53:14,987 INFO [train.py:996] (2/4) Epoch 12, batch 23900, loss[loss=0.2273, simple_loss=0.3057, pruned_loss=0.07443, over 21720.00 frames. ], tot_loss[loss=0.243, simple_loss=0.3235, pruned_loss=0.0813, over 4279011.14 frames. ], batch size: 282, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:53:55,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2156106.0, ans=0.07 2023-06-26 05:54:07,796 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.dropout.p, batch_count=2156166.0, ans=0.1 2023-06-26 05:54:49,709 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.53 vs. limit=5.0 2023-06-26 05:55:04,333 INFO [train.py:996] (2/4) Epoch 12, batch 23950, loss[loss=0.1995, simple_loss=0.2615, pruned_loss=0.06873, over 21277.00 frames. ], tot_loss[loss=0.2386, simple_loss=0.3158, pruned_loss=0.08077, over 4278878.70 frames. ], batch size: 177, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 05:55:30,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2156346.0, ans=0.2 2023-06-26 05:56:32,156 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.839e+02 7.798e+02 1.081e+03 1.467e+03 3.120e+03, threshold=2.162e+03, percent-clipped=0.0 2023-06-26 05:57:07,628 INFO [train.py:996] (2/4) Epoch 12, batch 24000, loss[loss=0.2582, simple_loss=0.3247, pruned_loss=0.09587, over 21977.00 frames. ], tot_loss[loss=0.2425, simple_loss=0.3172, pruned_loss=0.0839, over 4280247.94 frames. 
], batch size: 317, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:57:07,629 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 05:57:18,482 INFO [zipformer.py:1728] (2/4) name=encoder.encoders.3.encoder.layers.0.self_attn_weights, attn_weights_entropy = tensor([1.6244, 1.5557, 2.3350, 2.1212, 1.3108, 2.3825, 2.3516, 1.2575], device='cuda:2') 2023-06-26 05:57:25,651 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2659, simple_loss=0.36, pruned_loss=0.08593, over 1796401.00 frames. 2023-06-26 05:57:25,652 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-26 05:57:32,704 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-26 05:57:55,042 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=9.10 vs. limit=10.0 2023-06-26 05:57:55,867 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2156706.0, ans=0.0 2023-06-26 05:58:00,864 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2156706.0, ans=0.0 2023-06-26 05:58:11,605 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2156766.0, ans=0.125 2023-06-26 05:59:10,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2156886.0, ans=0.125 2023-06-26 05:59:13,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2156946.0, ans=0.125 2023-06-26 05:59:13,389 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2156946.0, ans=0.5 2023-06-26 05:59:15,059 INFO [train.py:996] (2/4) Epoch 12, batch 24050, loss[loss=0.2098, simple_loss=0.295, pruned_loss=0.06232, over 21492.00 frames. ], tot_loss[loss=0.2447, simple_loss=0.3199, pruned_loss=0.08476, over 4284597.84 frames. ], batch size: 211, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 05:59:57,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.max_abs, batch_count=2157066.0, ans=10.0 2023-06-26 06:00:41,096 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2157126.0, ans=0.0 2023-06-26 06:00:45,857 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.641e+02 9.752e+02 1.281e+03 1.752e+03 4.034e+03, threshold=2.563e+03, percent-clipped=15.0 2023-06-26 06:01:00,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2157186.0, ans=0.0 2023-06-26 06:01:05,425 INFO [train.py:996] (2/4) Epoch 12, batch 24100, loss[loss=0.2491, simple_loss=0.3257, pruned_loss=0.08619, over 21423.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3185, pruned_loss=0.08285, over 4283993.92 frames. 
], batch size: 211, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:01:06,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2157246.0, ans=0.125 2023-06-26 06:01:44,500 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2157306.0, ans=0.125 2023-06-26 06:02:03,404 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer2.prob, batch_count=2157366.0, ans=0.125 2023-06-26 06:02:19,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2157426.0, ans=0.0 2023-06-26 06:02:23,768 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.53 vs. limit=15.0 2023-06-26 06:02:53,068 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2157486.0, ans=0.09899494936611666 2023-06-26 06:02:57,804 INFO [train.py:996] (2/4) Epoch 12, batch 24150, loss[loss=0.2499, simple_loss=0.3149, pruned_loss=0.09239, over 21479.00 frames. ], tot_loss[loss=0.244, simple_loss=0.3186, pruned_loss=0.08463, over 4288769.38 frames. ], batch size: 548, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:03:17,222 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.whiten.whitening_limit, batch_count=2157546.0, ans=15.0 2023-06-26 06:03:39,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2157666.0, ans=0.125 2023-06-26 06:04:30,175 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.135e+02 1.050e+03 1.414e+03 1.892e+03 4.392e+03, threshold=2.829e+03, percent-clipped=9.0 2023-06-26 06:04:43,459 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2157786.0, ans=0.0 2023-06-26 06:04:55,088 INFO [train.py:996] (2/4) Epoch 12, batch 24200, loss[loss=0.2216, simple_loss=0.3406, pruned_loss=0.0513, over 19836.00 frames. ], tot_loss[loss=0.2456, simple_loss=0.3207, pruned_loss=0.08528, over 4291269.10 frames. ], batch size: 702, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:05:15,763 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2157906.0, ans=0.125 2023-06-26 06:06:44,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.scale_min, batch_count=2158086.0, ans=0.2 2023-06-26 06:06:47,665 INFO [train.py:996] (2/4) Epoch 12, batch 24250, loss[loss=0.1945, simple_loss=0.297, pruned_loss=0.04599, over 21669.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3173, pruned_loss=0.07955, over 4283744.99 frames. ], batch size: 414, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:06:52,482 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.53 vs. 
limit=15.0 2023-06-26 06:07:23,861 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2158206.0, ans=0.1 2023-06-26 06:07:23,926 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2158206.0, ans=0.09899494936611666 2023-06-26 06:07:43,973 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=9.92 vs. limit=15.0 2023-06-26 06:07:50,559 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2158266.0, ans=0.0 2023-06-26 06:08:18,618 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.845e+02 9.840e+02 1.673e+03 2.724e+03 4.672e+03, threshold=3.346e+03, percent-clipped=24.0 2023-06-26 06:08:25,109 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.87 vs. limit=12.0 2023-06-26 06:08:29,653 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:08:37,637 INFO [train.py:996] (2/4) Epoch 12, batch 24300, loss[loss=0.1117, simple_loss=0.1712, pruned_loss=0.02615, over 16318.00 frames. ], tot_loss[loss=0.2278, simple_loss=0.3096, pruned_loss=0.07306, over 4273601.89 frames. ], batch size: 60, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:08:38,245 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2158446.0, ans=0.0 2023-06-26 06:08:57,070 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2158446.0, ans=0.0 2023-06-26 06:08:57,762 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=10.82 vs. limit=22.5 2023-06-26 06:09:32,311 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2158566.0, ans=0.125 2023-06-26 06:09:40,452 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.56 vs. limit=6.0 2023-06-26 06:10:28,363 INFO [train.py:996] (2/4) Epoch 12, batch 24350, loss[loss=0.2392, simple_loss=0.3051, pruned_loss=0.08668, over 21671.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3054, pruned_loss=0.0721, over 4282488.28 frames. ], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:10:32,662 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-26 06:10:56,273 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2158806.0, ans=0.125 2023-06-26 06:11:32,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2158866.0, ans=0.125 2023-06-26 06:12:01,495 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.100e+02 8.199e+02 1.178e+03 1.878e+03 3.422e+03, threshold=2.355e+03, percent-clipped=1.0 2023-06-26 06:12:24,468 INFO [train.py:996] (2/4) Epoch 12, batch 24400, loss[loss=0.2215, simple_loss=0.3135, pruned_loss=0.06472, over 16888.00 frames. ], tot_loss[loss=0.2301, simple_loss=0.3096, pruned_loss=0.07527, over 4282232.96 frames. 
], batch size: 60, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:13:05,929 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=9.62 vs. limit=15.0 2023-06-26 06:13:34,843 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.balancer1.prob, batch_count=2159226.0, ans=0.125 2023-06-26 06:14:22,184 INFO [train.py:996] (2/4) Epoch 12, batch 24450, loss[loss=0.2785, simple_loss=0.3822, pruned_loss=0.08739, over 21245.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3125, pruned_loss=0.07707, over 4283893.98 frames. ], batch size: 549, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:15:26,149 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=3.96 vs. limit=15.0 2023-06-26 06:15:29,185 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass.skip_rate, batch_count=2159526.0, ans=0.04949747468305833 2023-06-26 06:15:42,751 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.744e+02 8.987e+02 1.393e+03 2.049e+03 5.528e+03, threshold=2.786e+03, percent-clipped=20.0 2023-06-26 06:16:11,943 INFO [train.py:996] (2/4) Epoch 12, batch 24500, loss[loss=0.2156, simple_loss=0.2745, pruned_loss=0.07835, over 20250.00 frames. ], tot_loss[loss=0.235, simple_loss=0.3137, pruned_loss=0.0782, over 4283830.94 frames. ], batch size: 703, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:16:12,366 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.0.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:16:36,697 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2159706.0, ans=0.0 2023-06-26 06:16:59,785 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2159766.0, ans=0.0 2023-06-26 06:17:19,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass_mid.scale_min, batch_count=2159826.0, ans=0.2 2023-06-26 06:17:31,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer1.prob, batch_count=2159826.0, ans=0.125 2023-06-26 06:17:41,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2159886.0, ans=0.125 2023-06-26 06:17:49,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer1.prob, batch_count=2159886.0, ans=0.125 2023-06-26 06:18:10,476 INFO [train.py:996] (2/4) Epoch 12, batch 24550, loss[loss=0.2537, simple_loss=0.3345, pruned_loss=0.08645, over 21275.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3152, pruned_loss=0.07919, over 4283598.23 frames. ], batch size: 176, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:18:48,597 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2160006.0, ans=0.125 2023-06-26 06:19:40,397 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.015e+02 1.028e+03 1.361e+03 2.117e+03 3.876e+03, threshold=2.722e+03, percent-clipped=8.0 2023-06-26 06:20:03,157 INFO [train.py:996] (2/4) Epoch 12, batch 24600, loss[loss=0.2788, simple_loss=0.3258, pruned_loss=0.1159, over 21261.00 frames. ], tot_loss[loss=0.2364, simple_loss=0.3128, pruned_loss=0.07994, over 4273770.99 frames. 
], batch size: 471, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:20:06,201 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=5.36 vs. limit=12.0 2023-06-26 06:20:36,196 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2160306.0, ans=0.1 2023-06-26 06:20:49,787 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass_mid.scale_min, batch_count=2160366.0, ans=0.2 2023-06-26 06:20:51,332 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2160366.0, ans=0.0 2023-06-26 06:20:51,405 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2160366.0, ans=0.0 2023-06-26 06:20:56,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2160366.0, ans=0.07 2023-06-26 06:20:57,570 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=22.5 2023-06-26 06:21:16,981 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.17 vs. limit=10.0 2023-06-26 06:21:53,018 INFO [train.py:996] (2/4) Epoch 12, batch 24650, loss[loss=0.2356, simple_loss=0.2992, pruned_loss=0.08603, over 21591.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3064, pruned_loss=0.0797, over 4266832.36 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:22:06,210 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=15.0 2023-06-26 06:22:19,199 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2160606.0, ans=0.1 2023-06-26 06:23:21,782 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.831e+02 9.882e+02 1.634e+03 2.648e+03 5.082e+03, threshold=3.268e+03, percent-clipped=24.0 2023-06-26 06:23:45,413 INFO [train.py:996] (2/4) Epoch 12, batch 24700, loss[loss=0.2015, simple_loss=0.2668, pruned_loss=0.06816, over 21595.00 frames. ], tot_loss[loss=0.231, simple_loss=0.3036, pruned_loss=0.07916, over 4269857.94 frames. ], batch size: 247, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:23:45,976 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2160846.0, ans=0.2 2023-06-26 06:25:20,429 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=16.55 vs. limit=22.5 2023-06-26 06:25:34,915 INFO [train.py:996] (2/4) Epoch 12, batch 24750, loss[loss=0.2338, simple_loss=0.2913, pruned_loss=0.08814, over 21618.00 frames. ], tot_loss[loss=0.2249, simple_loss=0.2971, pruned_loss=0.07632, over 4264454.79 frames. ], batch size: 415, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:26:14,798 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.88 vs. 
limit=22.5 2023-06-26 06:26:19,650 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2161266.0, ans=0.0 2023-06-26 06:26:22,887 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_skip_rate, batch_count=2161266.0, ans=0.0 2023-06-26 06:26:41,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2161326.0, ans=0.125 2023-06-26 06:26:44,292 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.05 vs. limit=15.0 2023-06-26 06:26:48,731 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2161326.0, ans=0.125 2023-06-26 06:26:55,422 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.337e+02 7.152e+02 1.055e+03 1.533e+03 3.030e+03, threshold=2.111e+03, percent-clipped=0.0 2023-06-26 06:27:06,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2161386.0, ans=0.125 2023-06-26 06:27:24,950 INFO [train.py:996] (2/4) Epoch 12, batch 24800, loss[loss=0.1689, simple_loss=0.2308, pruned_loss=0.05351, over 20735.00 frames. ], tot_loss[loss=0.2219, simple_loss=0.2921, pruned_loss=0.07586, over 4264600.62 frames. ], batch size: 609, lr: 2.39e-03, grad_scale: 32.0 2023-06-26 06:27:39,208 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer1.prob, batch_count=2161446.0, ans=0.125 2023-06-26 06:28:18,279 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:28:38,857 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2161626.0, ans=0.1 2023-06-26 06:28:47,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2161686.0, ans=0.0 2023-06-26 06:29:09,399 INFO [train.py:996] (2/4) Epoch 12, batch 24850, loss[loss=0.2215, simple_loss=0.2911, pruned_loss=0.07594, over 21646.00 frames. ], tot_loss[loss=0.2247, simple_loss=0.2938, pruned_loss=0.0778, over 4275843.86 frames. 
], batch size: 263, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:29:40,910 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:29:44,231 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2161806.0, ans=0.0 2023-06-26 06:29:51,176 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass_mid.scale_min, batch_count=2161806.0, ans=0.2 2023-06-26 06:30:20,711 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2161926.0, ans=0.0 2023-06-26 06:30:24,503 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2161926.0, ans=0.125 2023-06-26 06:30:24,595 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2161926.0, ans=0.125 2023-06-26 06:30:33,619 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2161926.0, ans=0.0 2023-06-26 06:30:34,287 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=11.01 vs. limit=22.5 2023-06-26 06:30:45,763 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.468e+02 9.714e+02 1.514e+03 2.212e+03 4.137e+03, threshold=3.028e+03, percent-clipped=27.0 2023-06-26 06:31:07,664 INFO [train.py:996] (2/4) Epoch 12, batch 24900, loss[loss=0.3287, simple_loss=0.3947, pruned_loss=0.1313, over 21844.00 frames. ], tot_loss[loss=0.2272, simple_loss=0.2974, pruned_loss=0.07853, over 4278593.59 frames. ], batch size: 118, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:31:34,813 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2162106.0, ans=0.125 2023-06-26 06:31:41,868 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2162106.0, ans=0.0 2023-06-26 06:32:04,358 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.64 vs. limit=15.0 2023-06-26 06:32:47,353 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2162286.0, ans=0.2 2023-06-26 06:32:52,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2162286.0, ans=0.125 2023-06-26 06:33:08,354 INFO [train.py:996] (2/4) Epoch 12, batch 24950, loss[loss=0.2594, simple_loss=0.3367, pruned_loss=0.09108, over 21447.00 frames. ], tot_loss[loss=0.2368, simple_loss=0.3069, pruned_loss=0.0834, over 4277235.87 frames. 
], batch size: 131, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:33:14,112 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2162346.0, ans=0.0 2023-06-26 06:33:20,306 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2162346.0, ans=0.125 2023-06-26 06:33:46,416 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2162406.0, ans=0.0 2023-06-26 06:34:21,643 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 06:34:43,961 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.936e+02 9.070e+02 1.289e+03 1.993e+03 4.042e+03, threshold=2.579e+03, percent-clipped=8.0 2023-06-26 06:34:59,514 INFO [train.py:996] (2/4) Epoch 12, batch 25000, loss[loss=0.2319, simple_loss=0.3027, pruned_loss=0.08052, over 21647.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.3113, pruned_loss=0.08366, over 4279717.06 frames. ], batch size: 441, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:36:49,066 INFO [train.py:996] (2/4) Epoch 12, batch 25050, loss[loss=0.1966, simple_loss=0.2636, pruned_loss=0.0648, over 21765.00 frames. ], tot_loss[loss=0.2346, simple_loss=0.3044, pruned_loss=0.0824, over 4278283.73 frames. ], batch size: 317, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:37:11,351 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.conv_module2.whiten, num_groups=1, num_channels=192, metric=9.45 vs. limit=15.0 2023-06-26 06:37:25,858 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2163006.0, ans=0.1 2023-06-26 06:37:53,246 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=5.55 vs. limit=15.0 2023-06-26 06:38:26,470 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.221e+02 8.095e+02 1.236e+03 1.556e+03 3.258e+03, threshold=2.471e+03, percent-clipped=4.0 2023-06-26 06:38:42,006 INFO [train.py:996] (2/4) Epoch 12, batch 25100, loss[loss=0.2161, simple_loss=0.2779, pruned_loss=0.07717, over 21743.00 frames. ], tot_loss[loss=0.2297, simple_loss=0.2985, pruned_loss=0.08046, over 4264373.77 frames. ], batch size: 124, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:39:24,435 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.12 vs. limit=22.5 2023-06-26 06:39:39,232 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys.whitening_limit, batch_count=2163426.0, ans=6.0 2023-06-26 06:39:57,313 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2163426.0, ans=0.125 2023-06-26 06:40:23,749 INFO [train.py:996] (2/4) Epoch 12, batch 25150, loss[loss=0.2072, simple_loss=0.3099, pruned_loss=0.05225, over 21781.00 frames. ], tot_loss[loss=0.229, simple_loss=0.3029, pruned_loss=0.07757, over 4271891.16 frames. 
], batch size: 282, lr: 2.39e-03, grad_scale: 8.0 2023-06-26 06:40:29,000 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2163546.0, ans=0.125 2023-06-26 06:41:31,598 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2163726.0, ans=0.125 2023-06-26 06:41:48,285 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.357e+02 7.684e+02 1.120e+03 1.527e+03 4.461e+03, threshold=2.241e+03, percent-clipped=8.0 2023-06-26 06:42:06,970 INFO [train.py:996] (2/4) Epoch 12, batch 25200, loss[loss=0.1954, simple_loss=0.274, pruned_loss=0.0584, over 21261.00 frames. ], tot_loss[loss=0.2271, simple_loss=0.3025, pruned_loss=0.07583, over 4242934.89 frames. ], batch size: 159, lr: 2.39e-03, grad_scale: 16.0 2023-06-26 06:42:23,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2163906.0, ans=0.125 2023-06-26 06:43:51,985 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2164086.0, ans=0.2 2023-06-26 06:43:56,898 INFO [train.py:996] (2/4) Epoch 12, batch 25250, loss[loss=0.2046, simple_loss=0.2823, pruned_loss=0.06346, over 21610.00 frames. ], tot_loss[loss=0.2235, simple_loss=0.2994, pruned_loss=0.07374, over 4253401.43 frames. ], batch size: 332, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:44:56,779 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2164326.0, ans=0.0 2023-06-26 06:45:25,558 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.169e+02 7.134e+02 9.127e+02 1.634e+03 5.272e+03, threshold=1.825e+03, percent-clipped=13.0 2023-06-26 06:45:45,404 INFO [train.py:996] (2/4) Epoch 12, batch 25300, loss[loss=0.2277, simple_loss=0.315, pruned_loss=0.07017, over 21318.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2977, pruned_loss=0.07401, over 4245074.19 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:46:00,128 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2164446.0, ans=0.0 2023-06-26 06:46:24,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2164566.0, ans=0.04949747468305833 2023-06-26 06:46:31,705 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2164566.0, ans=0.1 2023-06-26 06:47:22,432 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.63 vs. limit=12.0 2023-06-26 06:47:27,423 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2164686.0, ans=0.0 2023-06-26 06:47:35,405 INFO [train.py:996] (2/4) Epoch 12, batch 25350, loss[loss=0.1737, simple_loss=0.2692, pruned_loss=0.03908, over 21777.00 frames. ], tot_loss[loss=0.2229, simple_loss=0.2994, pruned_loss=0.07314, over 4248442.40 frames. 
], batch size: 282, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:47:35,862 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2164746.0, ans=0.125 2023-06-26 06:47:39,866 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.31 vs. limit=6.0 2023-06-26 06:47:59,805 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2164806.0, ans=0.125 2023-06-26 06:48:20,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2164866.0, ans=0.125 2023-06-26 06:48:45,430 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=3.03 vs. limit=6.0 2023-06-26 06:48:51,835 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2164926.0, ans=0.0 2023-06-26 06:49:04,261 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.341e+02 9.859e+02 1.527e+03 2.457e+03 4.731e+03, threshold=3.054e+03, percent-clipped=38.0 2023-06-26 06:49:17,672 INFO [train.py:996] (2/4) Epoch 12, batch 25400, loss[loss=0.1911, simple_loss=0.2608, pruned_loss=0.06069, over 21878.00 frames. ], tot_loss[loss=0.2194, simple_loss=0.2948, pruned_loss=0.07204, over 4248558.62 frames. ], batch size: 373, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:49:59,434 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.skip_rate, batch_count=2165166.0, ans=0.07 2023-06-26 06:50:31,074 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=6.44 vs. limit=10.0 2023-06-26 06:50:55,625 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2165286.0, ans=0.0 2023-06-26 06:51:05,427 INFO [train.py:996] (2/4) Epoch 12, batch 25450, loss[loss=0.2633, simple_loss=0.3338, pruned_loss=0.09644, over 21610.00 frames. ], tot_loss[loss=0.222, simple_loss=0.2962, pruned_loss=0.07387, over 4260885.74 frames. ], batch size: 471, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:51:13,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff2_skip_rate, batch_count=2165346.0, ans=0.0 2023-06-26 06:51:47,749 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.07 vs. limit=6.0 2023-06-26 06:51:48,744 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2165466.0, ans=0.2 2023-06-26 06:51:56,257 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2165466.0, ans=0.125 2023-06-26 06:51:58,549 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=8.08 vs. 
limit=15.0 2023-06-26 06:52:02,947 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2165526.0, ans=0.125 2023-06-26 06:52:20,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2165526.0, ans=0.125 2023-06-26 06:52:43,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.859e+02 7.438e+02 1.024e+03 1.496e+03 3.770e+03, threshold=2.047e+03, percent-clipped=1.0 2023-06-26 06:52:55,777 INFO [train.py:996] (2/4) Epoch 12, batch 25500, loss[loss=0.2263, simple_loss=0.3208, pruned_loss=0.0659, over 21884.00 frames. ], tot_loss[loss=0.2188, simple_loss=0.2963, pruned_loss=0.07059, over 4252779.85 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 06:53:09,117 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=15.74 vs. limit=22.5 2023-06-26 06:53:21,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2165706.0, ans=0.1 2023-06-26 06:53:29,105 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.46 vs. limit=22.5 2023-06-26 06:54:25,214 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.48 vs. limit=22.5 2023-06-26 06:54:38,445 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2165886.0, ans=0.025 2023-06-26 06:54:53,050 INFO [train.py:996] (2/4) Epoch 12, batch 25550, loss[loss=0.2376, simple_loss=0.3399, pruned_loss=0.0677, over 21669.00 frames. ], tot_loss[loss=0.2221, simple_loss=0.3032, pruned_loss=0.07054, over 4254156.40 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 06:55:21,951 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2166006.0, ans=0.0 2023-06-26 06:56:31,793 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.571e+02 9.605e+02 1.487e+03 2.203e+03 4.525e+03, threshold=2.973e+03, percent-clipped=31.0 2023-06-26 06:56:41,163 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.51 vs. limit=22.5 2023-06-26 06:56:43,668 INFO [train.py:996] (2/4) Epoch 12, batch 25600, loss[loss=0.2735, simple_loss=0.3465, pruned_loss=0.1002, over 21440.00 frames. ], tot_loss[loss=0.2248, simple_loss=0.3067, pruned_loss=0.07148, over 4262395.38 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:57:01,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2166306.0, ans=0.0 2023-06-26 06:57:10,374 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=12.94 vs. limit=15.0 2023-06-26 06:57:44,860 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2166366.0, ans=0.1 2023-06-26 06:58:24,717 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.77 vs. 
limit=22.5 2023-06-26 06:58:27,674 INFO [train.py:996] (2/4) Epoch 12, batch 25650, loss[loss=0.2016, simple_loss=0.2625, pruned_loss=0.0703, over 21655.00 frames. ], tot_loss[loss=0.2275, simple_loss=0.3069, pruned_loss=0.07407, over 4261000.35 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 06:58:30,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2166546.0, ans=0.125 2023-06-26 06:59:27,165 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_skip_rate, batch_count=2166666.0, ans=0.0 2023-06-26 06:59:57,890 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.69 vs. limit=12.0 2023-06-26 07:00:05,373 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.624e+02 9.508e+02 1.344e+03 1.911e+03 3.919e+03, threshold=2.688e+03, percent-clipped=6.0 2023-06-26 07:00:13,598 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=3.97 vs. limit=15.0 2023-06-26 07:00:17,989 INFO [train.py:996] (2/4) Epoch 12, batch 25700, loss[loss=0.2212, simple_loss=0.3072, pruned_loss=0.06767, over 21650.00 frames. ], tot_loss[loss=0.2283, simple_loss=0.3056, pruned_loss=0.07549, over 4251598.43 frames. ], batch size: 389, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:00:56,359 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2166906.0, ans=0.125 2023-06-26 07:01:00,020 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2166966.0, ans=0.1 2023-06-26 07:01:53,072 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2167086.0, ans=0.125 2023-06-26 07:02:06,704 INFO [train.py:996] (2/4) Epoch 12, batch 25750, loss[loss=0.2417, simple_loss=0.3205, pruned_loss=0.08142, over 21599.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3103, pruned_loss=0.07919, over 4256880.02 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:02:29,453 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.min_abs, batch_count=2167206.0, ans=0.5 2023-06-26 07:02:38,772 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:02:55,225 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2167206.0, ans=0.125 2023-06-26 07:03:19,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer2.prob, batch_count=2167266.0, ans=0.125 2023-06-26 07:03:47,665 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.983e+02 9.621e+02 1.292e+03 1.996e+03 6.312e+03, threshold=2.583e+03, percent-clipped=12.0 2023-06-26 07:03:52,089 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2167386.0, ans=0.1 2023-06-26 07:03:52,111 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2167386.0, ans=0.0 2023-06-26 07:04:10,199 INFO [train.py:996] (2/4) Epoch 12, batch 25800, loss[loss=0.2709, simple_loss=0.3535, pruned_loss=0.0942, over 21850.00 frames. 
], tot_loss[loss=0.2439, simple_loss=0.3213, pruned_loss=0.08328, over 4258205.27 frames. ], batch size: 124, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:04:40,817 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2167506.0, ans=0.025 2023-06-26 07:04:59,354 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass_mid.scale_min, batch_count=2167566.0, ans=0.2 2023-06-26 07:05:24,726 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn1.whiten, num_groups=1, num_channels=192, metric=14.03 vs. limit=22.5 2023-06-26 07:06:05,641 INFO [train.py:996] (2/4) Epoch 12, batch 25850, loss[loss=0.2363, simple_loss=0.3011, pruned_loss=0.08574, over 21496.00 frames. ], tot_loss[loss=0.245, simple_loss=0.3235, pruned_loss=0.08326, over 4265994.54 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:06:13,956 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.84 vs. limit=22.5 2023-06-26 07:06:32,266 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:06:39,049 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2167806.0, ans=0.125 2023-06-26 07:06:53,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff2_skip_rate, batch_count=2167866.0, ans=0.0 2023-06-26 07:07:48,458 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.401e+02 9.971e+02 1.375e+03 1.928e+03 5.111e+03, threshold=2.750e+03, percent-clipped=7.0 2023-06-26 07:08:04,018 INFO [train.py:996] (2/4) Epoch 12, batch 25900, loss[loss=0.2916, simple_loss=0.3805, pruned_loss=0.1013, over 21760.00 frames. ], tot_loss[loss=0.2465, simple_loss=0.3251, pruned_loss=0.08395, over 4267362.54 frames. ], batch size: 414, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:08:04,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2168046.0, ans=0.125 2023-06-26 07:09:00,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2168166.0, ans=0.2 2023-06-26 07:09:52,367 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2168286.0, ans=0.125 2023-06-26 07:09:55,125 INFO [train.py:996] (2/4) Epoch 12, batch 25950, loss[loss=0.2885, simple_loss=0.397, pruned_loss=0.08997, over 20847.00 frames. ], tot_loss[loss=0.2517, simple_loss=0.3308, pruned_loss=0.08628, over 4269049.31 frames. ], batch size: 607, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:11:24,823 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_skip_rate, batch_count=2168526.0, ans=0.0 2023-06-26 07:11:35,005 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.821e+02 8.955e+02 1.246e+03 1.921e+03 4.535e+03, threshold=2.491e+03, percent-clipped=11.0 2023-06-26 07:11:45,162 INFO [train.py:996] (2/4) Epoch 12, batch 26000, loss[loss=0.3485, simple_loss=0.4098, pruned_loss=0.1437, over 21396.00 frames. ], tot_loss[loss=0.2516, simple_loss=0.3319, pruned_loss=0.08564, over 4266491.66 frames. 
], batch size: 507, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:12:44,168 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.58 vs. limit=6.0 2023-06-26 07:13:01,961 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff3.min_abs, batch_count=2168826.0, ans=0.2 2023-06-26 07:13:34,700 INFO [train.py:996] (2/4) Epoch 12, batch 26050, loss[loss=0.2311, simple_loss=0.2952, pruned_loss=0.08354, over 21236.00 frames. ], tot_loss[loss=0.2527, simple_loss=0.3318, pruned_loss=0.0868, over 4272404.96 frames. ], batch size: 143, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:13:43,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2168946.0, ans=0.125 2023-06-26 07:14:30,226 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2169066.0, ans=0.025 2023-06-26 07:14:40,217 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2169066.0, ans=0.0 2023-06-26 07:14:45,511 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass.scale_min, batch_count=2169126.0, ans=0.2 2023-06-26 07:14:49,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2169126.0, ans=0.0 2023-06-26 07:15:11,925 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.629e+02 1.039e+03 1.480e+03 2.039e+03 5.924e+03, threshold=2.960e+03, percent-clipped=13.0 2023-06-26 07:15:22,463 INFO [train.py:996] (2/4) Epoch 12, batch 26100, loss[loss=0.222, simple_loss=0.2863, pruned_loss=0.07883, over 21828.00 frames. ], tot_loss[loss=0.2493, simple_loss=0.3257, pruned_loss=0.08642, over 4282424.23 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:16:32,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.15 vs. limit=10.0 2023-06-26 07:16:41,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2169426.0, ans=0.1 2023-06-26 07:17:10,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2169486.0, ans=0.125 2023-06-26 07:17:13,275 INFO [train.py:996] (2/4) Epoch 12, batch 26150, loss[loss=0.2747, simple_loss=0.3414, pruned_loss=0.104, over 21270.00 frames. ], tot_loss[loss=0.2473, simple_loss=0.322, pruned_loss=0.08627, over 4285259.32 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:17:25,655 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2169546.0, ans=0.125 2023-06-26 07:18:28,388 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2169726.0, ans=0.2 2023-06-26 07:18:36,145 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2169726.0, ans=0.0 2023-06-26 07:18:52,702 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.87 vs. 
limit=10.0 2023-06-26 07:18:52,963 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.447e+02 8.636e+02 1.260e+03 1.715e+03 2.544e+03, threshold=2.520e+03, percent-clipped=0.0 2023-06-26 07:19:00,593 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2169786.0, ans=0.0 2023-06-26 07:19:03,282 INFO [train.py:996] (2/4) Epoch 12, batch 26200, loss[loss=0.2388, simple_loss=0.343, pruned_loss=0.06735, over 21750.00 frames. ], tot_loss[loss=0.2446, simple_loss=0.3217, pruned_loss=0.08379, over 4284992.90 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:19:57,646 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=10.87 vs. limit=22.5 2023-06-26 07:20:52,544 INFO [train.py:996] (2/4) Epoch 12, batch 26250, loss[loss=0.2205, simple_loss=0.2952, pruned_loss=0.07287, over 21378.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3245, pruned_loss=0.08297, over 4284308.09 frames. ], batch size: 159, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:21:54,371 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff2_skip_rate, batch_count=2170266.0, ans=0.0 2023-06-26 07:21:57,783 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2170266.0, ans=0.125 2023-06-26 07:22:04,035 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.19 vs. limit=15.0 2023-06-26 07:22:11,583 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2170326.0, ans=0.125 2023-06-26 07:22:31,942 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.498e+02 1.014e+03 1.360e+03 1.996e+03 4.754e+03, threshold=2.720e+03, percent-clipped=15.0 2023-06-26 07:22:42,523 INFO [train.py:996] (2/4) Epoch 12, batch 26300, loss[loss=0.2479, simple_loss=0.3124, pruned_loss=0.09165, over 21804.00 frames. ], tot_loss[loss=0.2452, simple_loss=0.3224, pruned_loss=0.08402, over 4294752.20 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:22:47,913 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2170446.0, ans=0.0 2023-06-26 07:23:09,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=8.09 vs. 
limit=22.5 2023-06-26 07:23:26,018 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2170506.0, ans=0.0 2023-06-26 07:23:43,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2170566.0, ans=0.125 2023-06-26 07:23:46,957 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2170566.0, ans=0.125 2023-06-26 07:24:10,848 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2170626.0, ans=0.2 2023-06-26 07:24:16,617 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2170686.0, ans=0.1 2023-06-26 07:24:29,017 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2170686.0, ans=0.1 2023-06-26 07:24:40,570 INFO [train.py:996] (2/4) Epoch 12, batch 26350, loss[loss=0.3086, simple_loss=0.3706, pruned_loss=0.1234, over 21592.00 frames. ], tot_loss[loss=0.2457, simple_loss=0.3215, pruned_loss=0.08491, over 4297268.75 frames. ], batch size: 415, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:25:05,923 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2170746.0, ans=0.125 2023-06-26 07:26:09,383 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2.whitening_limit, batch_count=2170986.0, ans=15.0 2023-06-26 07:26:10,258 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.697e+02 8.952e+02 1.093e+03 1.459e+03 3.186e+03, threshold=2.186e+03, percent-clipped=0.0 2023-06-26 07:26:25,978 INFO [train.py:996] (2/4) Epoch 12, batch 26400, loss[loss=0.1985, simple_loss=0.2588, pruned_loss=0.06912, over 21630.00 frames. ], tot_loss[loss=0.2421, simple_loss=0.3153, pruned_loss=0.08445, over 4294833.55 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:26:33,418 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2171046.0, ans=0.1 2023-06-26 07:27:41,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2171226.0, ans=0.0 2023-06-26 07:27:48,548 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=4.53 vs. limit=15.0 2023-06-26 07:28:30,884 INFO [train.py:996] (2/4) Epoch 12, batch 26450, loss[loss=0.2704, simple_loss=0.3714, pruned_loss=0.08473, over 21837.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3142, pruned_loss=0.08355, over 4284574.54 frames. ], batch size: 372, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:28:35,741 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2171346.0, ans=0.125 2023-06-26 07:30:11,348 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.179e+02 1.055e+03 2.037e+03 2.907e+03 6.247e+03, threshold=4.074e+03, percent-clipped=46.0 2023-06-26 07:30:20,916 INFO [train.py:996] (2/4) Epoch 12, batch 26500, loss[loss=0.2304, simple_loss=0.3477, pruned_loss=0.05656, over 20817.00 frames. ], tot_loss[loss=0.2393, simple_loss=0.316, pruned_loss=0.0813, over 4277322.64 frames. 
], batch size: 607, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:32:13,358 INFO [train.py:996] (2/4) Epoch 12, batch 26550, loss[loss=0.193, simple_loss=0.2794, pruned_loss=0.05331, over 21584.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.314, pruned_loss=0.07852, over 4265044.29 frames. ], batch size: 230, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:32:24,364 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.out_combiner.scale_min, batch_count=2171946.0, ans=0.2 2023-06-26 07:32:38,477 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer2.prob, batch_count=2172006.0, ans=0.125 2023-06-26 07:33:11,278 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=22.5 2023-06-26 07:33:23,997 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2172126.0, ans=0.0 2023-06-26 07:33:48,932 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.879e+02 9.613e+02 1.344e+03 2.252e+03 5.344e+03, threshold=2.687e+03, percent-clipped=4.0 2023-06-26 07:33:57,153 INFO [train.py:996] (2/4) Epoch 12, batch 26600, loss[loss=0.2028, simple_loss=0.276, pruned_loss=0.06479, over 21623.00 frames. ], tot_loss[loss=0.2308, simple_loss=0.3115, pruned_loss=0.07504, over 4264073.53 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:33:57,734 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=2172246.0, ans=0.125 2023-06-26 07:34:05,636 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.56 vs. limit=22.5 2023-06-26 07:34:10,532 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=2.08 vs. limit=6.0 2023-06-26 07:35:01,987 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2172366.0, ans=0.1 2023-06-26 07:35:45,178 INFO [train.py:996] (2/4) Epoch 12, batch 26650, loss[loss=0.1687, simple_loss=0.2476, pruned_loss=0.04492, over 21518.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.3049, pruned_loss=0.07407, over 4265803.80 frames. ], batch size: 195, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:35:54,942 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.53 vs. limit=15.0 2023-06-26 07:35:55,891 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2172546.0, ans=0.125 2023-06-26 07:36:15,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2172606.0, ans=0.0 2023-06-26 07:36:48,518 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.balancer1.prob, batch_count=2172666.0, ans=0.125 2023-06-26 07:37:20,215 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.879e+02 7.426e+02 9.644e+02 1.380e+03 2.955e+03, threshold=1.929e+03, percent-clipped=1.0 2023-06-26 07:37:27,391 INFO [train.py:996] (2/4) Epoch 12, batch 26700, loss[loss=0.2179, simple_loss=0.2842, pruned_loss=0.07586, over 21596.00 frames. 
], tot_loss[loss=0.221, simple_loss=0.2987, pruned_loss=0.07161, over 4266559.06 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:37:30,627 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=7.26 vs. limit=22.5 2023-06-26 07:37:41,134 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2172846.0, ans=0.125 2023-06-26 07:38:21,956 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:38:26,666 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.attention_skip_rate, batch_count=2172966.0, ans=0.0 2023-06-26 07:39:02,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2173086.0, ans=0.0 2023-06-26 07:39:14,908 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2173086.0, ans=0.125 2023-06-26 07:39:17,609 INFO [train.py:996] (2/4) Epoch 12, batch 26750, loss[loss=0.2111, simple_loss=0.2979, pruned_loss=0.06221, over 21754.00 frames. ], tot_loss[loss=0.2189, simple_loss=0.2973, pruned_loss=0.07027, over 4271396.65 frames. ], batch size: 332, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:39:54,416 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=9.41 vs. limit=15.0 2023-06-26 07:40:04,342 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.attention_skip_rate, batch_count=2173206.0, ans=0.0 2023-06-26 07:40:11,934 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2173266.0, ans=0.0 2023-06-26 07:40:17,484 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.51 vs. limit=15.0 2023-06-26 07:41:00,874 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.608e+02 7.962e+02 9.923e+02 1.450e+03 3.695e+03, threshold=1.985e+03, percent-clipped=8.0 2023-06-26 07:41:18,478 INFO [train.py:996] (2/4) Epoch 12, batch 26800, loss[loss=0.2626, simple_loss=0.3335, pruned_loss=0.09588, over 21819.00 frames. ], tot_loss[loss=0.226, simple_loss=0.3045, pruned_loss=0.07379, over 4272005.09 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:41:34,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2173506.0, ans=0.0 2023-06-26 07:41:46,231 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.25 vs. 
limit=12.0 2023-06-26 07:41:58,010 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2173566.0, ans=0.1 2023-06-26 07:42:02,810 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2173566.0, ans=0.1 2023-06-26 07:42:55,841 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer_ff2.min_abs, batch_count=2173686.0, ans=0.1 2023-06-26 07:43:10,559 INFO [train.py:996] (2/4) Epoch 12, batch 26850, loss[loss=0.2407, simple_loss=0.2986, pruned_loss=0.0914, over 21856.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.3058, pruned_loss=0.07702, over 4273812.37 frames. ], batch size: 118, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:43:16,549 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.min_positive, batch_count=2173746.0, ans=0.025 2023-06-26 07:43:19,534 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2173746.0, ans=0.0 2023-06-26 07:43:21,057 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2173746.0, ans=0.125 2023-06-26 07:43:21,110 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:43:30,885 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=4.73 vs. limit=15.0 2023-06-26 07:43:40,483 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.hidden_balancer.prob, batch_count=2173806.0, ans=0.125 2023-06-26 07:43:58,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2173866.0, ans=0.0 2023-06-26 07:44:01,664 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer2.prob, batch_count=2173926.0, ans=0.125 2023-06-26 07:44:18,202 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2173986.0, ans=0.125 2023-06-26 07:44:40,221 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 4.975e+02 7.931e+02 1.142e+03 1.522e+03 3.683e+03, threshold=2.283e+03, percent-clipped=9.0 2023-06-26 07:44:53,086 INFO [train.py:996] (2/4) Epoch 12, batch 26900, loss[loss=0.2161, simple_loss=0.2784, pruned_loss=0.07687, over 21540.00 frames. ], tot_loss[loss=0.2252, simple_loss=0.2982, pruned_loss=0.0761, over 4271673.29 frames. ], batch size: 391, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:45:00,022 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2174046.0, ans=0.0 2023-06-26 07:45:14,722 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2174106.0, ans=0.125 2023-06-26 07:45:34,138 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2174166.0, ans=0.07 2023-06-26 07:45:39,780 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.33 vs. 
limit=12.0 2023-06-26 07:46:23,842 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2174286.0, ans=0.125 2023-06-26 07:46:41,244 INFO [train.py:996] (2/4) Epoch 12, batch 26950, loss[loss=0.2523, simple_loss=0.3447, pruned_loss=0.07996, over 21825.00 frames. ], tot_loss[loss=0.2246, simple_loss=0.2973, pruned_loss=0.07593, over 4268674.59 frames. ], batch size: 371, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:46:58,988 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff3_skip_rate, batch_count=2174406.0, ans=0.0 2023-06-26 07:48:17,130 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.465e+02 7.709e+02 1.449e+03 2.208e+03 6.166e+03, threshold=2.897e+03, percent-clipped=23.0 2023-06-26 07:48:22,385 INFO [train.py:996] (2/4) Epoch 12, batch 27000, loss[loss=0.208, simple_loss=0.2991, pruned_loss=0.05847, over 21759.00 frames. ], tot_loss[loss=0.2216, simple_loss=0.2966, pruned_loss=0.07325, over 4275833.97 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:48:22,385 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 07:48:40,352 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2401, simple_loss=0.3367, pruned_loss=0.07176, over 1796401.00 frames. 2023-06-26 07:48:40,352 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-26 07:49:00,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2174706.0, ans=0.125 2023-06-26 07:49:16,457 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.27 vs. limit=15.0 2023-06-26 07:49:48,659 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2174826.0, ans=0.125 2023-06-26 07:50:26,726 INFO [train.py:996] (2/4) Epoch 12, batch 27050, loss[loss=0.2627, simple_loss=0.3409, pruned_loss=0.09223, over 21596.00 frames. ], tot_loss[loss=0.2204, simple_loss=0.2998, pruned_loss=0.07049, over 4281892.82 frames. ], batch size: 507, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:50:52,813 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=8, num_channels=256, metric=3.24 vs. limit=6.0 2023-06-26 07:51:27,807 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2175126.0, ans=0.1 2023-06-26 07:51:48,638 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2175126.0, ans=0.0 2023-06-26 07:52:02,932 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.82 vs. limit=10.0 2023-06-26 07:52:06,706 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.688e+02 1.022e+03 1.429e+03 2.477e+03 5.037e+03, threshold=2.858e+03, percent-clipped=17.0 2023-06-26 07:52:12,159 INFO [train.py:996] (2/4) Epoch 12, batch 27100, loss[loss=0.2082, simple_loss=0.3001, pruned_loss=0.05813, over 21899.00 frames. ], tot_loss[loss=0.222, simple_loss=0.3015, pruned_loss=0.07127, over 4290155.93 frames. 
], batch size: 316, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:52:12,880 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2175246.0, ans=0.125 2023-06-26 07:52:27,063 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2175246.0, ans=0.1 2023-06-26 07:52:47,349 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2175366.0, ans=0.0 2023-06-26 07:53:36,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.skip_rate, batch_count=2175426.0, ans=0.07 2023-06-26 07:54:02,039 INFO [train.py:996] (2/4) Epoch 12, batch 27150, loss[loss=0.2725, simple_loss=0.359, pruned_loss=0.09302, over 21820.00 frames. ], tot_loss[loss=0.2314, simple_loss=0.3127, pruned_loss=0.07501, over 4285223.51 frames. ], batch size: 282, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:54:05,778 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2175546.0, ans=0.125 2023-06-26 07:54:09,728 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=3.76 vs. limit=15.0 2023-06-26 07:54:18,082 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.min_abs, batch_count=2175606.0, ans=0.5 2023-06-26 07:54:51,482 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2175666.0, ans=0.1 2023-06-26 07:55:17,723 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2175726.0, ans=0.125 2023-06-26 07:55:20,912 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2175726.0, ans=0.125 2023-06-26 07:55:25,089 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 07:55:40,273 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.180e+02 9.529e+02 1.614e+03 2.368e+03 4.440e+03, threshold=3.227e+03, percent-clipped=12.0 2023-06-26 07:55:45,337 INFO [train.py:996] (2/4) Epoch 12, batch 27200, loss[loss=0.3481, simple_loss=0.4011, pruned_loss=0.1475, over 21406.00 frames. ], tot_loss[loss=0.238, simple_loss=0.3205, pruned_loss=0.07772, over 4287025.14 frames. ], batch size: 508, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 07:55:51,876 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=2.39 vs. limit=12.0 2023-06-26 07:56:10,953 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2175906.0, ans=0.0 2023-06-26 07:56:24,724 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=13.51 vs. limit=15.0 2023-06-26 07:56:38,116 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2175906.0, ans=0.1 2023-06-26 07:57:33,911 INFO [train.py:996] (2/4) Epoch 12, batch 27250, loss[loss=0.2238, simple_loss=0.2946, pruned_loss=0.07654, over 21652.00 frames. 
], tot_loss[loss=0.2432, simple_loss=0.3231, pruned_loss=0.08166, over 4286059.67 frames. ], batch size: 263, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 07:58:24,127 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2176266.0, ans=0.0 2023-06-26 07:59:02,521 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2176386.0, ans=0.0 2023-06-26 07:59:24,079 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.268e+02 8.548e+02 1.157e+03 1.722e+03 4.031e+03, threshold=2.315e+03, percent-clipped=3.0 2023-06-26 07:59:32,600 INFO [train.py:996] (2/4) Epoch 12, batch 27300, loss[loss=0.2356, simple_loss=0.3197, pruned_loss=0.07574, over 21781.00 frames. ], tot_loss[loss=0.2453, simple_loss=0.3251, pruned_loss=0.08279, over 4281911.46 frames. ], batch size: 247, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:00:27,077 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2176566.0, ans=0.2 2023-06-26 08:01:21,493 INFO [train.py:996] (2/4) Epoch 12, batch 27350, loss[loss=0.2666, simple_loss=0.3424, pruned_loss=0.09539, over 21743.00 frames. ], tot_loss[loss=0.2464, simple_loss=0.3269, pruned_loss=0.08294, over 4284112.33 frames. ], batch size: 441, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:01:30,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2176746.0, ans=0.1 2023-06-26 08:01:33,824 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer2.prob, batch_count=2176746.0, ans=0.125 2023-06-26 08:01:35,587 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2176746.0, ans=0.125 2023-06-26 08:02:00,812 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2176866.0, ans=0.125 2023-06-26 08:02:02,410 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2176866.0, ans=0.1 2023-06-26 08:02:14,473 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.ff3_skip_rate, batch_count=2176926.0, ans=0.0 2023-06-26 08:02:33,055 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.min_positive, batch_count=2176986.0, ans=0.05 2023-06-26 08:02:53,586 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer2.prob, batch_count=2176986.0, ans=0.125 2023-06-26 08:02:59,822 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.844e+02 9.303e+02 1.156e+03 1.696e+03 4.537e+03, threshold=2.312e+03, percent-clipped=11.0 2023-06-26 08:03:08,449 INFO [train.py:996] (2/4) Epoch 12, batch 27400, loss[loss=0.2143, simple_loss=0.2815, pruned_loss=0.07355, over 21715.00 frames. ], tot_loss[loss=0.2442, simple_loss=0.3226, pruned_loss=0.08294, over 4286851.50 frames. ], batch size: 414, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:03:09,447 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=4.06 vs. limit=15.0 2023-06-26 08:03:18,601 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=1.72 vs. 
limit=6.0 2023-06-26 08:03:50,600 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2177166.0, ans=0.125 2023-06-26 08:04:06,821 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer2.min_positive, batch_count=2177226.0, ans=0.05 2023-06-26 08:04:43,854 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.43 vs. limit=15.0 2023-06-26 08:04:56,339 INFO [train.py:996] (2/4) Epoch 12, batch 27450, loss[loss=0.231, simple_loss=0.3126, pruned_loss=0.07472, over 21306.00 frames. ], tot_loss[loss=0.2401, simple_loss=0.3176, pruned_loss=0.08131, over 4279801.49 frames. ], batch size: 548, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:05:00,542 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module2.balancer1.prob, batch_count=2177346.0, ans=0.125 2023-06-26 08:05:15,300 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2177406.0, ans=0.1 2023-06-26 08:05:35,739 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2177466.0, ans=0.2 2023-06-26 08:06:35,020 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.529e+02 8.908e+02 1.196e+03 1.781e+03 4.590e+03, threshold=2.391e+03, percent-clipped=13.0 2023-06-26 08:06:38,347 INFO [train.py:996] (2/4) Epoch 12, batch 27500, loss[loss=0.2018, simple_loss=0.2778, pruned_loss=0.06289, over 21869.00 frames. ], tot_loss[loss=0.2406, simple_loss=0.3166, pruned_loss=0.0823, over 4291626.23 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:07:04,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2177706.0, ans=0.125 2023-06-26 08:07:08,560 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.46 vs. limit=22.5 2023-06-26 08:07:16,221 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module2.balancer1.max_abs, batch_count=2177766.0, ans=10.0 2023-06-26 08:07:36,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff2_skip_rate, batch_count=2177826.0, ans=0.0 2023-06-26 08:08:20,090 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2177886.0, ans=0.125 2023-06-26 08:08:20,634 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=6.07 vs. limit=15.0 2023-06-26 08:08:27,999 INFO [train.py:996] (2/4) Epoch 12, batch 27550, loss[loss=0.1909, simple_loss=0.2693, pruned_loss=0.0563, over 21746.00 frames. ], tot_loss[loss=0.2343, simple_loss=0.3111, pruned_loss=0.07874, over 4284057.45 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:08:39,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.ff2_skip_rate, batch_count=2177946.0, ans=0.0 2023-06-26 08:08:53,198 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=4.94 vs. 
limit=10.0 2023-06-26 08:08:56,273 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=9.09 vs. limit=22.5 2023-06-26 08:10:12,154 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.726e+02 8.167e+02 1.305e+03 2.054e+03 4.575e+03, threshold=2.609e+03, percent-clipped=17.0 2023-06-26 08:10:15,787 INFO [train.py:996] (2/4) Epoch 12, batch 27600, loss[loss=0.2023, simple_loss=0.2673, pruned_loss=0.06865, over 21331.00 frames. ], tot_loss[loss=0.2299, simple_loss=0.304, pruned_loss=0.07792, over 4286984.16 frames. ], batch size: 211, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:11:12,309 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2178426.0, ans=0.0 2023-06-26 08:11:32,935 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module2.balancer2.prob, batch_count=2178426.0, ans=0.125 2023-06-26 08:11:50,589 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.88 vs. limit=15.0 2023-06-26 08:11:51,707 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2178486.0, ans=0.125 2023-06-26 08:12:00,917 INFO [train.py:996] (2/4) Epoch 12, batch 27650, loss[loss=0.216, simple_loss=0.2893, pruned_loss=0.07138, over 16751.00 frames. ], tot_loss[loss=0.2263, simple_loss=0.2985, pruned_loss=0.07702, over 4274063.80 frames. ], batch size: 64, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:12:08,432 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2178546.0, ans=0.125 2023-06-26 08:12:24,456 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.56 vs. limit=15.0 2023-06-26 08:12:25,774 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.attention_skip_rate, batch_count=2178606.0, ans=0.0 2023-06-26 08:12:34,710 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=2178666.0, ans=0.125 2023-06-26 08:12:54,750 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2178666.0, ans=0.125 2023-06-26 08:13:46,689 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.067e+02 8.445e+02 1.173e+03 2.476e+03 5.396e+03, threshold=2.346e+03, percent-clipped=23.0 2023-06-26 08:13:49,052 INFO [train.py:996] (2/4) Epoch 12, batch 27700, loss[loss=0.2037, simple_loss=0.274, pruned_loss=0.06666, over 21891.00 frames. ], tot_loss[loss=0.2264, simple_loss=0.3004, pruned_loss=0.07624, over 4271229.99 frames. 
], batch size: 98, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:14:18,291 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_skip_rate, batch_count=2178906.0, ans=0.0 2023-06-26 08:14:30,929 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2178966.0, ans=0.2 2023-06-26 08:14:41,149 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2178966.0, ans=0.0 2023-06-26 08:14:41,150 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2178966.0, ans=0.0 2023-06-26 08:15:14,830 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.self_attn2.whiten, num_groups=1, num_channels=192, metric=9.80 vs. limit=22.5 2023-06-26 08:15:17,849 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2179086.0, ans=0.1 2023-06-26 08:15:21,039 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass_mid.scale_min, batch_count=2179086.0, ans=0.2 2023-06-26 08:15:33,985 INFO [train.py:996] (2/4) Epoch 12, batch 27750, loss[loss=0.2424, simple_loss=0.3089, pruned_loss=0.08794, over 21865.00 frames. ], tot_loss[loss=0.2285, simple_loss=0.3044, pruned_loss=0.07634, over 4276703.96 frames. ], batch size: 107, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:15:36,552 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2179146.0, ans=0.2 2023-06-26 08:15:47,636 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2179146.0, ans=0.125 2023-06-26 08:15:56,970 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.05 vs. limit=15.0 2023-06-26 08:16:13,786 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2179266.0, ans=0.2 2023-06-26 08:17:17,689 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.559e+02 9.501e+02 1.486e+03 2.223e+03 3.680e+03, threshold=2.972e+03, percent-clipped=21.0 2023-06-26 08:17:19,380 INFO [train.py:996] (2/4) Epoch 12, batch 27800, loss[loss=0.2498, simple_loss=0.3137, pruned_loss=0.093, over 21765.00 frames. ], tot_loss[loss=0.2282, simple_loss=0.3027, pruned_loss=0.07688, over 4285582.61 frames. 
], batch size: 441, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:17:28,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2179446.0, ans=0.1 2023-06-26 08:17:36,241 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2179506.0, ans=0.1 2023-06-26 08:17:49,782 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2179506.0, ans=0.125 2023-06-26 08:18:12,307 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2179566.0, ans=0.125 2023-06-26 08:18:24,618 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2179626.0, ans=0.125 2023-06-26 08:18:40,472 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.attention_skip_rate, batch_count=2179626.0, ans=0.0 2023-06-26 08:19:10,330 INFO [train.py:996] (2/4) Epoch 12, batch 27850, loss[loss=0.2297, simple_loss=0.2974, pruned_loss=0.08097, over 21583.00 frames. ], tot_loss[loss=0.2284, simple_loss=0.3014, pruned_loss=0.07774, over 4294707.96 frames. ], batch size: 131, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:19:38,563 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2179806.0, ans=0.125 2023-06-26 08:19:40,431 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2179806.0, ans=0.0 2023-06-26 08:19:45,671 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2179866.0, ans=0.125 2023-06-26 08:20:53,739 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.551e+02 9.151e+02 1.351e+03 2.039e+03 4.854e+03, threshold=2.703e+03, percent-clipped=13.0 2023-06-26 08:20:55,774 INFO [train.py:996] (2/4) Epoch 12, batch 27900, loss[loss=0.2813, simple_loss=0.3763, pruned_loss=0.09314, over 21775.00 frames. ], tot_loss[loss=0.2329, simple_loss=0.3095, pruned_loss=0.07811, over 4287096.28 frames. ], batch size: 351, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:20:56,524 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2180046.0, ans=0.1 2023-06-26 08:22:22,761 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.5.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:22:47,166 INFO [train.py:996] (2/4) Epoch 12, batch 27950, loss[loss=0.2273, simple_loss=0.3326, pruned_loss=0.06103, over 21197.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.3093, pruned_loss=0.07485, over 4284520.93 frames. ], batch size: 549, lr: 2.38e-03, grad_scale: 8.0 2023-06-26 08:23:23,711 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:23:46,805 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=4.45 vs. limit=12.0 2023-06-26 08:24:13,999 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.30 vs. 
limit=15.0 2023-06-26 08:24:25,151 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2180586.0, ans=0.125 2023-06-26 08:24:33,134 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.115e+02 7.857e+02 1.156e+03 1.726e+03 4.008e+03, threshold=2.312e+03, percent-clipped=7.0 2023-06-26 08:24:34,663 INFO [train.py:996] (2/4) Epoch 12, batch 28000, loss[loss=0.2286, simple_loss=0.3077, pruned_loss=0.07473, over 21796.00 frames. ], tot_loss[loss=0.227, simple_loss=0.3077, pruned_loss=0.07316, over 4275735.64 frames. ], batch size: 112, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:24:40,556 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=2180646.0, ans=0.2 2023-06-26 08:25:59,665 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=11.96 vs. limit=15.0 2023-06-26 08:26:05,738 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=2180886.0, ans=0.2 2023-06-26 08:26:22,256 INFO [train.py:996] (2/4) Epoch 12, batch 28050, loss[loss=0.2495, simple_loss=0.3268, pruned_loss=0.08609, over 21840.00 frames. ], tot_loss[loss=0.2276, simple_loss=0.3062, pruned_loss=0.07454, over 4281028.31 frames. ], batch size: 371, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:28:11,570 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.128e+02 1.251e+03 1.568e+03 2.445e+03 4.428e+03, threshold=3.136e+03, percent-clipped=24.0 2023-06-26 08:28:13,336 INFO [train.py:996] (2/4) Epoch 12, batch 28100, loss[loss=0.216, simple_loss=0.2834, pruned_loss=0.07435, over 21755.00 frames. ], tot_loss[loss=0.2265, simple_loss=0.304, pruned_loss=0.07451, over 4277112.50 frames. ], batch size: 107, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:28:20,948 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.balancer1.prob, batch_count=2181246.0, ans=0.125 2023-06-26 08:28:47,143 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2181306.0, ans=0.0 2023-06-26 08:29:25,377 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.42 vs. limit=8.0 2023-06-26 08:30:00,419 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.whiten, num_groups=1, num_channels=256, metric=3.47 vs. limit=12.0 2023-06-26 08:30:00,922 INFO [train.py:996] (2/4) Epoch 12, batch 28150, loss[loss=0.2192, simple_loss=0.2777, pruned_loss=0.08035, over 21582.00 frames. ], tot_loss[loss=0.2234, simple_loss=0.2983, pruned_loss=0.07423, over 4266116.99 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:30:39,911 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2181606.0, ans=0.2 2023-06-26 08:31:50,385 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.207e+02 9.538e+02 1.303e+03 2.099e+03 3.954e+03, threshold=2.605e+03, percent-clipped=5.0 2023-06-26 08:31:52,088 INFO [train.py:996] (2/4) Epoch 12, batch 28200, loss[loss=0.2429, simple_loss=0.3075, pruned_loss=0.08919, over 21289.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2981, pruned_loss=0.07661, over 4265015.51 frames. 
], batch size: 176, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:33:02,590 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=7.94 vs. limit=15.0 2023-06-26 08:33:24,205 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2182086.0, ans=0.125 2023-06-26 08:33:40,943 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2182086.0, ans=0.1 2023-06-26 08:33:42,406 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2182086.0, ans=0.2 2023-06-26 08:33:53,923 INFO [train.py:996] (2/4) Epoch 12, batch 28250, loss[loss=0.2218, simple_loss=0.2876, pruned_loss=0.07799, over 21672.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.301, pruned_loss=0.07833, over 4260949.27 frames. ], batch size: 298, lr: 2.38e-03, grad_scale: 16.0 2023-06-26 08:34:22,441 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.86 vs. limit=15.0 2023-06-26 08:34:32,365 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.ff3_skip_rate, batch_count=2182206.0, ans=0.0 2023-06-26 08:34:56,001 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass.skip_rate, batch_count=2182326.0, ans=0.09899494936611666 2023-06-26 08:34:56,033 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2182326.0, ans=0.125 2023-06-26 08:35:42,658 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.099e+02 9.506e+02 1.674e+03 2.975e+03 5.994e+03, threshold=3.348e+03, percent-clipped=30.0 2023-06-26 08:35:44,313 INFO [train.py:996] (2/4) Epoch 12, batch 28300, loss[loss=0.1714, simple_loss=0.2743, pruned_loss=0.03425, over 21791.00 frames. ], tot_loss[loss=0.225, simple_loss=0.2981, pruned_loss=0.07598, over 4258355.11 frames. ], batch size: 371, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:36:10,684 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=6.97 vs. limit=15.0 2023-06-26 08:36:34,216 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.2.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:36:52,932 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2182626.0, ans=0.1 2023-06-26 08:37:25,086 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.60 vs. limit=15.0 2023-06-26 08:37:36,907 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2182686.0, ans=0.0 2023-06-26 08:37:41,285 INFO [train.py:996] (2/4) Epoch 12, batch 28350, loss[loss=0.1962, simple_loss=0.2701, pruned_loss=0.0612, over 21627.00 frames. ], tot_loss[loss=0.2192, simple_loss=0.2959, pruned_loss=0.07121, over 4258382.96 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:37:47,491 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=4.91 vs. 
limit=8.0 2023-06-26 08:38:19,436 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2182866.0, ans=0.0 2023-06-26 08:38:40,941 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2182926.0, ans=0.0 2023-06-26 08:39:09,268 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.skip_rate, batch_count=2182986.0, ans=0.035 2023-06-26 08:39:27,865 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.238e+02 9.118e+02 1.283e+03 2.054e+03 3.751e+03, threshold=2.567e+03, percent-clipped=3.0 2023-06-26 08:39:29,436 INFO [train.py:996] (2/4) Epoch 12, batch 28400, loss[loss=0.2343, simple_loss=0.3041, pruned_loss=0.08219, over 21173.00 frames. ], tot_loss[loss=0.2176, simple_loss=0.293, pruned_loss=0.07113, over 4252279.82 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 32.0 2023-06-26 08:40:39,085 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2183226.0, ans=0.125 2023-06-26 08:41:22,193 INFO [train.py:996] (2/4) Epoch 12, batch 28450, loss[loss=0.2292, simple_loss=0.2981, pruned_loss=0.08012, over 21832.00 frames. ], tot_loss[loss=0.2239, simple_loss=0.2982, pruned_loss=0.07482, over 4257459.04 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:41:25,046 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=256, metric=6.92 vs. limit=15.0 2023-06-26 08:42:50,341 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2183526.0, ans=0.2 2023-06-26 08:43:03,803 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.scale_min, batch_count=2183586.0, ans=0.2 2023-06-26 08:43:11,122 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2183646.0, ans=0.125 2023-06-26 08:43:12,231 INFO [train.py:996] (2/4) Epoch 12, batch 28500, loss[loss=0.1894, simple_loss=0.2523, pruned_loss=0.06321, over 20127.00 frames. ], tot_loss[loss=0.2257, simple_loss=0.2989, pruned_loss=0.07622, over 4265053.99 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:43:13,807 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.783e+02 7.825e+02 1.127e+03 1.628e+03 3.596e+03, threshold=2.254e+03, percent-clipped=4.0 2023-06-26 08:43:18,065 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2183646.0, ans=0.1 2023-06-26 08:43:28,961 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=5.06 vs. 
limit=10.0 2023-06-26 08:43:29,978 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2183706.0, ans=0.125 2023-06-26 08:43:38,641 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2183706.0, ans=0.0 2023-06-26 08:43:48,994 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2183706.0, ans=0.125 2023-06-26 08:43:54,379 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.conv_skip_rate, batch_count=2183766.0, ans=0.0 2023-06-26 08:43:58,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2183766.0, ans=0.1 2023-06-26 08:44:47,711 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module1.whiten, num_groups=1, num_channels=192, metric=7.84 vs. limit=15.0 2023-06-26 08:44:55,208 INFO [train.py:996] (2/4) Epoch 12, batch 28550, loss[loss=0.2501, simple_loss=0.3426, pruned_loss=0.07875, over 21289.00 frames. ], tot_loss[loss=0.2333, simple_loss=0.3082, pruned_loss=0.07914, over 4276127.50 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:44:56,029 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2183946.0, ans=0.125 2023-06-26 08:45:04,836 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.scale_min, batch_count=2183946.0, ans=0.2 2023-06-26 08:46:14,127 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.27 vs. limit=12.0 2023-06-26 08:46:22,328 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass_mid.scale_min, batch_count=2184126.0, ans=0.2 2023-06-26 08:46:37,639 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2184186.0, ans=0.125 2023-06-26 08:46:39,356 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:46:45,256 INFO [train.py:996] (2/4) Epoch 12, batch 28600, loss[loss=0.2443, simple_loss=0.3228, pruned_loss=0.08293, over 20661.00 frames. ], tot_loss[loss=0.2382, simple_loss=0.3144, pruned_loss=0.08098, over 4276777.36 frames. ], batch size: 607, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:46:47,101 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.084e+02 9.095e+02 1.267e+03 2.017e+03 3.986e+03, threshold=2.534e+03, percent-clipped=14.0 2023-06-26 08:47:02,081 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.balancer2.prob, batch_count=2184246.0, ans=0.125 2023-06-26 08:47:04,141 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.14 vs. 
limit=15.0 2023-06-26 08:47:47,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.skip_rate, batch_count=2184366.0, ans=0.04949747468305833 2023-06-26 08:47:57,580 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2184366.0, ans=0.125 2023-06-26 08:48:39,285 INFO [train.py:996] (2/4) Epoch 12, batch 28650, loss[loss=0.2153, simple_loss=0.2888, pruned_loss=0.07085, over 20146.00 frames. ], tot_loss[loss=0.2339, simple_loss=0.308, pruned_loss=0.07989, over 4276852.35 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:48:57,729 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2184546.0, ans=0.1 2023-06-26 08:49:17,538 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2184606.0, ans=0.1 2023-06-26 08:49:43,634 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.prob, batch_count=2184666.0, ans=0.125 2023-06-26 08:50:36,131 INFO [train.py:996] (2/4) Epoch 12, batch 28700, loss[loss=0.2407, simple_loss=0.3227, pruned_loss=0.07941, over 21800.00 frames. ], tot_loss[loss=0.2352, simple_loss=0.3074, pruned_loss=0.08146, over 4273958.41 frames. ], batch size: 351, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:50:37,768 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.799e+02 8.894e+02 1.316e+03 2.106e+03 4.413e+03, threshold=2.633e+03, percent-clipped=13.0 2023-06-26 08:51:51,461 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.8.prob, batch_count=2185026.0, ans=0.125 2023-06-26 08:52:00,393 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2185086.0, ans=0.2 2023-06-26 08:52:08,497 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.conv.2.prob, batch_count=2185086.0, ans=0.125 2023-06-26 08:52:24,667 INFO [train.py:996] (2/4) Epoch 12, batch 28750, loss[loss=0.2146, simple_loss=0.2926, pruned_loss=0.06833, over 21860.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3076, pruned_loss=0.08163, over 4272570.61 frames. ], batch size: 124, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:52:35,443 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2185146.0, ans=0.125 2023-06-26 08:54:00,838 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2185386.0, ans=0.125 2023-06-26 08:54:13,705 INFO [train.py:996] (2/4) Epoch 12, batch 28800, loss[loss=0.2826, simple_loss=0.3491, pruned_loss=0.108, over 21890.00 frames. ], tot_loss[loss=0.2374, simple_loss=0.3112, pruned_loss=0.08185, over 4275845.20 frames. 
], batch size: 371, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 08:54:15,168 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.541e+02 8.442e+02 1.126e+03 1.600e+03 2.728e+03, threshold=2.251e+03, percent-clipped=1.0 2023-06-26 08:54:51,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward1.hidden_balancer.prob, batch_count=2185506.0, ans=0.125 2023-06-26 08:54:57,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2185566.0, ans=0.1 2023-06-26 08:55:29,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff3_skip_rate, batch_count=2185626.0, ans=0.0 2023-06-26 08:55:45,574 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2185686.0, ans=0.125 2023-06-26 08:55:48,945 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.1.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 08:55:54,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=2185686.0, ans=0.125 2023-06-26 08:56:00,420 INFO [train.py:996] (2/4) Epoch 12, batch 28850, loss[loss=0.2285, simple_loss=0.3021, pruned_loss=0.07747, over 21831.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3118, pruned_loss=0.08312, over 4278237.22 frames. ], batch size: 124, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:56:14,965 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.00 vs. limit=15.0 2023-06-26 08:56:44,945 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2185806.0, ans=0.1 2023-06-26 08:57:07,816 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward2.out_whiten.whitening_limit, batch_count=2185866.0, ans=15.0 2023-06-26 08:57:24,878 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.balancer1.prob, batch_count=2185926.0, ans=0.125 2023-06-26 08:57:30,481 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=6.04 vs. limit=15.0 2023-06-26 08:57:35,496 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2185986.0, ans=0.0 2023-06-26 08:57:57,019 INFO [train.py:996] (2/4) Epoch 12, batch 28900, loss[loss=0.277, simple_loss=0.3442, pruned_loss=0.1049, over 21775.00 frames. ], tot_loss[loss=0.2432, simple_loss=0.3158, pruned_loss=0.08532, over 4275085.98 frames. ], batch size: 298, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 08:58:06,909 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.674e+02 9.203e+02 1.221e+03 1.627e+03 4.762e+03, threshold=2.441e+03, percent-clipped=10.0 2023-06-26 08:58:19,329 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=4.17 vs. limit=5.0 2023-06-26 08:58:42,299 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=8.49 vs. 
limit=15.0 2023-06-26 08:59:41,986 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2186286.0, ans=0.125 2023-06-26 08:59:43,495 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_skip_rate, batch_count=2186286.0, ans=0.0 2023-06-26 08:59:49,860 INFO [train.py:996] (2/4) Epoch 12, batch 28950, loss[loss=0.228, simple_loss=0.3378, pruned_loss=0.05906, over 20769.00 frames. ], tot_loss[loss=0.2413, simple_loss=0.3148, pruned_loss=0.0839, over 4278434.61 frames. ], batch size: 608, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:00:00,794 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_skip_rate, batch_count=2186346.0, ans=0.0 2023-06-26 09:00:14,831 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.balancer_na.min_abs, batch_count=2186406.0, ans=0.02 2023-06-26 09:00:27,197 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2186406.0, ans=0.125 2023-06-26 09:00:45,451 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer2.prob, batch_count=2186466.0, ans=0.125 2023-06-26 09:01:24,361 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2186586.0, ans=0.125 2023-06-26 09:01:39,462 INFO [train.py:996] (2/4) Epoch 12, batch 29000, loss[loss=0.2713, simple_loss=0.3433, pruned_loss=0.09959, over 21287.00 frames. ], tot_loss[loss=0.2429, simple_loss=0.3192, pruned_loss=0.08326, over 4277253.99 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:01:42,714 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.937e+02 9.875e+02 1.439e+03 2.075e+03 4.824e+03, threshold=2.879e+03, percent-clipped=20.0 2023-06-26 09:03:08,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer1.max_abs, batch_count=2186826.0, ans=10.0 2023-06-26 09:03:09,981 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module1.balancer2.prob, batch_count=2186886.0, ans=0.125 2023-06-26 09:03:27,703 INFO [train.py:996] (2/4) Epoch 12, batch 29050, loss[loss=0.2217, simple_loss=0.2969, pruned_loss=0.07324, over 21441.00 frames. ], tot_loss[loss=0.2436, simple_loss=0.318, pruned_loss=0.08457, over 4275926.43 frames. ], batch size: 131, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:05:02,569 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.bypass_mid.scale_min, batch_count=2187186.0, ans=0.2 2023-06-26 09:05:16,296 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer2.min_abs, batch_count=2187246.0, ans=0.5 2023-06-26 09:05:17,253 INFO [train.py:996] (2/4) Epoch 12, batch 29100, loss[loss=0.1976, simple_loss=0.2618, pruned_loss=0.0667, over 21623.00 frames. ], tot_loss[loss=0.2366, simple_loss=0.3093, pruned_loss=0.08197, over 4272202.78 frames. 
], batch size: 298, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:05:20,348 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.010e+02 8.838e+02 1.250e+03 1.736e+03 3.044e+03, threshold=2.501e+03, percent-clipped=1.0 2023-06-26 09:05:56,220 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.attention_skip_rate, batch_count=2187306.0, ans=0.0 2023-06-26 09:06:04,665 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2187366.0, ans=0.125 2023-06-26 09:06:41,807 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=5.66 vs. limit=15.0 2023-06-26 09:06:44,501 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer1.min_positive, batch_count=2187426.0, ans=0.025 2023-06-26 09:06:49,761 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2187486.0, ans=0.0 2023-06-26 09:07:00,153 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer2.prob, batch_count=2187486.0, ans=0.125 2023-06-26 09:07:04,363 INFO [train.py:996] (2/4) Epoch 12, batch 29150, loss[loss=0.2548, simple_loss=0.3501, pruned_loss=0.07976, over 21622.00 frames. ], tot_loss[loss=0.2348, simple_loss=0.3086, pruned_loss=0.08051, over 4276327.51 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:08:13,955 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer1.max_abs, batch_count=2187666.0, ans=10.0 2023-06-26 09:08:16,092 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.feed_forward2.hidden_balancer.prob, batch_count=2187726.0, ans=0.125 2023-06-26 09:08:19,021 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2187726.0, ans=0.2 2023-06-26 09:08:34,801 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=4.39 vs. limit=15.0 2023-06-26 09:08:51,329 INFO [train.py:996] (2/4) Epoch 12, batch 29200, loss[loss=0.1971, simple_loss=0.2834, pruned_loss=0.05544, over 20030.00 frames. ], tot_loss[loss=0.2318, simple_loss=0.3041, pruned_loss=0.07976, over 4276547.18 frames. 
], batch size: 702, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:08:54,713 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.344e+02 9.543e+02 1.197e+03 1.931e+03 4.156e+03, threshold=2.395e+03, percent-clipped=13.0 2023-06-26 09:09:05,900 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2187846.0, ans=0.1 2023-06-26 09:10:21,713 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.skip_rate, batch_count=2188086.0, ans=0.09899494936611666 2023-06-26 09:10:21,727 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2188086.0, ans=0.0 2023-06-26 09:10:25,216 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass_mid.scale_min, batch_count=2188086.0, ans=0.2 2023-06-26 09:10:39,276 INFO [train.py:996] (2/4) Epoch 12, batch 29250, loss[loss=0.2123, simple_loss=0.3128, pruned_loss=0.05588, over 21751.00 frames. ], tot_loss[loss=0.2289, simple_loss=0.3037, pruned_loss=0.07706, over 4271048.69 frames. ], batch size: 352, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:11:34,295 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2188266.0, ans=0.125 2023-06-26 09:12:15,093 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass_mid.scale_min, batch_count=2188386.0, ans=0.2 2023-06-26 09:12:24,399 INFO [train.py:996] (2/4) Epoch 12, batch 29300, loss[loss=0.216, simple_loss=0.2876, pruned_loss=0.07226, over 21593.00 frames. ], tot_loss[loss=0.2293, simple_loss=0.3058, pruned_loss=0.07639, over 4274615.30 frames. ], batch size: 263, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:12:27,563 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.498e+02 8.933e+02 1.350e+03 1.847e+03 4.429e+03, threshold=2.700e+03, percent-clipped=13.0 2023-06-26 09:13:38,156 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2188626.0, ans=0.125 2023-06-26 09:13:40,130 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.ff3_skip_rate, batch_count=2188626.0, ans=0.0 2023-06-26 09:13:54,859 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn1.whiten.whitening_limit, batch_count=2188686.0, ans=22.5 2023-06-26 09:14:07,977 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff2_skip_rate, batch_count=2188686.0, ans=0.0 2023-06-26 09:14:12,345 INFO [train.py:996] (2/4) Epoch 12, batch 29350, loss[loss=0.2465, simple_loss=0.3404, pruned_loss=0.07634, over 21847.00 frames. ], tot_loss[loss=0.2268, simple_loss=0.3013, pruned_loss=0.07618, over 4279083.30 frames. ], batch size: 372, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:14:51,411 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.skip_rate, batch_count=2188806.0, ans=0.07 2023-06-26 09:15:13,086 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.bypass_mid.scale_min, batch_count=2188866.0, ans=0.2 2023-06-26 09:15:29,263 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=11.81 vs. 
limit=15.0 2023-06-26 09:15:37,011 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2188926.0, ans=0.125 2023-06-26 09:16:13,882 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2189046.0, ans=0.0 2023-06-26 09:16:14,809 INFO [train.py:996] (2/4) Epoch 12, batch 29400, loss[loss=0.1891, simple_loss=0.2828, pruned_loss=0.04769, over 21698.00 frames. ], tot_loss[loss=0.2228, simple_loss=0.2993, pruned_loss=0.07318, over 4262639.90 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:16:18,021 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.568e+02 9.109e+02 1.324e+03 1.907e+03 3.871e+03, threshold=2.647e+03, percent-clipped=5.0 2023-06-26 09:17:00,392 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass_mid.scale_min, batch_count=2189106.0, ans=0.2 2023-06-26 09:17:56,188 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.out_combiner.scale_min, batch_count=2189286.0, ans=0.2 2023-06-26 09:18:07,016 INFO [train.py:996] (2/4) Epoch 12, batch 29450, loss[loss=0.2741, simple_loss=0.3427, pruned_loss=0.1027, over 21725.00 frames. ], tot_loss[loss=0.2215, simple_loss=0.2975, pruned_loss=0.07277, over 4257505.86 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:18:37,684 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.bypass.scale_min, batch_count=2189406.0, ans=0.2 2023-06-26 09:18:56,543 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2189466.0, ans=0.0 2023-06-26 09:19:07,435 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=2189466.0, ans=0.125 2023-06-26 09:19:55,299 INFO [train.py:996] (2/4) Epoch 12, batch 29500, loss[loss=0.2463, simple_loss=0.3107, pruned_loss=0.09096, over 21627.00 frames. ], tot_loss[loss=0.2279, simple_loss=0.3028, pruned_loss=0.07656, over 4260740.08 frames. ], batch size: 471, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:20:06,875 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.147e+02 8.502e+02 1.287e+03 1.918e+03 4.991e+03, threshold=2.573e+03, percent-clipped=9.0 2023-06-26 09:20:09,494 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=15.94 vs. limit=15.0 2023-06-26 09:20:45,707 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=10.74 vs. limit=15.0 2023-06-26 09:20:53,351 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2189766.0, ans=0.1 2023-06-26 09:20:58,358 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2189826.0, ans=0.125 2023-06-26 09:21:28,722 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=192, metric=12.47 vs. limit=15.0 2023-06-26 09:21:45,268 INFO [train.py:996] (2/4) Epoch 12, batch 29550, loss[loss=0.2215, simple_loss=0.2883, pruned_loss=0.07736, over 21637.00 frames. ], tot_loss[loss=0.2295, simple_loss=0.303, pruned_loss=0.07802, over 4275386.94 frames. 
], batch size: 548, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:21:53,372 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_proj.dropout_p, batch_count=2189946.0, ans=0.1 2023-06-26 09:22:43,235 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.ff2_skip_rate, batch_count=2190066.0, ans=0.0 2023-06-26 09:22:56,445 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=7.80 vs. limit=15.0 2023-06-26 09:23:13,585 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=4.13 vs. limit=15.0 2023-06-26 09:23:20,109 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_skip_rate, batch_count=2190186.0, ans=0.0 2023-06-26 09:23:49,838 INFO [train.py:996] (2/4) Epoch 12, batch 29600, loss[loss=0.248, simple_loss=0.3275, pruned_loss=0.08421, over 21293.00 frames. ], tot_loss[loss=0.236, simple_loss=0.3107, pruned_loss=0.0807, over 4275385.15 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:23:54,357 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.1.conv_module2.whiten, num_groups=1, num_channels=192, metric=8.06 vs. limit=15.0 2023-06-26 09:23:55,089 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.495e+02 9.952e+02 1.397e+03 2.179e+03 6.553e+03, threshold=2.795e+03, percent-clipped=14.0 2023-06-26 09:24:56,784 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.bypass.scale_min, batch_count=2190426.0, ans=0.2 2023-06-26 09:25:01,123 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=5.27 vs. limit=15.0 2023-06-26 09:25:16,132 INFO [scaling.py:962] (2/4) Whitening: name=encoder_embed.convnext.out_whiten, num_groups=1, num_channels=128, metric=3.48 vs. limit=5.0 2023-06-26 09:25:22,324 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_module2.balancer2.prob, batch_count=2190486.0, ans=0.125 2023-06-26 09:25:32,971 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2190486.0, ans=0.2 2023-06-26 09:25:35,910 INFO [train.py:996] (2/4) Epoch 12, batch 29650, loss[loss=0.1728, simple_loss=0.2458, pruned_loss=0.04987, over 21241.00 frames. ], tot_loss[loss=0.232, simple_loss=0.3084, pruned_loss=0.07783, over 4277591.93 frames. ], batch size: 159, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:25:40,177 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2190546.0, ans=0.125 2023-06-26 09:25:59,102 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=6.90 vs. 
limit=15.0 2023-06-26 09:26:07,299 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2190606.0, ans=0.125 2023-06-26 09:26:51,544 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_skip_rate, batch_count=2190786.0, ans=0.0 2023-06-26 09:27:00,793 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.prob, batch_count=2190786.0, ans=0.125 2023-06-26 09:27:15,134 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.1.conv_module1.whiten, num_groups=1, num_channels=256, metric=4.22 vs. limit=15.0 2023-06-26 09:27:22,338 INFO [train.py:996] (2/4) Epoch 12, batch 29700, loss[loss=0.2533, simple_loss=0.3566, pruned_loss=0.075, over 21797.00 frames. ], tot_loss[loss=0.2317, simple_loss=0.3087, pruned_loss=0.07736, over 4277422.90 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:27:26,099 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2190846.0, ans=0.125 2023-06-26 09:27:27,211 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.270e+02 1.083e+03 1.930e+03 2.745e+03 6.322e+03, threshold=3.861e+03, percent-clipped=21.0 2023-06-26 09:28:20,725 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.min_positive, batch_count=2191026.0, ans=0.025 2023-06-26 09:28:26,084 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.nonlin_attention.balancer.prob, batch_count=2191026.0, ans=0.125 2023-06-26 09:28:33,002 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=8.61 vs. limit=15.0 2023-06-26 09:28:59,034 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module2.balancer1.prob, batch_count=2191086.0, ans=0.125 2023-06-26 09:28:59,098 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2191086.0, ans=0.125 2023-06-26 09:29:06,948 INFO [train.py:996] (2/4) Epoch 12, batch 29750, loss[loss=0.2399, simple_loss=0.3414, pruned_loss=0.06923, over 21710.00 frames. ], tot_loss[loss=0.2338, simple_loss=0.3138, pruned_loss=0.07692, over 4281102.23 frames. ], batch size: 389, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:30:07,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.conv_module1.balancer2.prob, batch_count=2191326.0, ans=0.125 2023-06-26 09:30:36,211 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.attention_skip_rate, batch_count=2191386.0, ans=0.0 2023-06-26 09:30:53,160 INFO [train.py:996] (2/4) Epoch 12, batch 29800, loss[loss=0.2155, simple_loss=0.2927, pruned_loss=0.06918, over 21887.00 frames. ], tot_loss[loss=0.2349, simple_loss=0.3146, pruned_loss=0.07761, over 4286531.48 frames. ], batch size: 332, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:30:54,320 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=10.86 vs. 
limit=15.0 2023-06-26 09:30:58,556 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.420e+02 9.303e+02 1.235e+03 1.702e+03 3.129e+03, threshold=2.469e+03, percent-clipped=0.0 2023-06-26 09:31:25,470 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2191566.0, ans=0.125 2023-06-26 09:31:28,756 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.self_attn_weights.pos_emb_skip_rate, batch_count=2191566.0, ans=0.0 2023-06-26 09:32:02,683 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=2191626.0, ans=0.125 2023-06-26 09:32:28,937 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2191686.0, ans=0.125 2023-06-26 09:32:37,767 INFO [train.py:996] (2/4) Epoch 12, batch 29850, loss[loss=0.1951, simple_loss=0.2723, pruned_loss=0.05896, over 21822.00 frames. ], tot_loss[loss=0.23, simple_loss=0.3094, pruned_loss=0.07528, over 4292725.31 frames. ], batch size: 282, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:32:46,918 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=8.70 vs. limit=15.0 2023-06-26 09:34:29,945 INFO [train.py:996] (2/4) Epoch 12, batch 29900, loss[loss=0.2316, simple_loss=0.3488, pruned_loss=0.05717, over 19798.00 frames. ], tot_loss[loss=0.2313, simple_loss=0.3091, pruned_loss=0.07682, over 4296369.07 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:34:34,063 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:34:35,015 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 6.187e+02 8.891e+02 1.162e+03 1.804e+03 4.059e+03, threshold=2.324e+03, percent-clipped=12.0 2023-06-26 09:34:52,829 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=7.33 vs. limit=15.0 2023-06-26 09:35:36,825 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.conv_module1.whiten, num_groups=1, num_channels=256, metric=12.12 vs. limit=15.0 2023-06-26 09:35:59,699 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=10.67 vs. limit=15.0 2023-06-26 09:36:19,871 INFO [train.py:996] (2/4) Epoch 12, batch 29950, loss[loss=0.285, simple_loss=0.3471, pruned_loss=0.1114, over 21340.00 frames. ], tot_loss[loss=0.2375, simple_loss=0.3137, pruned_loss=0.08059, over 4291888.81 frames. ], batch size: 548, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:37:46,681 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2192526.0, ans=0.1 2023-06-26 09:37:53,318 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.out_combiner.scale_min, batch_count=2192586.0, ans=0.2 2023-06-26 09:37:53,337 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.attention_skip_rate, batch_count=2192586.0, ans=0.0 2023-06-26 09:37:55,735 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.1.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.39 vs. 
limit=15.0 2023-06-26 09:38:08,271 INFO [train.py:996] (2/4) Epoch 12, batch 30000, loss[loss=0.2125, simple_loss=0.3076, pruned_loss=0.05868, over 21723.00 frames. ], tot_loss[loss=0.2392, simple_loss=0.3163, pruned_loss=0.081, over 4295739.36 frames. ], batch size: 247, lr: 2.37e-03, grad_scale: 32.0 2023-06-26 09:38:08,272 INFO [train.py:1019] (2/4) Computing validation loss 2023-06-26 09:38:26,479 INFO [train.py:1028] (2/4) Epoch 12, validation: loss=0.2465, simple_loss=0.3441, pruned_loss=0.07444, over 1796401.00 frames. 2023-06-26 09:38:26,480 INFO [train.py:1029] (2/4) Maximum memory allocated so far is 24834MB 2023-06-26 09:38:37,055 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.580e+02 9.008e+02 1.239e+03 1.696e+03 3.151e+03, threshold=2.479e+03, percent-clipped=8.0 2023-06-26 09:39:30,637 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.ff3_skip_rate, batch_count=2192766.0, ans=0.0 2023-06-26 09:39:32,175 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2192766.0, ans=0.1 2023-06-26 09:39:33,828 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.1.feed_forward2.hidden_balancer.prob, batch_count=2192766.0, ans=0.125 2023-06-26 09:40:00,852 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=2192886.0, ans=0.0 2023-06-26 09:40:26,439 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.bypass.skip_rate, batch_count=2192886.0, ans=0.035 2023-06-26 09:40:32,843 INFO [train.py:996] (2/4) Epoch 12, batch 30050, loss[loss=0.3256, simple_loss=0.4241, pruned_loss=0.1135, over 21468.00 frames. ], tot_loss[loss=0.2369, simple_loss=0.3181, pruned_loss=0.07786, over 4282142.28 frames. ], batch size: 507, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:40:56,701 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=8.57 vs. limit=15.0 2023-06-26 09:41:09,936 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=2193006.0, ans=0.125 2023-06-26 09:42:30,782 INFO [train.py:996] (2/4) Epoch 12, batch 30100, loss[loss=0.2385, simple_loss=0.3014, pruned_loss=0.08785, over 21340.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3169, pruned_loss=0.07699, over 4277048.62 frames. ], batch size: 131, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:42:42,349 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.763e+02 1.437e+03 2.326e+03 3.330e+03 7.267e+03, threshold=4.652e+03, percent-clipped=46.0 2023-06-26 09:43:37,850 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.balancer1.prob, batch_count=2193426.0, ans=0.125 2023-06-26 09:44:01,073 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2193486.0, ans=0.0 2023-06-26 09:44:19,956 INFO [train.py:996] (2/4) Epoch 12, batch 30150, loss[loss=0.2368, simple_loss=0.3144, pruned_loss=0.07965, over 21306.00 frames. ], tot_loss[loss=0.2354, simple_loss=0.3132, pruned_loss=0.07879, over 4274808.38 frames. ], batch size: 176, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:44:24,515 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=8.30 vs. 
limit=10.0 2023-06-26 09:45:12,815 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2193666.0, ans=0.0 2023-06-26 09:45:18,212 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.balancer1.prob, batch_count=2193726.0, ans=0.125 2023-06-26 09:45:46,808 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_skip_rate, batch_count=2193726.0, ans=0.0 2023-06-26 09:46:06,654 INFO [train.py:996] (2/4) Epoch 12, batch 30200, loss[loss=0.2519, simple_loss=0.3318, pruned_loss=0.08602, over 21140.00 frames. ], tot_loss[loss=0.2355, simple_loss=0.3159, pruned_loss=0.07758, over 4281242.63 frames. ], batch size: 143, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:46:08,980 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.feed_forward3.hidden_balancer.prob, batch_count=2193846.0, ans=0.125 2023-06-26 09:46:15,761 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.538e+02 9.559e+02 1.334e+03 1.956e+03 4.030e+03, threshold=2.668e+03, percent-clipped=0.0 2023-06-26 09:47:05,087 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff2_skip_rate, batch_count=2193966.0, ans=0.0 2023-06-26 09:47:38,522 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module2.balancer2.prob, batch_count=2194086.0, ans=0.125 2023-06-26 09:47:51,638 INFO [train.py:996] (2/4) Epoch 12, batch 30250, loss[loss=0.2511, simple_loss=0.3506, pruned_loss=0.07579, over 20012.00 frames. ], tot_loss[loss=0.2414, simple_loss=0.3235, pruned_loss=0.07965, over 4275974.76 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:48:00,505 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.bypass.scale_min, batch_count=2194146.0, ans=0.2 2023-06-26 09:48:03,651 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.self_attn_weights.pos_emb_skip_rate, batch_count=2194146.0, ans=0.0 2023-06-26 09:48:42,114 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.1.ff3_skip_rate, batch_count=2194266.0, ans=0.0 2023-06-26 09:48:48,818 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.bypass.scale_min, batch_count=2194266.0, ans=0.2 2023-06-26 09:49:05,530 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.conv_skip_rate, batch_count=2194326.0, ans=0.0 2023-06-26 09:49:07,623 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.bypass.skip_rate, batch_count=2194326.0, ans=0.04949747468305833 2023-06-26 09:49:39,819 INFO [train.py:996] (2/4) Epoch 12, batch 30300, loss[loss=0.208, simple_loss=0.2783, pruned_loss=0.06886, over 21930.00 frames. ], tot_loss[loss=0.239, simple_loss=0.3192, pruned_loss=0.07935, over 4270473.27 frames. 
], batch size: 113, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:49:48,248 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.277e+02 8.599e+02 1.166e+03 1.867e+03 3.678e+03, threshold=2.332e+03, percent-clipped=6.0 2023-06-26 09:50:00,715 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.ff3_skip_rate, batch_count=2194506.0, ans=0.0 2023-06-26 09:50:46,310 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.1.ff3_skip_rate, batch_count=2194566.0, ans=0.0 2023-06-26 09:51:17,668 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=2194686.0, ans=0.1 2023-06-26 09:51:23,337 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.4.encoder.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=4.42 vs. limit=15.0 2023-06-26 09:51:33,181 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.conv_module1.balancer2.prob, batch_count=2194746.0, ans=0.125 2023-06-26 09:51:33,267 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.conv_module2.balancer1.prob, batch_count=2194746.0, ans=0.125 2023-06-26 09:51:34,362 INFO [train.py:996] (2/4) Epoch 12, batch 30350, loss[loss=0.2553, simple_loss=0.3244, pruned_loss=0.09308, over 21370.00 frames. ], tot_loss[loss=0.2402, simple_loss=0.3186, pruned_loss=0.08086, over 4270357.83 frames. ], batch size: 194, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:51:35,023 INFO [scaling.py:1052] (2/4) WithLoss: name=encoder.encoders.3.encoder.layers.1.self_attn_weights, loss-sum=0.000e+00 2023-06-26 09:52:36,479 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.conv_module1.balancer1.prob, batch_count=2194866.0, ans=0.125 2023-06-26 09:53:10,579 INFO [train.py:996] (2/4) Epoch 12, batch 30400, loss[loss=0.2282, simple_loss=0.2845, pruned_loss=0.08597, over 20237.00 frames. ], tot_loss[loss=0.2358, simple_loss=0.3132, pruned_loss=0.07923, over 4264537.91 frames. ], batch size: 703, lr: 2.37e-03, grad_scale: 16.0 2023-06-26 09:53:18,333 INFO [optim.py:471] (2/4) Clipping_scale=2.0, grad-norm quartiles 5.851e+02 1.020e+03 1.381e+03 2.107e+03 4.576e+03, threshold=2.761e+03, percent-clipped=19.0 2023-06-26 09:53:57,350 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.balancer1.prob, batch_count=2195166.0, ans=0.125 2023-06-26 09:54:22,390 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn1.whiten, num_groups=1, num_channels=256, metric=8.59 vs. limit=22.5 2023-06-26 09:54:35,594 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.1.encoder.layers.1.ff3_skip_rate, batch_count=2195286.0, ans=0.0 2023-06-26 09:54:39,894 INFO [train.py:996] (2/4) Epoch 12, batch 30450, loss[loss=0.2709, simple_loss=0.3991, pruned_loss=0.07133, over 19781.00 frames. ], tot_loss[loss=0.2351, simple_loss=0.3143, pruned_loss=0.07793, over 4204775.34 frames. ], batch size: 702, lr: 2.37e-03, grad_scale: 8.0 2023-06-26 09:54:44,509 INFO [scaling.py:962] (2/4) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=2.53 vs. limit=12.0 2023-06-26 09:55:05,814 INFO [scaling.py:182] (2/4) ScheduledFloat: name=encoder.encoders.0.layers.0.nonlin_attention.balancer.max_positive, batch_count=2195406.0, ans=0.95 2023-06-26 09:55:52,894 INFO [train.py:1249] (2/4) Done!